javascript - Puppeteer - how to select an element based on its inner text? - Stack Overflow

I am working on scraping a bunch of pages with Puppeteer. The content is not differentiated with classes/ids/etc. and is presented in a different order between pages. As such, I will need to select the elements based on their inner text. I have included a simplified sample html below:

<table>
<tr>
    <th>Product name</th>
    <td>Shakeweight</td>
</tr>
<tr>
    <th>Product category</th>
    <td>Exercise equipment</td>
</tr>
<tr>
    <th>Manufacturer name</th>
    <td>The Shakeweight Company</td>
</tr>
<tr>
    <th>Manufacturer address</th>
    <td>
        <table>
            <tr><td>123 Fake Street</td></tr>
            <tr><td>Springfield, MO</td></tr>
        </table>
    </td>
</tr>
</table>

In this example, I would need to scrape the manufacturer name and manufacturer address. So I suppose I would need to select the appropriate tr based upon the inner text of the nested th and scrape the associated td within that same tr. Note that the order of the rows of this table is not always the same and the table contains many more rows than this simplified example, so I can't just select the 3rd and 4th td.

I have tried to select an element based on inner text using XPATH as below but it does not seem to be working:

var manufacturerName = document.evaluate("//th[text()='Manufacturer name']", document, null, XPathResult.ANY_TYPE, null)

This wouldn't even be the data I would need (it would be the td associated with this th), but I figured this would be step 1 at least. If someone could provide input on the strategy to select by inner text, or to select the td associated with this th, I'd really appreciate it.

Asked Sep 24, 2020 at 14:35 by MacGruber; edited Sep 24, 2020 at 17:59
  • Related: How to click on element with text in Puppeteer – ggorlen Commented Mar 5, 2023 at 6:15

4 Answers

This is really an XPath question and isn't specific to Puppeteer, so this question might also help, as you're going to need to find the <td> that comes after the <th> you've found: XPath:: Get following Sibling

But your xpath does work for me. In Chrome DevTools on the page with the HTML in your question, run this line to query the document:

$x('//th[text()="Manufacturer name"]')

NOTE: $x() is a helper function that only works in Chrome DevTools, though Puppeteer has a similar Page.$x function.

That expression should return an array with one element, the <th> with that text in the query. To get the <td> next to it:

$x('//th[text()="Manufacturer name"]/following-sibling::td')

And to get its inner text:

$x('//th[text()="Manufacturer name"]/following-sibling::td')[0].innerText
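The expressions above embed the label text directly, and an exact text()= comparison will not match if the cell has leading or trailing whitespace. As a minimal sketch, the helper below builds the same kind of expression for any label. The names xpathLiteral and valueCellXPath are hypothetical, and normalize-space() is swapped in for text() on the assumption that stray whitespace around the label should be tolerated.

```javascript
// Quote a string as an XPath 1.0 literal. XPath 1.0 has no escape character,
// so text containing both quote types is assembled with concat().
function xpathLiteral(text) {
  if (!text.includes('"')) return `"${text}"`;
  if (!text.includes("'")) return `'${text}'`;
  const parts = text.split('"').map((part) => `"${part}"`);
  return `concat(${parts.join(", '\"', ")})`;
}

// Hypothetical helper: XPath for the <td> cells that follow a <th>
// whose whitespace-normalized text equals `label`.
function valueCellXPath(label) {
  return `//th[normalize-space()=${xpathLiteral(label)}]/following-sibling::td`;
}

console.log(valueCellXPath('Manufacturer name'));
// prints: //th[normalize-space()="Manufacturer name"]/following-sibling::td
```

The resulting string can be pasted into $x() in DevTools in place of the hand-written expression.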

Once you're able to follow that pattern you should be able to use similar strategies to get the data you want in puppeteer, similar to this:

const puppeteer = require('puppeteer');

const main = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://127.0.0.1:8080/');  // <-- EDIT THIS

  const mfg = await page.$x('//th[text()="Manufacturer name"]/following-sibling::td');
  const prop = await mfg[0].getProperty('innerText');
  const text = await prop.jsonValue();
  console.log(text);

  await browser.close();
}

main();
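One likely reason the document.evaluate call in the question appeared not to work: with XPathResult.ANY_TYPE it returns an XPathResult object, not an element, so the node still has to be pulled out of the result. The sketch below requests the first matching node and reads singleNodeValue. The function name firstTextByXPath is hypothetical, and the result type is written as the number 9 (the value of XPathResult.FIRST_ORDERED_NODE_TYPE) so that the function also loads outside a browser. This approach may also be useful with Puppeteer versions where page.$x is not available.

```javascript
// Sketch: read the text of the first node an XPath expression matches.
// `doc` is assumed to be a DOM Document (in the browser, pass `document`).
function firstTextByXPath(doc, expression) {
  // 9 is the numeric value of XPathResult.FIRST_ORDERED_NODE_TYPE; the literal
  // keeps this function loadable where the XPathResult global does not exist.
  const FIRST_ORDERED_NODE_TYPE = 9;
  const result = doc.evaluate(expression, doc, null, FIRST_ORDERED_NODE_TYPE, null);
  const node = result.singleNodeValue; // null when nothing matched
  return node ? node.textContent.trim() : null;
}
```

In Puppeteer, the body of this function would need to run inside a page.evaluate callback with document as doc, because functions defined in the Node process are not visible in the page.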

You can do something like this to get the data:

await page.goto(url, { waitUntil: 'networkidle2' }); // Go to the webpage URL

await page.waitForSelector('table'); // Wait for the element that contains the text

const textDataArr = await page.evaluate(() => {
    const element = document.querySelector('table tbody tr:nth-child(3) td'); // Select the td in the third row
    return element && element.innerText; // Returns the text, or null if the element is not found
});
console.log(textDataArr);

As per your use case explanation in the above answer, here is the logic for the use case:

await page.goto(url, { waitUntil: 'networkidle2' }); // Go to the webpage URL

await page.waitForSelector('table'); // Wait for the element that contains the text

const textDataArr = await page.evaluate(() => {
    const trArr = Array.from(document.querySelectorAll('table tbody tr'));

    // Find the row whose th innerText equals 'Manufacturer name'.
    // Rows without a th (such as those in the nested address table) are skipped.
    const row = trArr.find((tr) => {
        const th = tr.querySelector('th');
        return th !== null && th.innerText.trim() === 'Manufacturer name';
    });

    // If the row is found, return the innerText of its td; otherwise return undefined
    return row ? row.querySelector('td').innerText : undefined;
});
console.log(textDataArr);

A simple way to get those all at once:

let data = await page.evaluate(() => {
  // Build an object that maps each row's first cell (the label) to its second cell (the value)
  return [...document.querySelectorAll('tr')].reduce((acc, tr) => {
    let cells = [...tr.querySelectorAll('th,td')].map(el => el.innerText)
    acc[cells[0]] = cells[1]
    return acc
  }, {})
})
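The object built this way holds every row, while the question only needs two of them. As a rough sketch, the function below picks the wanted labels out of such an object and collapses line breaks in multi-line values like the address. The name pickFields is hypothetical, and the sample object is an assumption about what the page would yield, not captured output.

```javascript
// Hypothetical helper: keep only the requested labels from a label -> value map.
function pickFields(data, labels) {
  const picked = {};
  for (const label of labels) {
    const raw = data[label];
    // Collapse runs of whitespace (including newlines) into single spaces;
    // labels that are missing or not strings become null.
    picked[label] = typeof raw === 'string' ? raw.replace(/\s+/g, ' ').trim() : null;
  }
  return picked;
}

// Assumed sample input, modeled on the HTML in the question.
const sample = {
  'Product name': 'Shakeweight',
  'Manufacturer name': 'The Shakeweight Company',
  'Manufacturer address': '123 Fake Street\nSpringfield, MO',
};

console.log(pickFields(sample, ['Manufacturer name', 'Manufacturer address']));
```

Returning null for a missing label makes it easy to spot pages where the expected row was absent.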