I am working on scraping a bunch of pages with Puppeteer. The content is not differentiated with classes/ids/etc. and is presented in a different order between pages. As such, I will need to select the elements based on their inner text. I have included a simplified sample html below:
<table>
<tr>
<th>Product name</th>
<td>Shakeweight</td>
</tr>
<tr>
<th>Product category</th>
<td>Exercise equipment</td>
</tr>
<tr>
<th>Manufacturer name</th>
<td>The Shakeweight Company</td>
</tr>
<tr>
<th>Manufacturer address</th>
<td>
<table>
<tr><td>123 Fake Street</td></tr>
<tr><td>Springfield, MO</td></tr>
</table>
</td>
</tr>
In this example, I would need to scrape the manufacturer name and manufacturer address. So I suppose I would need to select the appropriate tr based upon the inner text of the nested th and scrape the associated td within that same tr. Note that the order of the rows of this table is not always the same and the table contains many more rows than this simplified example, so I can't just select the 3rd and 4th td.
I have tried to select an element based on inner text using XPATH as below but it does not seem to be working:
var manufacturerName = document.evaluate("//th[text()='Manufacturer name']", document, null, XPathResult.ANY_TYPE, null)
This wouldn't even be the data I would need (it would be the td associated with this th), but I figured this would be step 1 at least. If someone could provide input on the strategy to select by inner text, or to select the td associated with this th, I'd really appreciate it.
I am working on scraping a bunch of pages with Puppeteer. The content is not differentiated with classes/ids/etc. and is presented in a different order between pages. As such, I will need to select the elements based on their inner text. I have included a simplified sample html below:
<table>
<tr>
<th>Product name</th>
<td>Shakeweight</td>
</tr>
<tr>
<th>Product category</th>
<td>Exercise equipment</td>
</tr>
<tr>
<th>Manufacturer name</th>
<td>The Shakeweight Company</td>
</tr>
<tr>
<th>Manufacturer address</th>
<td>
<table>
<tr><td>123 Fake Street</td></tr>
<tr><td>Springfield, MO</td></tr>
</table>
</td>
</tr>
In this example, I would need to scrape the manufacturer name and manufacturer address. So I suppose I would need to select the appropriate tr based upon the inner text of the nested th and scrape the associated td within that same tr. Note that the order of the rows of this table is not always the same and the table contains many more rows than this simplified example, so I can't just select the 3rd and 4th td.
I have tried to select an element based on inner text using XPATH as below but it does not seem to be working:
var manufacturerName = document.evaluate("//th[text()='Manufacturer name']", document, null, XPathResult.ANY_TYPE, null)
This wouldn't even be the data I would need (it would be the td associated with this th), but I figured this would be step 1 at least. If someone could provide input on the strategy to select by inner text, or to select the td associated with this th, I'd really appreciate it.
Share Improve this question edited Sep 24, 2020 at 17:59 MacGruber asked Sep 24, 2020 at 14:35 MacGruberMacGruber 531 silver badge4 bronze badges 1- Related: How to click on element with text in Puppeteer – ggorlen Commented Mar 5, 2023 at 6:15
4 Answers
Reset to default 4This is really an xpath question and isn't specific to puppeteer, so this question might also help, as you're going to need to find the <td>
that es after the <th>
you've found: XPath:: Get following Sibling
But your xpath does work for me. In Chrome DevTools on the page with the HTML in your question, run this line to query the document:
$x('//th[text()="Manufacturer name"]')
NOTE: $x()
is a helper function that only works in Chrome DevTools, though Puppeteer has a similar Page.$x
function.
That expression should return an array with one element, the <th>
with that text in the query. To get the <td>
next to it:
$x('//th[text()="Manufacturer name"]/following-sibling::td')
And to get its inner text:
$x('//th[text()="Manufacturer name"]/following-sibling::td')[0].innerText
Once you're able to follow that pattern you should be able to use similar strategies to get the data you want in puppeteer, similar to this:
const puppeteer = require('puppeteer');
const main = async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://127.0.0.1:8080/'); // <-- EDIT THIS
const mfg = await page.$x('//th[text()="Manufacturer name"]/following-sibling::td');
const prop = await mfg[0].getProperty('innerText');
const text = await prop.jsonValue();
console.log(text);
await browser.close();
}
main();
You can do something like this to get the data:
await page.goto(url, { waitUntil: 'networkidle2' }); // Go to webpage url
await page.waitFor('table'); //waitFor an element that contains the text
const textDataArr = await page.evaluate(() => {
const element = document.querySelector('table tbody tr:nth-child(3) td'); // select thrid row td element like so
return element && element.innerText; // will return text and undefined if the element is not found
});
console.log(textDataArr);
As per your use case explanation in the above answer, here is the logic for the use case:
await page.goto(url, { waitUntil: 'networkidle2' }); // Go to webpage url
await page.waitFor('table'); //waitFor an element that contains the text
const textDataArr = await page.evaluate(() => {
const trArr = Array.from(document.querySelectorAll('table tbody tr'));
//Find an index of a tr row where th innerText equals 'Manufacturer name'
let fetchValueRowIndex = trArr.findIndex((v, i) => {
const element = document.querySelector('table tbody tr:nth-child(i+1) th');
return element.innerText === 'Manufacturer name';
});
//If the findex is found return the innerText of td of the same row else returns undefined
return (fetchValueRowIndex > -1) ? document.querySelector(`table tbody tr:nth-child(${fetchValueRowIndex}+1) td`).innerText : undefined;
});
console.log(textDataArr);
A simple way to get those all at once:
let data = await page.evaluate(() => {
return [...document.querySelectorAll('tr')].reduce((acc, tr, i) => {
let cells = [...tr.querySelectorAll('th,td')].map(el => el.innerText)
acc[cells[0]] = cells[1]
return acc
}, {})
})