I am required to use XPaths to select all links on a page, for then my Puppeteer app to click into and perform some actions. I am finding that the method (code below) is getting stuck sometimes and my crawler will be paused. Is there a better/different way of getting all links from an XPath? Or is there something in my code that is incorrect and could be pausing my app's progress?
try {
links = await this.getLinksFromXPathSelector(state);
} catch (e) {
console.log("error getting links");
return {...state, error: e};
}
Which calls:
async getLinksFromXPathSelector(state) {
const newPage = state.page
// console.log('links selector');
const links = await newPage.evaluate((mySelector) => {
let results = [];
let query = document.evaluate(mySelector,
document,
null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (let i=0, length=query.snapshotLength; i<length; ++i) {
results.push(query.snapshotItem(i).href);
}
return results;
}, state.linksSelector);
return links;
}
The XPath is in state.linksSelector
.
I am required to use XPaths to select all links on a page, for then my Puppeteer app to click into and perform some actions. I am finding that the method (code below) is getting stuck sometimes and my crawler will be paused. Is there a better/different way of getting all links from an XPath? Or is there something in my code that is incorrect and could be pausing my app's progress?
try {
links = await this.getLinksFromXPathSelector(state);
} catch (e) {
console.log("error getting links");
return {...state, error: e};
}
Which calls:
async getLinksFromXPathSelector(state) {
const newPage = state.page
// console.log('links selector');
const links = await newPage.evaluate((mySelector) => {
let results = [];
let query = document.evaluate(mySelector,
document,
null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (let i=0, length=query.snapshotLength; i<length; ++i) {
results.push(query.snapshotItem(i).href);
}
return results;
}, state.linksSelector);
return links;
}
The XPath is in state.linksSelector
.
1 Answer
Reset to default 6You can use page.$x()
to evaluate an XPath expression and obtain an ElementHandle
array. It may be appropriate to use page.waitForXPath()
beforehand to ensure that the elements specified by XPath string are added to the DOM.
Then you can pass the ElementHandle
array elements to the page context via page.evaluate()
and return an array containing the href
attribute values for each element.
const xpath_expression = '//a[@href]';
await page.waitForXPath(xpath_expression);
const links = await page.$x(xpath_expression);
const link_urls = await page.evaluate((...links) => {
return links.map(e => e.href);
}, ...links);
console.log(link_urls);