I'm new to Puppeteer and don't know its full potential. I have the following code that returns results from a scrape, but each row comes back as one long tab-delimited string. I'm trying to get proper JSON.
(async () => {
  const browser = await puppeteer.launch( {headless: true} );
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: 'networkidle0'});
  let data = await page.evaluate(() => {
    const table = Array.from(document.querySelectorAll('table[id="gvM"] > tbody > tr '));
    return table.map(td => td.innerText);
  })
  console.log(data);
})();
Here is the HTML table:
<table cellspacing="0" cellpadding="4" rules="all" border="1" id="gvM" >
<tr >
<th scope="col">#</th><th scope="col">Resource</th><th scope="col">EM #</th><th scope="col">CVO</th><th scope="col">Start</th><th scope="col">End</th><th scope="col">Status</th><th scope="col">Assignment</th><th scope="col"> </th>
</tr>
<tr >
<td>31</td><td>Smith</td><td>618</td><td align="center"><span class="aspNetDisabled"><input id="gvM_ctl00_0" type="checkbox" name="gvM$ctl02$ctl00" disabled="disabled" /></span></td><td> </td><td> </td><td>AVAILABLE EXEC</td><td style="width:800px;">6F</td><td align="center"></td>
</tr>
<tr style="background-color:LightGreen;">
<td>1</td><td>John</td><td>604</td><td align="center"><span class="aspNetDisabled"></span></td><td>1400</td><td>2200</td><td>AVAILABLE</td><td style="width:800px;"> </td><td align="center"></td>
</tr>
</table>
This is what I get:
[ '#\tResource\tEM #\tCVO\tStart\tEnd\tStatus\tAssignment\t ',
'31\tSmith\t618\t\t \t \tAVAILABLE EXEC\t6F\t',
'1\tJohn\t604\t\t1400\t2200\tAVAILABLE\t \t']
and this is what I want to get:
[{'#','Resource','EM', '#','CVO','Start','tEnd','Status', 'Assignment'},
{'31','Smith', '618',' ',' ',' ',' ','AVAILABLE EXEC','6F'},
{'1','John', '604',' ',' ','1400 ','2200','AVAILABLE', ' '}]
I applied the answer below, but I wasn't able to reproduce the results. Perhaps I'm doing something wrong. Could you explain how I'm messing up?
const context = document.querySelectorAll('table[id="gvM"] > tbody > tr ');
const query = (selector, context) => Array.from(context.querySelectorAll(selector));
console.log(
  query('tr', context).map(row =>
    query('td, th', row).map(cell =>
      cell.textContent))
);
What does this error mean?
(node:6204) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with. .catch(). (rejection id: 1)
(node:6204) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
The wanted JSON in your question is invalid. – Niloct, Mar 27, 2019 at 0:45
2 Answers
If you need an array of arrays from the table, you can try this approach: map all rows to an array of rows, and all cells inside each row element to an array of cells (this variant uses Array.from() with a mapping function as the second argument):
const data = await page.evaluate(
  () => Array.from(
    document.querySelectorAll('table[id="gvM"] > tbody > tr'),
    row => Array.from(row.querySelectorAll('th, td'), cell => cell.innerText)
  )
);
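If you would rather have objects keyed by the header row than an array of arrays, one possible follow-up step is to pair the first row with each of the remaining rows. This is only a sketch: `rowsToObjects` is a hypothetical helper name (not part of Puppeteer), it assumes the first row holds the column headers, and the sample data is modelled on the output shown in the question.

```javascript
// Hypothetical helper (not part of Puppeteer): turns an array of rows,
// where the first row holds the column headers, into an array of objects.
function rowsToObjects(rows) {
  const [header = [], ...body] = rows;
  // Trim header text; fall back to a positional key when a header cell is blank.
  const keys = header.map((name, i) => name.trim() || `col${i}`);
  return body.map((cells) =>
    Object.fromEntries(keys.map((key, i) => [key, (cells[i] ?? '').trim()]))
  );
}

// Sample input shaped like the array of arrays produced above.
const sample = [
  ['#', 'Resource', 'EM #', 'CVO', 'Start', 'End', 'Status', 'Assignment', ' '],
  ['31', 'Smith', '618', '', ' ', ' ', 'AVAILABLE EXEC', '6F', ''],
  ['1', 'John', '604', '', '1400', '2200', 'AVAILABLE', ' ', ''],
];

console.log(JSON.stringify(rowsToObjects(sample), null, 2));
```

This runs in Node on the value returned by page.evaluate(), so nothing here needs access to the page itself.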
I don't think this is related to Puppeteer but to the way you "iterate" over your <table>:
In your attempt, you're simply dumping the textual content of each entire row, which produces the result that you're observing. Instead, for each <tr> you need to get all its <td> (or <th>) elements:
const query = (selector, context) =>
  Array.from(context.querySelectorAll(selector));
console.log(
  query('tr', document).map(row =>
    query('td, th', row).map(cell =>
      cell.textContent))
)
<table>
<tr>
<th>col 1</th>
<th>col 2</th>
<th>col 3</th>
</tr>
<tr>
<td>a</td>
<td>b</td>
<td>c</td>
</tr>
<tr>
<td>x</td>
<td>y</td>
<td>z</td>
</tr>
</table>
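On the follow-up in the question: the warning means that a promise (here, the async function) rejected and nothing caught the error. A likely cause, though the message alone doesn't confirm it, is that the snippet ran in Node rather than in the page: `document` only exists inside page.evaluate(), and the NodeList returned by querySelectorAll() has no querySelectorAll() method of its own, so it can't be passed as `context`. A minimal sketch of one way to wire this up follows; `extractRows` and `scrapeTable` are hypothetical names, and `page` is assumed to be a Puppeteer page that has already navigated to the URL.

```javascript
// Runs inside the browser page (via page.evaluate), where `document` exists.
// For each matched row, collect the text of its header and data cells.
const extractRows = (selector) =>
  Array.from(document.querySelectorAll(selector), (row) =>
    Array.from(row.querySelectorAll('th, td'), (cell) => cell.textContent.trim())
  );

// Hypothetical helper: evaluates extractRows in the page and catches any
// rejection, so the real error message is logged instead of an
// UnhandledPromiseRejectionWarning.
async function scrapeTable(page, selector) {
  try {
    return await page.evaluate(extractRows, selector);
  } catch (err) {
    console.error('Scraping failed:', err.message);
    return [];
  }
}

// Usage, assuming `page` came from `await browser.newPage()` and
// `await page.goto(url)` has completed:
// const rows = await scrapeTable(page, 'table[id="gvM"] > tbody > tr');
```

Returning an empty array on failure is just one choice; rethrowing after logging would work as well if the caller should stop on errors.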