I want to get the first child node of a div using Cheerio. I can get it with plain JavaScript DOM manipulation, but I can't get it with Cheerio.
I have already tried this in the browser dev tools and got the expected result, but I want the same thing using Cheerio:
document.querySelector('.title_wrapper .subtext').childNodes[0].textContent;
I want to get the text 'PG' from this element.
<div class="subtext">
PG
<span class="ghost">|</span>
<time datetime="PT121M">
2h 1min
</time>
<span class="ghost">|</span>
<a href="/search/title?genres=action&explore=title_type,genres&ref_=tt_ov_inf">Action</a>,
<a href="/search/title?genres=adventure&explore=title_type,genres&ref_=tt_ov_inf">Adventure</a>,
<a href="/search/title?genres=fantasy&explore=title_type,genres&ref_=tt_ov_inf">Fantasy</a>
<span class="ghost">|</span>
<a href="/title/tt0076759/releaseinfo?ref_=tt_ov_inf" title="See more release dates">25 May 1977 (USA)</a>
</div>
asked Jul 22, 2019 at 15:09 by Ashiqur Rahman Ashique
edited Jul 22, 2019 at 16:36 by Taylor A. Leach
- Can you please provide a minimal reproducible example? – Peter, commented Jul 22, 2019 at 15:54
4 Answers
You almost had it. Index the Cheerio selection with [0] to get the underlying node, then read its first child node:
$('.subtext')[0].childNodes[0].nodeValue.trim()
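For your specific markup, you can also take the full text of the element and split it on the pipe character. The same approach can be applied when extracting from many pages in bulk:
var fullText = $('.subtext').text();
// Returns the concatenated text, including the whitespace from the source, roughly:
// PG | 2h 1min | Action, Adventure, Fantasy | 25 May 1977 (USA)
var arrSplit = fullText.split('|').map(function (part) {
  return part.replace(/\s+/g, ' ').trim();
});
// Splits on the pipe character ('|') and normalizes whitespace in each part
// [ 'PG', '2h 1min', 'Action, Adventure, Fantasy', '25 May 1977 (USA)' ]
var firstChildNode = arrSplit[0];
// Gets the "first" childNode of this specific situation
// PG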
You can clone the parent and then remove all the child elements, leaving only the text for you to select.
$(".title_wrapper .subtext")
  .clone()    // clone the element
  .children() // select all children
  .remove()   // remove all children
  .end()      // go back to selected element
  .text();    // get the text of element
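This is an old jQuery solution. Note that it keeps all of the parent's direct text nodes, so with this markup the result also contains the commas between the genre links and the surrounding whitespace, not just PG.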
Assuming .title_wrapper exists on your actual page, your code works:
console.log(document.querySelector('.subtext').childNodes[0].textContent);
<div class="subtext">
PG
<span class="ghost">|</span>
<time datetime="PT121M">
2h 1min
</time>
<span class="ghost">|</span>
<a href="/search/title?genres=action&explore=title_type,genres&ref_=tt_ov_inf">Action</a>,
<a href="/search/title?genres=adventure&explore=title_type,genres&ref_=tt_ov_inf">Adventure</a>,
<a href="/search/title?genres=fantasy&explore=title_type,genres&ref_=tt_ov_inf">Fantasy</a>
<span class="ghost">|</span>
<a href="/title/tt0076759/releaseinfo?ref_=tt_ov_inf" title="See more release dates">25 May 1977 (USA)</a>
</div>
In Cheerio, it should also work:
const $ = cheerio.load(document.body.outerHTML);
console.log($(".subtext").contents().eq(0).text());
<div class="subtext">
PG
<span class="ghost">|</span>
<time datetime="PT121M">
2h 1min
</time>
<span class="ghost">|</span>
<a href="/search/title?genres=action&explore=title_type,genres&ref_=tt_ov_inf">Action</a>,
<a href="/search/title?genres=adventure&explore=title_type,genres&ref_=tt_ov_inf">Adventure</a>,
<a href="/search/title?genres=fantasy&explore=title_type,genres&ref_=tt_ov_inf">Fantasy</a>
<span class="ghost">|</span>
<a href="/title/tt0076759/releaseinfo?ref_=tt_ov_inf" title="See more release dates">25 May 1977 (USA)</a>
</div>
<!-- Warning: don't use this in production, just for demos -->
<script src="https://bundle.run/[email protected]"></script>
If it doesn't, then the page probably has JS-driven behavior that adds the content after page load. By the time you run the code in dev tools, it's loaded, but Cheerio doesn't run JS. Servers may also serve different HTML responses for Node requests versus browser requests. You'll need to share a minimal, reproducible example with the actual page to get help beyond this.
See also:
- How can I scrape pages with dynamic content using node.js?
- How to get a text that's separated by different HTML tags in Cheerio
- cheerio: Get normal + text nodes