javascript - How to get childNodes of a div in cheerio? - Stack Overflow

I want to get the first child node of a div using cheerio. I can get it with plain JavaScript DOM manipulation, but I can't get it with cheerio.

I have already tried this in dev tools and got the expected result, but I want to do it using cheerio.

javascript

document.querySelector('.title_wrapper .subtext').childNodes[0].textContent;

I want to get the text 'PG' from this element.

<div class="subtext">
    PG
    <span class="ghost">|</span>
    <time datetime="PT121M">
        2h 1min
    </time>
    <span class="ghost">|</span>
    <a href="/search/title?genres=action&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Action</a>,
    <a href="/search/title?genres=adventure&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Adventure</a>,
    <a href="/search/title?genres=fantasy&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Fantasy</a>
    <span class="ghost">|</span>
    <a href="/title/tt0076759/releaseinfo?ref_=tt_ov_inf" title="See more release dates">25 May 1977 (USA)</a>
</div>

Asked Jul 22, 2019 at 15:09 by Ashiqur Rahman Ashique; edited Jul 22, 2019 at 16:36 by Taylor A. Leach.
  • Can you please provide a minimal-reproducible-example? – Peter, Jul 22, 2019 at 15:54

4 Answers

You almost had it. Index the Cheerio selection with [0] to get the underlying node, then read its child nodes as you would in the DOM:

$('.subtext')[0].childNodes[0].nodeValue.trim()
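
Note that childNodes[0] is only the right node while the text comes first inside the div. As a minimal sketch, assuming the nodes expose the DOM-style childNodes, nodeType, and nodeValue properties that the one-liner above relies on, a hypothetical helper (firstNonEmptyText is not a Cheerio API) can return the first text node that is not just whitespace. The plain object below is only a stand-in for $('.subtext')[0]:

```javascript
// Hypothetical helper, not part of Cheerio: returns the trimmed value of the
// first direct text node that contains something other than whitespace.
function firstNonEmptyText(node) {
  const TEXT_NODE = 3; // DOM nodeType for text nodes
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType !== TEXT_NODE) continue;
    const text = child.nodeValue.trim();
    if (text) return text;
  }
  return null; // no non-blank text node among the direct children
}

// Stand-in for $('.subtext')[0], for illustration only.
const subtextStandIn = {
  childNodes: [
    { nodeType: 3, nodeValue: '\n    PG\n    ' },
    { nodeType: 1, nodeValue: null }, // <span class="ghost">
    { nodeType: 3, nodeValue: '\n    ' },
  ],
};

console.log(firstNonEmptyText(subtextStandIn)); // PG
```

With a real page, the call would be firstNonEmptyText($('.subtext')[0]).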

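For your specific case, you can also read the element's full text and split it on the pipe character. The same approach scales to extracting many records:

    var fullText = $('.subtext').text();
    // Returns (surrounding whitespace and line breaks omitted here):
    // PG|2h 1min|Action,Adventure,Fantasy|25 May 1977 (USA)

    var arrSplit = fullText.split('|').map(function (part) { return part.trim(); });
    // Splits on the pipe character ('|') into an array and trims each part
    // (line breaks inside the genres entry omitted here):
    // [ 'PG', '2h 1min', 'Action,Adventure,Fantasy', '25 May 1977 (USA)' ]

    var firstChildNode = arrSplit[0];
    // Gets the first field, which is the first child node's text in this specific case
    // PG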
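
Because .text() keeps the line breaks and indentation from the markup, the split parts usually need cleaning. A minimal sketch, assuming the raw text looks like what the question's markup would produce; splitFields is a hypothetical helper name and the sample string is typed by hand rather than fetched:

```javascript
// Hypothetical helper: splits text on a separator, collapses whitespace
// inside each part, and drops parts that end up empty.
function splitFields(rawText, separator = '|') {
  return rawText
    .split(separator)
    .map((part) => part.replace(/\s+/g, ' ').trim())
    .filter((part) => part.length > 0);
}

// Hand-written approximation of $('.subtext').text() for the question's markup.
const sampleText =
  '\n    PG\n    |\n    \n        2h 1min\n    \n    |\n' +
  '    Action,\n    Adventure,\n    Fantasy\n    |\n    25 May 1977 (USA)\n';

const fields = splitFields(sampleText);
console.log(fields[0]); // PG
console.log(fields[2]); // Action, Adventure, Fantasy
```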

You can clone the parent and then remove all the child elements, leaving only the text for you to select.

$(".title_wrapper .subtext")
  .clone()    //clone the element
  .children() //select all children
  .remove()   //remove all children
  .end()      //go back to selected element
  .text();    //get the text of element

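This is an old jQuery solution. Note that it keeps every direct text node of the div, so with the markup above the result also contains the commas between the genre links and the surrounding whitespace, not just 'PG'.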
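
For comparison, the same result can be computed without cloning by joining the direct text nodes. This is a rough sketch, assuming DOM-style childNodes, nodeType, and nodeValue properties on the node; directText and normalizeWhitespace are hypothetical helper names, not Cheerio or jQuery APIs:

```javascript
// Hypothetical helper: joins the values of the direct text-node children,
// skipping element children entirely.
function directText(node) {
  const TEXT_NODE = 3; // DOM nodeType for text nodes
  return Array.from(node.childNodes)
    .filter((child) => child.nodeType === TEXT_NODE)
    .map((child) => child.nodeValue)
    .join('');
}

// Collapses runs of whitespace so the result is easier to compare or split.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}
```

A call such as normalizeWhitespace(directText($('.title_wrapper .subtext')[0])) would then return the rating together with any stray separators that sit directly inside the div, so isolating 'PG' still takes one more step.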

Assuming .title_wrapper exists on your actual page, your code works:

console.log(document.querySelector('.subtext').childNodes[0].textContent);
<div class="subtext">
    PG
    <span class="ghost">|</span>
    <time datetime="PT121M">
        2h 1min
    </time>
    <span class="ghost">|</span>
    <a href="/search/title?genres=action&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Action</a>,
    <a href="/search/title?genres=adventure&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Adventure</a>,
    <a href="/search/title?genres=fantasy&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Fantasy</a>
    <span class="ghost">|</span>
    <a href="/title/tt0076759/releaseinfo?ref_=tt_ov_inf" title="See more release dates">25 May 1977 (USA)</a>
</div>

In Cheerio, it should also work:

const $ = cheerio.load(document.body.outerHTML);
console.log($(".subtext").contents().eq(0).text());
<div class="subtext">
    PG
    <span class="ghost">|</span>
    <time datetime="PT121M">
        2h 1min
    </time>
    <span class="ghost">|</span>
    <a href="/search/title?genres=action&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Action</a>,
    <a href="/search/title?genres=adventure&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Adventure</a>,
    <a href="/search/title?genres=fantasy&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Fantasy</a>
    <span class="ghost">|</span>
    <a href="/title/tt0076759/releaseinfo?ref_=tt_ov_inf" title="See more release dates">25 May 1977 (USA)</a>
</div>
<!-- Warning: don't use this in production, just for demos -->
<script src="https://bundle.run/[email protected]"></script>

If it doesn't, then the page probably has JS-driven behavior that adds the content after page load. By the time you run the code in dev tools, it's loaded, but Cheerio doesn't run JS. Servers may also serve different HTML responses for Node requests versus browser requests. You'll need to share a minimal, reproducible example with the actual page to get help beyond this.
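
One rough way to tell these cases apart is to check whether the class appears at all in the HTML your script receives, before any parsing. A minimal sketch, assuming the response body is already available as a string; staticHtmlHasClass is a hypothetical helper, the two sample strings are made up, and the regular expression is a diagnostic shortcut rather than an HTML parser:

```javascript
// Hypothetical diagnostic: reports whether any quoted class attribute in the
// raw HTML string lists the given class name. Unquoted attributes are ignored.
function staticHtmlHasClass(html, className) {
  const classAttr = /class\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  for (const match of html.matchAll(classAttr)) {
    const value = match[1] !== undefined ? match[1] : match[2];
    if (value.split(/\s+/).includes(className)) return true;
  }
  return false;
}

// Made-up responses: one with the content in the markup, one that is only a
// shell to be filled in by client-side JavaScript.
const served = '<div class="title_wrapper"><div class="subtext">PG</div></div>';
const shell = '<div id="root"></div><script src="app.js"></script>';

console.log(staticHtmlHasClass(served, 'subtext')); // true
console.log(staticHtmlHasClass(shell, 'subtext')); // false
```

If this returns false for the HTML your Node script downloaded while the element is visible in dev tools, the content is likely added by JavaScript or the server is sending a different response.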

See also:

  • How can I scrape pages with dynamic content using node.js?
  • How to get a text that's separated by different HTML tags in Cheerio
  • cheerio: Get normal + text nodes