javascript - scraping text with cheerio - Stack Overflow

I am trying to scrape just Jung Ho Kang and 5 from this html and put it into an object. I want to exclude everything in the (R) and the SS.

<td id="lineup-table-top">
  <b class="text-muted pad-left-10">5</b>
  &nbsp;&nbsp;&nbsp;Jung Ho Kang 
  <small class="text-muted">(R)</small> 
  <small class="text-muted">SS</small>
</td>

Here is my code:

var someObjArr = [];

$('td#lineup-table-top').each(function(i, element){

    //Get the text from cheerio.
    var text = $(this).text();

    //if undefined, create the object inside of our array.
    if(someObjArr[i] == undefined){

        someObjArr[i] = {};
    };

    //Update the name property of our object with the text value.
    someObjArr[i].name = text;

    $('b.pad-left-10').each(function(i, element){

        //Get the text from cheerio.
        var text = $(this).text();

        //if undefined, create the object inside of our array.
        if(someObjArr[i] == undefined){

            someObjArr[i] = {};
        };

        //Update the batting property of our object with the text value.
        someObjArr[i].batting = text;
    });
});

The exact output from the code is as follows:

{ batting: '5',
  name: '5   Jung Ho Kang (R) SS 3B' }
{ name: '5   Jung Ho Kang (R) SS' },

The Expected output:

{ batting: '5',
  name: 'Jung Ho Kang' }

I don't know why it appears to be looping twice and I can't figure out how to isolate just the name without it having a class/id associated with it.

Any direction is enthusiastically appreciated.

Asked Aug 11, 2015 at 18:38 by CiscoKidx; edited Aug 11, 2015 at 18:55. Comments:
  • In your output you show values that aren't in the HTML you posted. Can you edit your question to include all of the HTML? – Jordan Running Commented Aug 11, 2015 at 18:51
  • @Jordan Done. My bad. – CiscoKidx Commented Aug 11, 2015 at 18:58

1 Answer

Looks like you want to scrape only the text nodes in the markup.

https://github.com/cheeriojs/cheerio/issues/359

I'm not sure if nodeType is supported yet, but you should try to use that first. (nodeType docs)

$('td#lineup-table-top').each(function(i, td){
    someObjArr[i] = someObjArr[i] || {};

    $(td).contents().each(function(j, element){
        // The <b class="pad-left-10"> child holds the batting order
        if ( $(element).hasClass('pad-left-10') ) {
            someObjArr[i].batting = $(element).text().trim();
        }

        // The raw text directly inside the cell is the player name;
        // whitespace-only text nodes between the tags are skipped
        if ( element.nodeType === 3 && $(element).text().trim() ) {
            someObjArr[i].name = $(element).text().trim();
        }
    });
});

If it's not supported, you can fall back to using element.type

if ( element.type === 'text' && $(element).text().trim() ) {
    someObjArr[i] = someObjArr[i] || {};
    someObjArr[i].name = $(element).text().trim();
}
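
To make the text-node filter concrete without depending on a particular cheerio version, here is a minimal sketch that runs in plain Node. It assumes the node shape that cheerio hands to its callbacks (a `type` of 'text' or 'tag', `data` on text nodes, `children` on tags). The `cell` tree is hand-built to mirror the question's markup, and `extractPlayer` is a hypothetical helper name, not part of cheerio.

```javascript
// Hand-built stand-in for the parsed <td>. The shape (type, data, attribs,
// children) is an assumption that mirrors the nodes cheerio exposes.
var cell = {
    type: 'tag',
    name: 'td',
    attribs: { id: 'lineup-table-top' },
    children: [
        { type: 'text', data: '\n  ' },
        { type: 'tag', name: 'b', attribs: { 'class': 'text-muted pad-left-10' },
          children: [{ type: 'text', data: '5' }] },
        { type: 'text', data: '\n  \u00a0\u00a0\u00a0Jung Ho Kang \n  ' },
        { type: 'tag', name: 'small', attribs: { 'class': 'text-muted' },
          children: [{ type: 'text', data: '(R)' }] },
        { type: 'text', data: ' \n  ' },
        { type: 'tag', name: 'small', attribs: { 'class': 'text-muted' },
          children: [{ type: 'text', data: 'SS' }] },
        { type: 'text', data: '\n' }
    ]
};

// Hypothetical helper: reads the batting order from the <b> child and the
// name from the first non-blank text node that is a direct child of the cell.
function extractPlayer(td) {
    var player = {};
    td.children.forEach(function (node) {
        if (node.type === 'tag' && node.name === 'b') {
            player.batting = node.children[0].data.trim();
        } else if (node.type === 'text' && node.data.trim() && !player.name) {
            // trim() also strips the non-breaking spaces left by &nbsp;
            player.name = node.data.trim();
        }
    });
    return player;
}

var player = extractPlayer(cell);
console.log(player);
```

Only direct children of the cell are inspected, so the text inside the two small tags never reaches the name. For the markup in the question this should give batting '5' and name 'Jung Ho Kang'.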

I used this in the past to scrape only the text within an entire page of markup.

// For each DOM element in the page
$('*').each(function(i, element) {
    // Scrape only the text nodes that are direct children of this element
    $(element).contents().each(function(j, node) {
        if (node.type === 'text') {
            // node.data holds the raw text; collect or process it here
        }
    });
});
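
The same idea can be sketched as a plain recursive walk, which may be easier to test in isolation. This is an illustration only: the node shape (`type`, `data`, `children`) is an assumption modelled on cheerio's parsed nodes, the `page` tree is hand-built, and `collectText` is a hypothetical helper name.

```javascript
// Hypothetical helper: walks a node tree depth-first and collects the trimmed
// contents of every non-blank text node.
function collectText(node, out) {
    out = out || [];
    if (node.type === 'text') {
        var text = node.data.trim();
        if (text) {
            out.push(text);
        }
    } else if (node.children) {
        node.children.forEach(function (child) {
            collectText(child, out);
        });
    }
    return out;
}

// Tiny hand-built tree standing in for a parsed page (assumed shape).
var page = {
    type: 'tag', name: 'td', children: [
        { type: 'tag', name: 'b', children: [{ type: 'text', data: '5' }] },
        { type: 'text', data: '\u00a0\u00a0\u00a0Jung Ho Kang ' },
        { type: 'tag', name: 'small', children: [{ type: 'text', data: '(R)' }] },
        { type: 'text', data: ' ' },
        { type: 'tag', name: 'small', children: [{ type: 'text', data: 'SS' }] }
    ]
};

var texts = collectText(page);
console.log(texts);
```

Unlike the direct-children filter, this walk descends into every tag, so it returns the text of the nested elements too, in document order.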