javascript - cheerio find a text in a script tag - Stack Overflow

I want to extract the JavaScript inside a script tag.

This is the script tag:

<script>
  $(document).ready(function(){

    $("#div1").click(function(){
      $("#divcontent").load("ajax.content.php?p=0&cat=1");
    });

    $("#div2").click(function(){
      $("#divcontent").load("ajax.content.php?p=1&cat=1");
    });

  });
</script>

I have an array of ids like ['div1', 'div2'], and I need to extract the URL that each id's click handler loads. So if I call a function:

getUrlOf('div1');

it should return ajax.content.php?p=0&cat=1

asked Dec 18, 2018 at 19:00 by yozawiratama; edited Dec 18, 2018 at 21:59 by brooksrelyt
  • Whatever you're trying to do, this seems to be the wrong way of going about it. However the PHP file generates this inline code, you should be getting the links the same way the PHP source does, not by parsing inline JavaScript source to obtain hard-coded string values within event handlers. – Patrick Roberts Commented Dec 18, 2018 at 19:09

2 Answers

If you're using a newer version of cheerio (1.0.0-rc.2), you'll need to use .html() instead of .text() to read the contents of a script tag:

const cheerio = require('cheerio');
const $ = cheerio.load('<script>script one</script>  <script>  script two</script>');

// For the first script tag
console.log($('script').html());

// For all script tags
console.log($('script').map((idx, el) => $(el).html()).toArray());

https://github.com/cheeriojs/cheerio/issues/1050

With Cheerio, it is very easy to get the text of the script tag:

const cheerio = require('cheerio');
const $ = cheerio.load("the HTML of the webpage you are scraping");

// If there's only one <script>
console.log($('script').text());

// If there are multiple scripts (elem is a raw node, so wrap it with $() first)
$('script').each((idx, elem) => console.log($(elem).text()));

From here, you're really just asking "how do I parse a generic block of JavaScript and extract a list of links". I agree with Patrick above in the comments: you probably shouldn't. Can you craft a regex that will let you find each link in the script and deduce the page it links to? Yes. But very likely, if anything about this page changes, your script will immediately break. The author of the page might switch to inline <a> tags, refactor the code, use live events, etc.

Just be aware that relying on the exact contents of this script tag will make your application very brittle -- even more brittle than page scraping generally is.

Here's an example of a loose but effective regex:

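let html = "incoming html";
// The g flag is required: without it, exec() restarts from index 0 on
// every call, so the loop below would never end once a match is found.
let regex = /\$\("(#.+?)"\)\.click(?:.|\n)+?\.load\("(.+?)"/g;
let match;

while ((match = regex.exec(html)) !== null) {
    console.log(match[1] + ': ' + match[2]);
}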

In case you are new to regex: this expression contains two capture groups, in parentheses (the first is the div id, including the leading #; the second is the URL passed to .load), as well as a non-capturing group in the middle, which exists only to make sure the regex will continue through a line break. I say it's "loose" because the match it is looking for looks like this:

  • $("***").click***ignored chars***.load("***"

So, depending on how much JavaScript there is and how similar it is, you might have to tighten it up to avoid false positives.
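
To tie this back to the getUrlOf('div1') call in the question, here is a minimal sketch, assuming the script text has already been pulled out with cheerio (for example via $('script').html()). The names buildUrlMap, scriptSource and sampleScript are made up for illustration, and the pattern is a variant of the loose regex above with the # moved outside the first capture group so that the bare id is captured. It inherits all the brittleness described earlier.

```javascript
// Hypothetical helper: builds an id -> URL lookup from inline script text.
// scriptSource is assumed to be the string returned by $('script').html().
function buildUrlMap(scriptSource) {
  // Variant of the loose pattern; matchAll requires the g flag.
  const pattern = /\$\("#(.+?)"\)\.click(?:.|\n)+?\.load\("(.+?)"/g;
  const urls = new Map();
  for (const [, id, url] of scriptSource.matchAll(pattern)) {
    urls.set(id, url);
  }
  return urls;
}

// Returns the URL loaded by the click handler for the given id,
// or null when no handler for that id is found.
function getUrlOf(id, scriptSource) {
  const url = buildUrlMap(scriptSource).get(id);
  return url === undefined ? null : url;
}

// Sample input copied from the question's script tag.
const sampleScript = `
  $(document).ready(function(){
    $("#div1").click(function(){
      $("#divcontent").load("ajax.content.php?p=0&cat=1");
    });
    $("#div2").click(function(){
      $("#divcontent").load("ajax.content.php?p=1&cat=1");
    });
  });
`;

console.log(getUrlOf('div1', sampleScript));
```

With the sample script from the question, this logs ajax.content.php?p=0&cat=1, and an id with no matching click handler gives null. Building the map once and reusing it would be cheaper if you look up many ids.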
