php - Scrape web page data generated by javascript - Stack Overflow

My question is: how can I scrape data from this website: http://vtis.vn/index.aspx? The data is not shown until you click on, for example, "Danh sách chậm". I have looked into this carefully: clicking "Danh sách chậm" fires an onclick event that triggers several JavaScript functions, one of which fetches the data from the server and inserts it into a placeholder element. At that point you can inspect the data with something like Firefox, and it is displayed to visitors on the page. So again, how can this data be scraped programmatically?

I wrote a scraping function, but of course it does not get the data I want, because the data is not available until I click the "Danh sách chậm" button.

<?php
// Fetch the static HTML; the page's JavaScript is NOT executed here.
$Page = file_get_contents('http://vtis.vn/index.aspx');
$dom_document = new DOMDocument();
$dom_document->loadHTML($Page);
// Build the XPath object on the document that was just loaded.
$dom_xpath = new DOMXPath($dom_document);
$elements = $dom_xpath->query("*//td[@class='IconMenuColumn']");
foreach ($elements as $element) {
    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
        // Convert each child node's markup from the page's detected encoding for output.
        echo mb_convert_encoding($node->c14n(), 'iso-8859-1', mb_detect_encoding($Page, 'UTF-8', true));
    }
}
asked Sep 27, 2012 at 22:21 by Dung; edited Feb 20, 2020 at 20:21 by miken32
  • You probably need to use something like PhantomJS to "click" the button. Though you really should avoid "scraping" data at all costs. – Chad Commented Sep 27, 2012 at 22:23
  • I do not think your answer is helping, but thanks anyhow. I scrape the data because it belongs to the public, or taxpayers, and I also serve it to the public, just by a different means. – Dung Commented Sep 28, 2012 at 3:58
  • Then do what I suggested: use something like PhantomJS and script the browser. More than likely the data you are talking about is fetched via AJAX. You will have to simulate a click, wait for the AJAX to update the page, then scrape it. I don't know how that doesn't help. – Chad Commented Sep 29, 2012 at 22:50
  • Thanks, PhantomJS is a possible solution. – Dung Commented Oct 4, 2012 at 22:06
2 Answers

You need to look at PhantomJS.

From their site:

PhantomJS is a headless WebKit with JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

Using the API, you can script the "browser" to interact with that page and scrape the data you need. You can then do whatever you need with it, including passing it to a PHP script if necessary.


That being said, if at all possible try not to "scrape" the data. If there is an ajax call the page is making, maybe there is an API you can use instead? If not, maybe you can convince them to make one. That would of course be much easier and more maintainable than screen scraping.
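
If the browser's network panel shows the request that the click triggers, that request can sometimes be replayed directly from PHP without rendering the page at all. The following is only a minimal sketch under assumptions: it presumes the endpoint accepts a form-encoded POST, the function names are made up for illustration, and the URL and field names in the usage comment are hypothetical placeholders, not the real ones for this site. ASP.NET pages often also expect hidden fields such as __VIEWSTATE to be copied from the original page.

```php
<?php
// Build stream-context options that mimic a browser's XHR form POST.
// The header set here is an assumption about what the server checks.
function buildAjaxContextOptions(array $fields): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . "X-Requested-With: XMLHttpRequest\r\n",
            'content' => http_build_query($fields),
            'timeout' => 15,
        ],
    ];
}

// Send the POST and return the response body (HTML fragment, JSON, ...),
// or null if the request failed.
function fetchAjaxFragment(string $url, array $fields): ?string
{
    $context = stream_context_create(buildAjaxContextOptions($fields));
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Hypothetical usage (placeholder URL and field names):
// $fragment = fetchAjaxFragment('http://example.com/ajax-endpoint', ['tab' => 'slow']);
```

Keeping the option-building separate from the network call makes it easy to compare what is sent against the request recorded in the browser.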

First, you need PhantomJS. Suggested install method on Linux:

wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
tar xvf phantomjs-2.1.1-linux-x86_64.tar.bz2
cp phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin

Second, you need the php-phantomjs package. Assuming you have installed Composer:

composer require jonnyw/php-phantomjs

Or follow the php-phantomjs installation documentation.

Third, load the package in your script and, instead of using file_get_contents, load the page via PhantomJS:

<?php
require 'vendor/autoload.php';

use JonnyW\PhantomJs\Client;

$client = Client::getInstance();
// Point the client at the PhantomJS binary installed above.
$client->getEngine()->setPath('/usr/local/bin/phantomjs');

$request  = $client->getMessageFactory()->createRequest();
$response = $client->getMessageFactory()->createResponse();

$request->setMethod('GET');
$request->setUrl('https://www.your_page_embeded_ajax_request');

// PhantomJS loads the page, runs its JavaScript, and fills in $response.
$client->send($request, $response);

if ($response->getStatus() === 200) {
    // The rendered HTML is available via $response->getContent().
    echo "Do something here";
}
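
Once the rendered HTML is available as a string (with php-phantomjs, I believe that is what $response->getContent() returns), the XPath query from the question can be applied to it. A rough sketch, assuming the markup is UTF-8 and that the cells of interest still carry the IconMenuColumn class after rendering; extractCellText is a helper name made up for this example:

```php
<?php
// Return the trimmed text of every <td> whose class attribute equals $cssClass.
// Assumes $cssClass contains no single quote, since it is placed inside the XPath.
function extractCellText(string $html, string $cssClass): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings from imperfect real-world markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML encoding hint makes loadHTML treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $cells = $xpath->query("//td[@class='" . $cssClass . "']");

    $texts = [];
    foreach ($cells as $cell) {
        $texts[] = trim($cell->textContent);
    }
    return $texts;
}

// Example with an inline snippet standing in for the rendered page:
$sample = '<table><tr><td class="IconMenuColumn"> Danh sách chậm </td></tr></table>';
print_r(extractCellText($sample, 'IconMenuColumn'));
```

Returning plain strings rather than echoing converted markup keeps the Vietnamese text in UTF-8, so any conversion to another encoding can happen once, at output time.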