For a company project, I need to create a web scraping application with PHP and JavaScript (including jQuery) that will extract specific data from each page of our clients' websites. The scraping app needs to get two types of data for each page: 1) determine whether certain HTML elements with specific IDs are present, and 2) extract the value of a specific JavaScript variable. The JS variable name is the same on each page, but the value is usually different.
I believe I know how I can get the first data requirement: using the PHP file_get_contents() function to get each page's HTML and then use JavaScript/jQuery to parse that HTML and search for elements with specific IDs. However, I'm not sure how to get the 2nd piece of data - the JavaScript variable values. The JavaScript variable isn't even found within each page's HTML; instead, it is found in an external JavaScript file that is linked to the page. And even if the JavaScript were embedded in the page's HTML, I know that file_get_contents() would only extract the JavaScript code (and other HTML) and not any variable values.
Can anyone suggest a good approach to getting this variable value for each page of a given website?
EDIT: Just to clarify, I need the values of the JavaScript variables after the JavaScript code has been run. Is such a thing even possible?
asked May 10, 2011 at 14:12 by jake, edited May 10, 2011 at 15:38
- It's actually better to have the variable in the external JavaScript file. That way, you just need to look for the <script> tag, just like how you look for the first one, then get its src. From the src link, you'll need to create a new JS scraper. I don't know if someone here is able to create a regex for you. – dragonjet Commented May 10, 2011 at 14:17
- I see what you're saying, but that external javascript file won't run independent of the pages it is linked to. The variable value I am looking for changes based on the URL of the page the js file is linked to. – jake Commented May 10, 2011 at 14:24
- Are you saying that you require the values of some JS variables after some JS code has been run, or simply the initial values assigned? – jlb Commented May 10, 2011 at 15:32
- Correct, I need the values of the js variables after the js code has been run. – jake Commented May 10, 2011 at 15:36
- @Jake can you identify how the js file is normally laid out? Does it look like this:
var myValue = 'stringvalue';
because if so, a regex could be helpful here (dear god, what did I say? :p ) – jcolebrand Commented May 10, 2011 at 15:45
4 Answers
You say you need the value of the variable after the JS has executed. I assume it's always the same JS, with just initial variable values being the thing that changes. Your best bet is to port the JS to PHP, which lets you extract the initial JS variable values and then pretend you executed the JS.
Here's a function for extracting variable values from JavaScript:
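/**
 * Extracts a variable value given its name and type. Makes certain assumptions about the source,
 * i.e. can't handle strings with escaped quotes.
 *
 * @param string $jsText the JavaScript source
 * @param string $name the name of the variable
 * @param string $type the variable type, either 'string' (default), 'float' or 'int'
 * @return string|int|float|false the extracted value, or false if nothing matched or the type is unknown
 */
function extractVar($jsText, $name, $type = 'string') {
    if ($type == 'string') {
        // Opening quote, lazily captured contents, then the same quote character again.
        $valueMatch = '("|\')(.*?)\1';
    } else {
        // Greedy on purpose: a lazy quantifier at the end of the pattern would capture only one character.
        $valueMatch = '(-?[0-9.]+)';
    }
    // preg_quote() keeps characters such as "$" in the variable name from being read as regex syntax.
    $pattern = '/' . preg_quote($name, '/') . '\s*=\s*' . $valueMatch . '/';
    if (!preg_match($pattern, $jsText, $matches)) {
        return false; // no assignment to $name found
    }
    if ($type == 'string') {
        return $matches[2];
    } else if ($type == 'float') {
        return (float)$matches[1];
    } else if ($type == 'int') {
        return (int)$matches[1];
    } else {
        return false;
    }
}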
Presumably this is impossible because it seems so simple, but if it's your .js you're trying to detect, why not just have that .js do something to the page that a scrape can detect?
Use the JS to populate a tag like this somewhere (via element.innerHTML, presumably):
<span><!--Important js thing has been activated!--></span>
Edit: alternatively, maybe use document.write if the marker needs to be detectable on load.
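A minimal sketch of the marker idea, assuming the scraper sees the page after the script has run (raw HTML fetched without executing JS would not contain the marker). The names `buildMarker`, `readMarker` and the `scrape-value:` prefix are made up for illustration; in the page, the script would assign the output of `buildMarker` to some element's `innerHTML`.

```javascript
// Hypothetical helpers; none of these names come from the original post.
const MARKER_PREFIX = 'scrape-value:';

// Page side: wrap the variable's value in an HTML comment.
// encodeURIComponent turns ">" into %3E, so the value cannot close the comment early.
function buildMarker(value) {
  return '<!--' + MARKER_PREFIX + encodeURIComponent(String(value)) + '-->';
}

// Scraper side: pull the value back out of the rendered markup.
// Returns null when no marker is present.
function readMarker(html) {
  const match = /<!--scrape-value:([^>]*?)-->/.exec(html);
  return match ? decodeURIComponent(match[1]) : null;
}
```

In the page, this might be used as `someElement.innerHTML = buildMarker(theVariable)`, where `someElement` and `theVariable` stand in for whatever the real script uses.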
Can't you give your clients a JS script to include on their pages, and have that script send the info to your server?
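A rough sketch of what such a client-side script could look like, assuming the variable is a true global (declared with `var` or assigned on `window`) and that your server exposes some collection endpoint. The function name `collectPageReport`, the IDs `header` and `footer`, the variable name `trackingValue`, and the URL `https://example.com/collect` are all placeholders, not anything from the question.

```javascript
// Hypothetical collector: every name here is illustrative.
function collectPageReport(win, doc, elementIds, variableName) {
  const elements = {};
  for (const id of elementIds) {
    // getElementById returns null when no element has that ID.
    elements[id] = doc.getElementById(id) !== null;
  }
  return {
    url: win.location.href,
    elements,
    // Read at call time, so this is the value after earlier scripts have run.
    value: win[variableName],
  };
}

// Only attempt to send when running in a real browser.
if (typeof window !== 'undefined' && typeof navigator !== 'undefined' && navigator.sendBeacon) {
  window.addEventListener('load', () => {
    const report = collectPageReport(window, document, ['header', 'footer'], 'trackingValue');
    // Placeholder endpoint; replace with a URL on your own server.
    navigator.sendBeacon('https://example.com/collect', JSON.stringify(report));
  });
}
```

Passing `win` and `doc` in as arguments keeps the collector testable outside a browser, and it covers both requirements from the question (element presence and the post-execution variable value) in one report.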
You may be able to use Zombie.js, a Node.js library: http://zombie.labnotes/
It can click links, walk the dom tree, and should be able to parse JS since it is JavaScript that's running it all.
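To illustrate the underlying idea (letting JavaScript run the script and then reading the variable), here is a small sketch that uses Node's built-in `vm` module rather than Zombie.js itself. It is not a headless browser: there is no DOM, and the only page state it fakes is a stub `location` object, so it only fits scripts that derive the value from the URL. The function name `readVarAfterRun` is made up for this example.

```javascript
const vm = require('vm');

// Runs scriptSource in an isolated context with a stub `location`,
// then returns the named global. Only `var` declarations and `window.x = ...`
// assignments end up on the sandbox; `let` and `const` do not.
function readVarAfterRun(scriptSource, pageUrl, variableName) {
  const url = new URL(pageUrl);
  const sandbox = {
    location: { href: url.href, hostname: url.hostname, pathname: url.pathname },
  };
  // Let the script refer to `window` as it would in a browser.
  sandbox.window = sandbox;
  vm.createContext(sandbox);
  // The timeout guards against scripts that loop forever.
  vm.runInContext(scriptSource, sandbox, { timeout: 1000 });
  return sandbox[variableName];
}

// Example: a script whose variable depends on the page URL.
const exampleScript = "var pageKey = 'page:' + location.pathname;";
console.log(readVarAfterRun(exampleScript, 'http://example.com/about', 'pageKey'));
```

Scripts that touch the DOM or load other resources would need a real headless browser such as the one suggested above.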