Right now I'm working on a web crawler. It should parse some specific sites and write the output to an XML file. Up to this point that has been no problem: the crawler works and can be customized really quickly via a cfg file. I use Jsoup to parse the HTML content.
I just added a few more sites and noticed that I have a huge problem with HTML content that is created via JavaScript. Is there a way to make Jsoup support JavaScript, or at least to get the full HTML content I can see in my browser?
I already tried HtmlUnit, but it did not do well: it did not give me the content I get in my browser.
Sincerely,
Ogofo
Asked Sep 27, 2012 at 15:37 by Ogofo
1 Answer
Jsoup does not support JavaScript and does not emulate a browser, so forget about it if you need to execute JavaScript. In my experience HtmlUnit, which is a headless browser, has given me the best results (speaking only of Java frameworks).
One thing worth trying in HtmlUnit is changing the BrowserVersion (Chrome / Internet Explorer / Firefox) when creating the WebClient instance. Some sites respond differently depending on the browser, and sometimes just changing that value gives you the results you expect.
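To make the BrowserVersion suggestion concrete, here is a minimal sketch of how it could look. It assumes HtmlUnit 2.x (package `com.gargoylesoftware.htmlunit`; newer releases moved to `org.htmlunit`) and Jsoup on the classpath. The class name `RenderedPageFetcher`, the helper names, the 5-second wait, and the `https://example.com/` URL are all hypothetical choices for illustration, not something from the question. The idea is to let HtmlUnit run the scripts, take the resulting markup, and hand it to Jsoup so the existing parsing code can stay as it is.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class RenderedPageFetcher {

    // Browser profiles to try, in an arbitrary order (assumption: extend as needed).
    static final List<BrowserVersion> CANDIDATES =
            Arrays.asList(BrowserVersion.CHROME, BrowserVersion.FIREFOX);

    // Loads the URL with the given emulated browser, lets background
    // JavaScript run for up to waitMillis, and returns the resulting markup.
    static String fetchRenderedHtml(String url, BrowserVersion version, long waitMillis)
            throws IOException {
        try (WebClient client = new WebClient(version)) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setCssEnabled(false);
            // Many real-world pages contain script errors; do not abort on them.
            client.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = client.getPage(url);
            client.waitForBackgroundJavaScript(waitMillis);
            return page.asXml();
        }
    }

    // Hands the rendered markup to Jsoup so existing selectors keep working.
    static Document toJsoup(String renderedHtml, String baseUri) {
        return Jsoup.parse(renderedHtml, baseUri);
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ? args[0] : "https://example.com/"; // placeholder URL
        for (BrowserVersion version : CANDIDATES) {
            Document doc = toJsoup(fetchRenderedHtml(url, version, 5_000), url);
            System.out.println(version.getNickname() + ": "
                    + doc.select("a[href]").size() + " links found");
        }
    }
}
```

Comparing the output per browser profile is a quick way to see whether a site serves different content depending on the emulated browser. If the content is still missing, increasing the wait time is probably the next thing to try, since scripts that load data asynchronously may not have finished yet.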