javascript - Parsing HTML page containing JS in Java - Stack Overflow

I am trying to parse a web page that contains some JavaScript. So far I have been using Jsoup to parse the HTML in Java, which works as expected, but I am unable to parse the JavaScript. Below is a snippet of the HTML page:

<script type="text/javascript"> 
var element = document.createElement("input"); 
element.setAttribute("type", "hidden");
element.setAttribute("value", "");
element.setAttribute("name", "AzPwXPs");
element.setAttribute("id", "AzPwXPs");
var foo = document.getElementById("dnipb"); 
foo.appendChild(element);
var element1 = document.createElement("input"); 
element1.setAttribute("type", "hidden");
element1.setAttribute("value", "6D6AB8AECC9B28235F1DE39D879537E1");
element1.setAttribute("name", "ZLZWNK");
element1.setAttribute("id", "ZLZWNK");
foo.appendChild(element1);
</script>

I want to read both values along with their name/id, so that after parsing I get the following results:

AzPwXPs=
ZLZWNK=6D6AB8AECC9B28235F1DE39D879537E1

How can I parse the page in this situation?

Asked May 1, 2013 at 10:45 by ravi; edited May 2, 2013 at 16:12 by Mike Samuel. 3 comments:
  • Jsoup only parses HTML. It cannot parse or run JS. – nhahtdh Commented May 1, 2013 at 10:47
  • @nhahtdh: Yeah, I know that. That is why I am stuck... :( But there must be some other way around it. – ravi Commented May 1, 2013 at 10:49
  • Run it through a JS parser? Or get a JS engine? (I actually have the same problem on a side project, but I never got my hands around it...) – nhahtdh Commented May 1, 2013 at 10:52

5 Answers

I have stumbled upon this question a few times while searching for a way to parse pages that use JavaScript, but the solutions provided are not perfect. I found a pure Java solution: use JBrowserDriver to load the page and Jsoup to parse the JavaScript-manipulated result.

Simple example:

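    // Imports needed by the snippet below
    import com.machinepublishers.jbrowserdriver.JBrowserDriver;
    import com.machinepublishers.jbrowserdriver.Settings;
    import com.machinepublishers.jbrowserdriver.Timezone;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    // JBrowserDriver part: load the page so that its JavaScript runs
    JBrowserDriver driver = new JBrowserDriver(Settings.builder()
            .timezone(Timezone.EUROPE_ATHENS)
            .build());
    try {
        driver.get(FETCH_URL);
        String loadedPage = driver.getPageSource();

        // Jsoup part: parse the HTML as it looks after the scripts ran
        Document document = Jsoup.parse(loadedPage);
        Elements elements = document.select("#nav-console span.data");

        log.info("Found element count: {}", elements.size());
    } finally {
        // Always shut the browser down, even if loading or parsing fails
        driver.quit();
    }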

I once had a similar situation, where I needed to find URLs in CSS files.

Put the JavaScript in a string and apply regular expressions:

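// Matches url(...) with optional single or double quotes around the URL
Pattern p = Pattern.compile("url\\(\\s*(['\"]?+)(.*?)\\1\\s*\\)");
Matcher m = p.matcher(content);
while (m.find()) {
    // group() is the whole url(...) match; group(2) is the bare URL
    String urlFound = m.group();
}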

Regards, Hugo Pedrosa
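
Applied to the script in the question, the regular-expression idea above could look roughly like the sketch below. It is not taken from any of the answers: the class name `HiddenInputExtractor` and the method `extract` are made up for illustration, and it assumes the script always uses the exact `variable.setAttribute("attr", "value")` form with double-quoted literals, as in the posted snippet. It does not parse or run JavaScript, so any other way of writing the script would defeat it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named here only for illustration.
public class HiddenInputExtractor {

    // Matches: <variable>.setAttribute("<attribute>", "<value>")
    // Assumption: both arguments are plain double-quoted string literals.
    private static final Pattern SET_ATTRIBUTE = Pattern.compile(
            "(\\w+)\\.setAttribute\\(\\s*\"([^\"]*)\"\\s*,\\s*\"([^\"]*)\"\\s*\\)");

    /** Returns name -> value for every JS variable that sets a "name" attribute. */
    public static Map<String, String> extract(String script) {
        Map<String, String> names = new LinkedHashMap<>();   // JS variable -> name attribute
        Map<String, String> values = new LinkedHashMap<>();  // JS variable -> value attribute

        Matcher m = SET_ATTRIBUTE.matcher(script);
        while (m.find()) {
            String variable = m.group(1);
            String attribute = m.group(2);
            String value = m.group(3);
            if (attribute.equals("name")) {
                names.put(variable, value);
            } else if (attribute.equals("value")) {
                values.put(variable, value);
            }
        }

        // Pair each name with the value set on the same JS variable.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : names.entrySet()) {
            result.put(entry.getValue(), values.getOrDefault(entry.getKey(), ""));
        }
        return result;
    }

    public static void main(String[] args) {
        // The script body from the question, as it would come out of the <script> element.
        String script = String.join("\n",
                "var element = document.createElement(\"input\");",
                "element.setAttribute(\"type\", \"hidden\");",
                "element.setAttribute(\"value\", \"\");",
                "element.setAttribute(\"name\", \"AzPwXPs\");",
                "element.setAttribute(\"id\", \"AzPwXPs\");",
                "var foo = document.getElementById(\"dnipb\");",
                "foo.appendChild(element);",
                "var element1 = document.createElement(\"input\");",
                "element1.setAttribute(\"type\", \"hidden\");",
                "element1.setAttribute(\"value\", \"6D6AB8AECC9B28235F1DE39D879537E1\");",
                "element1.setAttribute(\"name\", \"ZLZWNK\");",
                "element1.setAttribute(\"id\", \"ZLZWNK\");",
                "foo.appendChild(element1);");

        extract(script).forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

Running `main` prints `AzPwXPs=` followed by `ZLZWNK=6D6AB8AECC9B28235F1DE39D879537E1`, which is the result the question asks for. Grouping by JS variable rather than by position means the `type` and `id` calls in between are simply ignored.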

Selenium's WebDriver is fantastic: http://docs.seleniumhq.org/docs/03_webdriver.jsp

See this answer for an example of what you are trying to do: Using Selenium Web Driver to retrieve value of a HTML input

You can try using the jQuery library. It's much easier with it.

Once you've got the text content of the <script> element from JSoup, you can parse the JS using the Caja JS parser and then walk the parse tree to find what you're looking for.
