javascript - Fetch complete web page using java code - Stack Overflow

I want to implement a Java method that takes a URL as input and stores the entire web page on disk, including its CSS, images, and JavaScript (all related resources). I have used the jsoup HTML parser to fetch the HTML page. The only option I can think of is to fetch the page with jsoup, parse the HTML, convert relative paths to absolute paths, and then make further GET requests for the JavaScript, images, etc. and save them to disk. I have also read about the HtmlCleaner and HtmlUnit parsers, but I think in all these cases I would still have to parse the HTML content to find the image, CSS, and JavaScript files.

Is my thinking right, or is there an easier way to accomplish this task?

asked Apr 12, 2012 at 8:27 by Sachin Jain
  • I found some similar questions on SO, but this question is still unanswered :( – Sachin Jain Commented Apr 12, 2012 at 8:43
  • Your thinking is exactly right. You might like to look at some of the source code for Apache Nutch, which is a search engine. The indexing part fetches web pages, then scans them for links (and does a whole lot of other stuff too). The code that you want will be similar but not identical. – Dawood ibn Kareem Commented Apr 12, 2012 at 8:48
  • How did you fix this? Could you get what you wanted? – user3575963 Commented Dec 1, 2015 at 20:50
  • @Clara_57S Yes, I used jsoup and it solved the problem for me. – Sachin Jain Commented Dec 2, 2015 at 10:22
  • But it can't execute JavaScript. – user3575963 Commented Dec 2, 2015 at 10:47

3 Answers

Basically, you can do it with Jsoup:

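    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Fetch and parse the page; get() throws IOException on network failure.
    Document doc = Jsoup.connect("http://rabotalux.com.ua/vacancy/4f4f800c8bc1597dc6fc7aff").get();
    Elements links = doc.select("link[href]");     // stylesheets, icons, etc.
    Elements scripts = doc.select("script[src]");  // external scripts only; inline scripts have no src
    for (Element element : links) {
        // absUrl() resolves the attribute value against the document's base URI
        System.out.println(element.absUrl("href"));
    }
    for (Element element : scripts) {
        System.out.println(element.absUrl("src"));
    }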

And so on with images and all related resources.

But if the site creates some of its elements with JavaScript, jsoup will miss them, because it cannot execute JavaScript.
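Once the absolute URLs are known, each one still has to be written to disk. The following is only a minimal sketch of that step using the Java standard library. The class name `ResourceSaver`, both method names, and the `host/path` directory layout are assumptions made for illustration and do not come from the answer. Query strings are ignored when choosing the file name, which is a simplification.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: maps a resource URL to a local file and downloads it.
class ResourceSaver {

    // Chooses a local path of the form <outputDir>/<host>/<url path>.
    // URLs that end in "/" (or have no path) are stored as index.html.
    public static Path localPathFor(URI resource, Path outputDir) {
        String host = resource.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + resource);
        }
        String path = resource.getPath();
        if (path == null || path.isEmpty()) {
            path = "/index.html";
        } else if (path.endsWith("/")) {
            path = path + "index.html";
        }
        // Drop the leading slash so the path is resolved relative to the host directory.
        String relative = path.startsWith("/") ? path.substring(1) : path;
        Path root = outputDir.normalize();
        Path target = root.resolve(host).resolve(relative).normalize();
        // Refuse paths such as "/../../x" that would escape the output directory.
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException("Path escapes output directory: " + resource);
        }
        return target;
    }

    // Downloads the resource and writes it to the path chosen above.
    public static Path save(URI resource, Path outputDir) throws IOException {
        Path target = localPathFor(resource, outputDir);
        Files.createDirectories(target.getParent());
        try (InputStream in = resource.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

Calling `ResourceSaver.save(URI.create(element.absUrl("src")), Paths.get("out"))` for each element would then mirror the page's resources under `out/`. To make the saved page work offline, the HTML would also need its references rewritten to point at the local copies, which this sketch does not do.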

I encountered a similar problem a couple of years ago, and we used exactly the mechanism you are planning: parse the HTML content and convert relative paths to absolute paths. We also ran multiple threads simultaneously to retrieve the images, JavaScript, etc., as a performance optimization. I don't know whether it should be done the way we did it, but in the end it worked for us. :-)
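The two steps described above (resolving relative paths and fetching with several threads) could look roughly like the sketch below. It is not the code that answer refers to. The class `ParallelFetcher`, its method names, and the pluggable `fetcher` function are assumptions chosen so the example can run without network access. Note that `URI.create` rejects references containing illegal characters such as spaces, so real pages may need extra cleanup.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: resolves references and fetches them on a thread pool.
class ParallelFetcher {

    // Converts relative references (e.g. "../img/logo.png") to absolute URIs.
    public static List<URI> resolveAll(String pageUrl, List<String> refs) {
        URI base = URI.create(pageUrl);
        List<URI> resolved = new ArrayList<>();
        for (String ref : refs) {
            resolved.add(base.resolve(ref));
        }
        return resolved;
    }

    // Applies fetcher to every URL using a fixed-size thread pool and
    // returns the results in the same order as the input list.
    public static <T> Map<URI, T> fetchAll(List<URI> urls, Function<URI, T> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<URI, Future<T>> pending = new LinkedHashMap<>();
            for (URI url : urls) {
                Callable<T> task = () -> fetcher.apply(url);
                pending.put(url, pool.submit(task));
            }
            Map<URI, T> results = new LinkedHashMap<>();
            for (Map.Entry<URI, Future<T>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real downloader the `fetcher` argument would perform the HTTP request and write the file. Here it is left as a parameter so that the threading logic can be tried with any function. This version aborts on the first failed fetch, whereas a production tool would more likely record the error and continue.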

This GitHub project does this using jsoup. No need to write it again if it already exists!

EDIT: I made an improved version of this class and added new features.

It can:

  • Extract URLs from linked or inline CSS (e.g. for background images) and download and save those too.

  • Download all the files (images, scripts, etc.) using multiple threads.

  • Report details about progress and errors.

  • Fetch HTML frames embedded in the HTML document, including nested frames.
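To illustrate the first feature in the list above, here is a minimal sketch of pulling `url(...)` references out of CSS text with a regular expression. It is not taken from the GitHub project. The class `CssUrlExtractor` and its `extract` method are hypothetical names, and the regex is a simplification that ignores `@import "file.css"` without `url()`, escaped characters, and CSS comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds url(...) references in linked or inline CSS.
class CssUrlExtractor {

    // Matches url(...) with optional single or double quotes around the reference.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Returns the raw references in document order, skipping empty values
    // and data: URIs, which embed their content and need no download.
    public static List<String> extract(String css) {
        List<String> refs = new ArrayList<>();
        Matcher matcher = CSS_URL.matcher(css);
        while (matcher.find()) {
            String ref = matcher.group(2).trim();
            if (!ref.isEmpty() && !ref.startsWith("data:")) {
                refs.add(ref);
            }
        }
        return refs;
    }
}
```

The returned references are still relative to the stylesheet's own URL (or to the page URL for inline CSS), so they need to be resolved against that base before downloading.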

Some caveats:

  • Uses jsoup and OkHttp, so you need to have those libraries.

  • GPL licensed, for now anyway.
