c# - Get HtmlDocument after javascript manipulations - Stack Overflow

In C#, using the System.Windows.Forms.HtmlDocument class (or another class that allows DOM parsing), is it possible to wait until a webpage finishes its javascript manipulations of the HTML before retrieving that HTML? Certain sites add innerhtml to pages through javascript, but those changes do not show up when I parse the HtmlElements of the HtmlDocument.

One possibility would be to update the HtmlDocument of the page after a second. Does anybody know how to do this?

Asked Oct 13, 2011 at 16:11 by carlbenson; edited Jan 29, 2014 at 2:02 by noseratio

5 Answers

Someone revived this question by posting what I think is an incorrect answer. So, here are my thoughts to address it.

Non-deterministically, it's possible to get close to finding out whether the page has finished its AJAX work. However, this depends completely on the logic of that particular page: some pages are perpetually dynamic.

To approach this, one can handle the DocumentCompleted event first, then asynchronously poll the WebBrowser.IsBusy property and monitor the current HTML snapshot of the page for changes, like below.

The complete sample can be found here.

// get the root element
var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];

// poll the current HTML for changes asynchronously
var html = documentElement.OuterHtml;
while (true)
{
    // wait asynchronously, this will throw if cancellation requested
    await Task.Delay(500, token); 

    // continue polling if the WebBrowser is still busy
    if (this.webBrowser.IsBusy)
        continue; 

    var htmlNow = documentElement.OuterHtml;
    if (html == htmlNow)
        break; // no changes detected, end the poll loop

    html = htmlNow;
}
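
The same polling idea can be pulled into a reusable helper. What follows is only a sketch under stated assumptions: `SnapshotPoller` and `WaitUntilStableAsync` are hypothetical names, not part of the linked sample, and the snapshot and busy checks are passed in as delegates so the loop does not depend on a particular control.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not from the original answer): waits until two
// consecutive snapshots are equal while the source reports it is not busy.
public static class SnapshotPoller
{
    public static async Task<string> WaitUntilStableAsync(
        Func<string> getSnapshot,   // e.g. () => documentElement.OuterHtml
        Func<bool> isBusy,          // e.g. () => webBrowser.IsBusy
        TimeSpan interval,
        CancellationToken token)
    {
        var last = getSnapshot();
        while (true)
        {
            // Throws OperationCanceledException if cancellation is requested.
            await Task.Delay(interval, token);

            // Keep waiting while the source is still busy.
            if (isBusy())
                continue;

            var current = getSnapshot();
            if (current == last)
                return current; // no change since the previous poll

            last = current;
        }
    }
}
```

Passing a token from `new CancellationTokenSource(TimeSpan.FromSeconds(10))` would put an upper bound on the wait, which matters for pages that never stop changing. When used with a WebBrowser, the call should be awaited on the UI thread so the delegates touch the control from the thread that owns it.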

In general, the answer is "no": unless a script on the page notifies your code in some way, you simply have to wait some time and then grab the HTML. Waiting a second after the document-ready notification will likely cover most sites (i.e. jQuery's $(code) cases).
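
For the case where a script on the page can notify your code, one possible route is the WebBrowser control's ObjectForScripting property, which exposes a COM-visible object to page script as window.external. The sketch below assumes you control the page (or can add script to it) so that it calls window.external.NotifyReady() once its DOM updates are done. `ScriptBridge`, `BridgeSetup` and `NotifyReady` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Windows.Forms;

// Hypothetical bridge object. It must be COM-visible so that page script
// can reach it through window.external.
[ComVisible(true)]
public class ScriptBridge
{
    public event EventHandler Ready;

    // Assumption: the page calls window.external.NotifyReady() itself
    // when it has finished its JavaScript manipulations.
    public void NotifyReady()
    {
        var handler = Ready;
        if (handler != null)
            handler(this, EventArgs.Empty);
    }
}

public static class BridgeSetup
{
    // Wires the bridge to a WebBrowser and hands the current body HTML
    // to the callback when the page signals that it is ready.
    public static void Attach(WebBrowser browser, Action<string> onHtmlReady)
    {
        var bridge = new ScriptBridge();
        bridge.Ready += (s, e) =>
        {
            var doc = browser.Document;
            if (doc != null && doc.Body != null)
                onHtmlReady(doc.Body.OuterHtml);
        };
        browser.ObjectForScripting = bridge;
    }
}
```

This only helps when the page cooperates. For third-party pages that never call into window.external, waiting and polling remain the fallback.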

You need to give the application a second to process the JavaScript. Simply halting the current thread will delay the JavaScript processing as well, so your document will still come up outdated.

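// Arguments from the last DocumentCompleted event, kept for the delayed handler.
WebBrowserDocumentCompletedEventArgs cachedLoadArgs;

private void TimerDone(object sender, EventArgs e)
{
    // One-shot timer: stop and dispose it before processing the page.
    var timer = (System.Windows.Forms.Timer)sender;
    timer.Stop();
    timer.Dispose();
    respondToPageLoaded(cachedLoadArgs); // your own method that reads the updated document
}

void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    cachedLoadArgs = e;

    // A WinForms timer ticks on the UI thread without blocking it,
    // so the page's scripts keep running while we wait.
    var timer = new System.Windows.Forms.Timer();
    timer.Interval = 1000; // milliseconds
    timer.Tick += TimerDone;
    timer.Start();
}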

What about using 'WebBrowser.Navigated' event?

I did it with WebBrowser; take a look at my class:

public class MYCLASSProduct: IProduct
{
    public string Name { get; set; }
    public double Price { get; set; }
    public string Url { get; set; }

    private WebBrowser _WebBrowser;
    private AutoResetEvent _lock;

    public void Load(string url)
    {
        _lock = new AutoResetEvent(false);
        this.Url = url;

        browserInitializeBecauseJavascriptLoadThePage();
    }

    private void browserInitializeBecauseJavascriptLoadThePage()
    {
        _WebBrowser = new WebBrowser();
        _WebBrowser.DocumentCompleted += webBrowser_DocumentCompleted;
        _WebBrowser.Dock = DockStyle.Fill;
        _WebBrowser.Name = "webBrowser";
        _WebBrowser.ScrollBarsEnabled = false;
        _WebBrowser.TabIndex = 0;
        _WebBrowser.Navigate(Url);

        Form form = new Form();
        form.Hide();
        form.Controls.Add(_WebBrowser);

        Application.Run(form);
        _lock.WaitOne();
    }

    private void webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
        hDocument.LoadHtml(_WebBrowser.Document.Body.OuterHtml);

        // SelectNodes returns null when nothing matches, so select a single node and check it.
        var priceNode = hDocument.DocumentNode.SelectSingleNode("//td[@class='ask']");
        if (priceNode != null)
            this.Price = Convert.ToDouble(priceNode.InnerText.Trim());

        _WebBrowser.FindForm().Close();
        _lock.Set();
    }
}

If you're trying to do this in a console application, you need to put this attribute above your Main method, because Windows needs to communicate with COM components:

[STAThread]
static void Main(string[] args)

I do not like this solution, but I don't think there is a better one.
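
If marking Main with [STAThread] is not an option (for example, when the code is called from a library), a rough alternative is to run the browser work on a dedicated thread whose apartment state is set to STA. This is a sketch, not part of the answer above: `StaRunner` is a hypothetical helper, and it assumes a Windows runtime, since apartment states are a COM concept.

```csharp
using System;
using System.Threading;

// Hypothetical helper: runs a delegate on a dedicated STA thread and
// returns its result to the caller.
public static class StaRunner
{
    public static T Run<T>(Func<T> work)
    {
        T result = default(T);
        Exception error = null;

        var thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; }
        });

        // The apartment state must be set before the thread is started.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join(); // block until the STA work has finished

        if (error != null)
            throw new InvalidOperationException("STA work failed.", error);
        return result;
    }
}
```

The delegate passed to `Run` would be the place to create the WebBrowser and run the message loop, for instance by constructing the product class above and calling its `Load` method there.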
