
Parsing JavaScript Web Page in C# with AngleSharp - Stack Overflow


The web page uses JavaScript to build its HTML, so I need an HTML parser with JS support.
I found AngleSharp, but I can't get it working.

using AngleSharp;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;

namespace AngleSharpScraping
{
    class Program
    {
        static void Main(string[] args)
        {
            GetMkvToolNix();
            Console.ReadKey();
        }

        static async void GetMkvToolNix()
        {
            // Create a new configuration with javascript interpreter.
            var config = new Configuration().WithJavaScript();

            // Parsing process.
            var document = await BrowsingContext.New(config).OpenAsync(Url.Create("http://www.fosshub./MKVToolNix.html"));
            var link = document.QuerySelector("body > div.container.page-content > div > div.col-sm-9 > article > div.main-dl-box > p:nth-child(2) > a.dwl-link.xlink").GetAttribute("data");

            Console.WriteLine(link);
        }
    }
}
Asked Jun 7, 2015 at 17:43 by baltazer; edited Sep 6, 2015 at 20:02 by Lucas Trzesniewski.

Comments:
  • You may want to look into PhantomJS. – AlliterativeAlice, Jun 7, 2015 at 17:48
  • PhantomJS is an external application with a JavaScript API. Also, some antivirus programs flag it as a threat and show ugly warning pop-ups. – baltazer, Jun 7, 2015 at 18:53

2 Answers


AngleSharp alone only provides an HTML and CSS parser. However, AngleSharp may be extended with JavaScript capabilities. Right now the package you've used (AngleSharp.Scripting.JavaScript) is experimental and more or less a proof of concept.
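To illustrate the first point, here is a minimal sketch of what the plain parser sees when a page builds part of its markup in a script. The HTML string, the element ids, and the `IsPresentAsync` helper are made up for this example, and it assumes a recent AngleSharp release where `Configuration.Default` and the `OpenAsync(req => req.Content(...))` overload are available:

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

public static class StaticParseDemo
{
    // Hypothetical helper: parses the markup WITHOUT a scripting engine
    // and reports whether the selector matches any element.
    public static async Task<bool> IsPresentAsync(string html, string selector)
    {
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(html));
        return document.QuerySelector(selector) != null;
    }

    public static async Task Main()
    {
        // Made-up page: the <a id='injected'> element only exists after the script runs.
        const string html = @"<body><div id='box'></div>
<script>
  var a = document.createElement('a');
  a.id = 'injected';
  document.getElementById('box').appendChild(a);
</script></body>";

        // The static element is part of the parsed tree.
        Console.WriteLine(await IsPresentAsync(html, "#box"));
        // The script-created element is not, because the script is never executed.
        Console.WriteLine(await IsPresentAsync(html, "#injected"));
    }
}
```

This is why a long selector that works in the browser's developer tools can return null here: the browser shows the DOM after scripts have run, while the parser only has the markup as delivered.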

The JavaScript files on the page are still too complex for the experimental support. I am working to enable support for such scenarios as soon as possible, but right now I would say that WebKit.NET is probably your best shot for solving your problem.

Another possible solution might be to use the C# driver for Selenium.
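A rough sketch of the Selenium route, in case it helps. It assumes the Selenium.WebDriver NuGet package, a local Chrome installation with a matching driver, and that the page URL is passed as the first command-line argument. The shortened selector is my guess at the relevant part of the one in the question, not something I have verified against the page:

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class SeleniumSketch
{
    static void Main(string[] args)
    {
        // Assumption: the address of the page to scrape is supplied by the caller.
        var url = args[0];

        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run Chrome without a visible window

        using (var driver = new ChromeDriver(options))
        {
            // Retry element lookups for up to 10 seconds so scripts have time to build the DOM.
            driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(10);
            driver.Navigate().GoToUrl(url);

            // Throws NoSuchElementException if the element never appears.
            var link = driver.FindElement(By.CssSelector("a.dwl-link.xlink"));
            Console.WriteLine(link.GetAttribute("data"));
        }
    }
}
```

Unlike the parser-only approach, this drives a real browser, so the scripts on the page run before the element is queried. The cost is an external browser dependency.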

Unrelated to the whole JavaScript topic: if you want to load external resources, you need to provide a proper (HTTP) requester. The easiest way to do that is by using the default one:

var config = new Configuration().WithDefaultLoader();
var document = await BrowsingContext.New(config).OpenAsync("http://www.fosshub./MKVToolNix.html");
// ...

With this configuration, external documents are loaded, but other resources (e.g., images, scripts, ...) are not.
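Put together as a complete program, that could look roughly like the sketch below. It assumes a recent AngleSharp release (where `Configuration.Default` is the usual starting point) and takes the page address from the command line. The shortened selector and the fallback message are illustrative only:

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

class LoaderSketch
{
    // Returning Task instead of using "async void" means the program waits for
    // the download, and any exception surfaces instead of being lost.
    static async Task Main(string[] args)
    {
        // Assumption: the address of the page is supplied by the caller.
        var address = args[0];

        // The default loader adds an HTTP requester so the document can be fetched.
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);
        var document = await context.OpenAsync(address);

        // QuerySelector returns null when nothing matches, which is the expected
        // outcome if the link is created by JavaScript.
        var anchor = document.QuerySelector("a.dwl-link.xlink");
        Console.WriteLine(anchor?.GetAttribute("data") ?? "(link not present in the static HTML)");
    }
}
```

The null check matters for this question in particular: calling `GetAttribute` directly on the result of `QuerySelector` throws a `NullReferenceException` when the element was never added to the document.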

AngleSharp is a text parser. If you want to scrape dynamic web pages with JS, you'll need a headless browser.

This answer provides a couple of options (at least one free and open source: WebKit.NET).
