PuppeteerSharp + AngleSharp Crawler in Practice: Scraping Autohome Data (2025-03-11)


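  • I looked at the DotNetSpider sample first, but DotNetSpider felt too heavy for this job; it is a fairly complete crawler framework.
  • After comparing the headless browsers listed below, I settled on PuppeteerSharp + AngleSharp for a sample crawler.
  • Like the blog post referenced above, this sample collects data from the Autohome page https://store.mall.autohome.com.cn/83106681.html.
  • PuppeteerSharp fetches the final page (that is, the page after its JavaScript has run), and AngleSharp parses the resulting HTML document.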

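Headless Browsers

  • A list of (almost) all headless web browsers in existence.
  • A headless browser is a web browser without a graphical user interface, controlled programmatically. It is used for automation, testing, and other purposes.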

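Browser Engines

  • These browser engines can fully render web pages or run JavaScript in a virtual DOM.

Name | About | Supported Languages | License
Chromium | CEF is an open source project based on the Google Chromium project. | JavaScript | BSD
Erik | A headless browser based on Kanna and WebKit. | Swift | MIT
jBrowserDriver | A Selenium-compatible headless browser written in pure Java. Based on WebKit. Works with any of the Selenium Server bindings. | Java | Apache License v2.0
PhantomJS [unmaintained] | PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast, native support for various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG. | JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R (via Selenium) | BSD 3-Clause
Splash | Splash is a JavaScript rendering service with an HTTP API. It is a lightweight browser implemented in Python using Twisted and Qt. | Any | BSD 3-Clause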

Multi-Drivers

  • These libraries can control multiple browser engines (commonly using Selenium).

Name | About | Supported Languages | License
CasperJS | CasperJS is an open source navigation scripting & testing utility written in JavaScript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). | JavaScript | MIT
Geb | Geb is a Groovy interface to WebDriver. | Groovy | Apache
Selenium | Selenium is a suite of tools to automate web browsers across many platforms. | JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R | Apache
Splinter | Splinter is an open source tool for testing web applications using Python. It lets you automate browser actions, such as visiting URLs and interacting with their items. | Python | -
SST | SST (selenium-simple-test) is a web test framework that uses Python to generate functional browser-based tests. | Python | -
Watir | The most elegant way to use Selenium WebDriver with Ruby. | Ruby | MIT

PhantomJS Drivers

  • These libraries control PhantomJS.

Name | About | Supported Languages | License
Ghostbuster | Automated browser testing via phantom.js, with all of the pain taken out! That means you get a real browser, with a real DOM, and can do real testing! | JavaScript | Not specified
jedi-crawler | Lightsabing Node/PhantomJS crawler; scrape dynamic content without the hassle | JavaScript | Not specified
Lotte | Lotte is a headless, automated testing framework built on top of PhantomJS and inspired by Ghostbuster. | JavaScript | MIT
phantompy | Phantompy is a headless WebKit engine with a powerful Pythonic API built on top of Qt5 WebKit | Python | LGPL-2.1
X-RAY | Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP) | JavaScript | MIT
Horseman | Promise based Node.js module for PhantomJS. Features chainable API, understandable control-flow, support for multiple tabs, and built-in jQuery. | JavaScript | MIT

Chromium Drivers

  • These libraries control Chromium.

Name | About | Supported Languages | License
Awesomium | Chromium-based headless browser engine | C++, | Free/Commercial
Headless Chromium | Chromium feature activated with the --headless flag, currently available in the nightly build of Chromium, not yet released | C++ | Open source
Puppeteer | Headless Chrome Node API from the Chrome DevTools team | JavaScript | Apache
PuppeteerSharp | PuppeteerSharp is a port of the official Headless Chrome Node.JS Puppeteer API | | MIT
chrome-remote-interface | Chrome Debugging Protocol interface for Node.js | JavaScript | MIT
Chromy | Features chainable API, mobile emulation, fundamental API such as JavaScript evaluation. | JavaScript | MIT
chromedp | A faster, simpler way to drive browsers (Chrome, Edge, Safari, Android, etc) without external dependencies (ie, Selenium, PhantomJS, etc) using the Chrome Debugging Protocol. | Go | MIT
Chromeless | Chrome automation made simple. Runs locally or headless on AWS Lambda. | JavaScript | MIT

WebKit Drivers

  • These drivers control an in-process instance of WebKit.

Name | About | Supported Languages | License
Browserjet | Runs a custom build of WebKit, controlled by a Node.js interface. | JavaScript | Not specified
ghost.py | ghost.py is a WebKit web client written in Python. | Python | MIT
headless_browser | Headless browser based on WebKit written in C++. | C++ | Not specified
Jabba-Webkit | Jabba's headless WebKit browser for scraping AJAX-powered webpages. | Python | Not specified
Jasmine-Headless-Webkit | jasmine-headless-webkit uses the QtWebKit widget to run your specs without needing to render a pixel. | Python, JavaScript, Ruby | Free
Python-Webkit | Python-Webkit is a Python extension to WebKit to add full, complete access to WebKit's DOM | Python | GNU
Spynner | Programmatic web browsing module with AJAX support for Python | Python | Not specified
Webloop | Scriptable, headless WebKit with a Go API. | Go | BSD 3-Clause
wkhtmltopdf wkhtmltox wkhtmltoimage | Command line tool rendering HTML into PDF and other image formats. | shell, C | LGPLv3
WKZombie | Functional headless browser (with JSON support) for iOS using WebKit and hpple/libxml2. | Swift | MIT

Other Drivers

  • These libraries control lesser-known browsers or web libraries provided by the operating system.

Name | About | Supported Languages | License
Nightmare | Nightmare is a high-level browser automation library built as an easier alternative to PhantomJS. It runs on the Electron engine. | JavaScript | MIT
grope | A RubyCocoa interface to the macOS WebKit Framework | RubyCocoa | MIT
SlimerJS | SlimerJS is similar to PhantomJS, except that it runs Gecko, the browser engine of Mozilla Firefox, instead of WebKit (and it is not yet truly headless). | JavaScript | Mozilla 2.0
SpecterJS | A scriptable headless Internet Explorer port of PhantomJS. | JavaScript | MIT
trifleJS | A headless Internet Explorer browser using the WebBrowser Class with a JavaScript API running on the V8 engine. | JavaScript | MIT

Fake Browser Engines

  • These libraries are typically simple browsers or HTML-only browsers.

Name | About | Supported Languages | License
AngleSharp | HTML parsing library | | MIT
Guillotine | A headless browser, written in C# | | LGPL-3.0
benv | Stub a browser environment in Node.js and headlessly test your client-side code. | JavaScript | MIT
browser.rb | Headless Ruby browser on top of Nokogiri and TheRubyRacer | Ruby | Not specified
BrowserKit | BrowserKit simulates the behavior of a web browser. | PHP | MIT
DamonJS | Bot navigating URLs and doing tasks. | JavaScript | Apache
Headless | Headless browser support for fast web acceptance testing in | | MIT
HeadlessBrowser | A very miniature headless browser, for testing the DOM on Node.js | JavaScript | Not specified
HtmlUnit | HtmlUnit is a "GUI-Less browser for Java programs". | Java | Apache
Jaunt | Java Web Scraping & Automation API | Java | Not specified
JSDom | A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js. | JavaScript | MIT
MechanicalSoup | A Python library for automating interaction with websites. | Python | MIT
mechanize | Stateful programmatic web browsing. | Python | BSD 3-Clause, ZPL 2.1
node-as-browser | Create a browser-like environment within Node.js | JavaScript | MIT
RoboBrowser | A simple, Pythonic library for browsing the web without a standalone web browser. | Python | BSD 3-Clause
SimpleBrowser | A flexible and intuitive web browser engine designed for automation tasks. Built on the 4 framework. | | BSD 3-Clause
stanislaw | Naive, mechanize-like HTML parser/form driver. | Python | Not specified
twill | Twill is a simple language that interacts with basic HTML pages (no JavaScript support). | Python | MIT
WeasyPrint | WeasyPrint is a visual rendering engine for HTML and CSS that can export to PDF. It aims to support web standards for printing. | Python | BSD 3-Clause
WWW::Mechanize | Headless browser for Perl with many plugins and extensions, notably Test::WWW::Mechanize for testing | Perl | Perl 5
X-RAY | Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP) | JavaScript | MIT
Xidel (Internet Tools) | An XQuery-based CLI web scraper for static X/HTML pages and JSON APIs. | FreePascal, XQuery | GPL-2
Zombie.js | Zombie.js is a lightweight framework for testing client-side JavaScript code in a simulated environment. No browser required. | JavaScript | MIT

Runs in a Browser

Name | About | Supported Languages | License
DalekJS [unmaintained; TestCafé is recommended instead] | Automated cross browser testing with JavaScript. | JavaScript | MIT
TestCafé | Automated browser testing for the modern web development stack. | JavaScript | MIT
Sahi | Sahi is a cross-browser automation/testing tool with the facility to record and play back scripts. | JavaScript, Java, Ruby, PHP | Apache / Commercial
WatiN | Web Application Testing In | | Apache 2.0

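Misc Tools

Name | About | Supported Languages | License
browser-launcher | Detect and launch browser versions, headlessly or otherwise | JavaScript | MIT

  • If the data you need does not depend on JavaScript to load, AngleSharp alone is enough.
  • When the data is loaded by JavaScript, though, you need a real headless browser component.
  • AngleSharp currently runs only simple JavaScript; anything slightly more complex fails. Full JavaScript support is reportedly planned, so stay tuned.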
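
For the static case described above, AngleSharp can parse an HTML string directly with no browser involved. The following is a minimal sketch under stated assumptions: the class name StaticParseSketch, the helper ExtractTitles, and the sample markup are hypothetical, and the markup only imitates the carbox selectors used later in this article rather than reproducing the real Autohome page.

```csharp
using System.Collections.Generic;
using AngleSharp.Html.Parser;

internal static class StaticParseSketch
{
    // Hypothetical static markup; it only imitates the class names queried in this article.
    public const string SampleHtml =
        "<ul>" +
        "<li class=\"carbox\"><a href=\"/item/1.html\"><div class=\"carbox-title\"> Sample car A </div></a></li>" +
        "<li class=\"carbox\"><a href=\"/item/2.html\"><div class=\"carbox-title\">Sample car B</div></a></li>" +
        "</ul>";

    // Parses the HTML string and returns the trimmed title of every carbox item.
    public static List<string> ExtractTitles(string html)
    {
        var document = new HtmlParser().ParseDocument(html);
        var titles = new List<string>();

        foreach (var carbox in document.QuerySelectorAll("li.carbox"))
        {
            // ?. keeps a missing title element from throwing; such items are skipped.
            var title = carbox.QuerySelector("div.carbox-title")?.TextContent.Trim();
            if (!string.IsNullOrEmpty(title))
            {
                titles.Add(title);
            }
        }

        return titles;
    }
}
```

Calling StaticParseSketch.ExtractTitles(StaticParseSketch.SampleHtml) returns the two sample titles. The same call returns an empty list for a page whose items are injected by JavaScript, which is why the article brings in a headless browser.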

Code

    /*
     * A PuppeteerSharp + AngleSharp crawler console app sample
     */
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using AngleSharp;
    using AngleSharp.Dom;
    using AngleSharp.Html.Parser;
    using Newtonsoft.Json;
    using PuppeteerSharp;
    
    namespace CrawlerSamples
    {
        internal class Program
        {
            private const string Url = "https://store.mall.autohome.com.cn/83106681.html";
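            //This listing needs a PuppeteerSharp release in which BrowserFetcher.DefaultRevision is an int constant;
            //later releases change this API, so adjust these two lines if they do not compile.
            private const int ChromiumRevision = BrowserFetcher.DefaultRevision;
    
            private static async Task Main(string[] args)
            {
                //Download the Chromium revision package (slow on the first run; see the notes below)
                await new BrowserFetcher().DownloadAsync(ChromiumRevision);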
    
                //Test AngleSharp
                await TestAngleSharp();
    
                Console.ReadKey();
            }
    
            private static async Task TestAngleSharp()
            {
                /*
                 * Alternative: load the HTML document with AngleSharp alone.
                 * TODO: WithJavaScript requires the AngleSharp.Scripting.Javascript NuGet package.
                 * Note: JavaScript support is experimental and does not handle complex scripts.
                 */
                //IConfiguration config = Configuration.Default.WithDefaultLoader().WithCss().WithCookies().WithJavaScript();
                //IBrowsingContext context = BrowsingContext.New(config);
                //IDocument document = await context.OpenAsync(Url);
    
                //Load the HTML document with PuppeteerSharp
                var htmlString = await TestPuppeteerSharp();
    
                /*
                 * Parse the HTML document string
                 */
                var context = BrowsingContext.New(Configuration.Default);
                var parser = context.GetService<IHtmlParser>();
                var document = parser.ParseDocument(htmlString);
    
                //Select the list of carbox elements
                var carboxList = document.QuerySelectorAll("div.shop-content div.content div.list li.carbox");
    
                var carModelList = new List<CarModel>();
                foreach (var carbox in carboxList)
                {
                    //Parse the element and convert it to a car model object
                    var model = CreateModelWithAngleSharp(carbox);
                    carModelList.Add(model);
    
                    //Print to the console
                    var jsonString = JsonConvert.SerializeObject(model);
                    Console.WriteLine(jsonString);
                    Console.WriteLine();
                }
    
                Console.WriteLine("Total count:" + carModelList.Count);
            }
    
            private static async Task<string> TestPuppeteerSharp()
            {
                //Enable headless mode
                var launchOptions = new LaunchOptions { Headless = true };
                //Start the headless browser
                var browser = await Puppeteer.LaunchAsync(launchOptions);
    
                //Open a new tab
                var page = await browser.NewPageAsync();
                //Navigate to the URL
                await page.GoToAsync(Url);
    
                //Get the HTML content of the rendered page
                var htmlString = await page.GetContentAsync();
    
                #region Dispose resources
                //Close the tab
                await page.CloseAsync();
    
                //Close the headless browser; any remaining pages are closed with it.
                await browser.CloseAsync();
                #endregion
    
                return htmlString;
            }
    
            private static CarModel CreateModelWithAngleSharp(IParentNode node)
            {
                //Use ?. so that a missing element yields null instead of a NullReferenceException
                var model = new CarModel
                {
                    Title = node.QuerySelector("a div.carbox-title")?.TextContent,
                    ImageUrl = node.QuerySelector("a div.carbox-carimg img")?.GetAttribute("src"),
                    ProductUrl = node.QuerySelector("a")?.GetAttribute("href"),
                    Tip = node.QuerySelector("a div.carbox-tip")?.TextContent,
                    OrdersNumber = node.QuerySelector("a div.carbox-number span")?.TextContent
                };
    
                return model;
            }
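        }
    
        //Assumed shape: the original listing does not show CarModel, so these
        //string properties are inferred from CreateModelWithAngleSharp above.
        internal class CarModel
        {
            public string Title { get; set; }
            public string ImageUrl { get; set; }
            public string ProductUrl { get; set; }
            public string Tip { get; set; }
            public string OrdersNumber { get; set; }
        }
    }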
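
Because the listing reads the page content right after navigation, items that JavaScript inserts late may not be in the HTML yet. One possible hardening, sketched below, waits for a selector before reading the content and closes the browser in a finally block so that a failed navigation does not leave a Chromium process behind. The class name RenderedPageSketch, the method GetRenderedHtmlAsync, and the readySelector parameter are hypothetical names for illustration; the sketch also assumes a Chromium build has already been downloaded, as in the listing above.

```csharp
using System.Threading.Tasks;
using PuppeteerSharp;

internal static class RenderedPageSketch
{
    // Returns the page HTML once an element matching readySelector exists.
    // readySelector is a caller-supplied CSS selector, for example "li.carbox".
    public static async Task<string> GetRenderedHtmlAsync(string url, string readySelector)
    {
        var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
        try
        {
            var page = await browser.NewPageAsync();
            await page.GoToAsync(url);

            // Waits until the selector matches; throws if the default timeout elapses first.
            await page.WaitForSelectorAsync(readySelector);

            return await page.GetContentAsync();
        }
        finally
        {
            // Runs on success and on failure; closing the browser also closes its pages.
            await browser.CloseAsync();
        }
    }
}
```

In the listing above, the body of TestPuppeteerSharp could be swapped for a call such as RenderedPageSketch.GetRenderedHtmlAsync(Url, "li.carbox"), if waiting for the item list suits the page.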

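Results

Notes

  • Note that on the first run, this line:
  • await new BrowserFetcher().DownloadAsync(ChromiumRevision);
  • downloads a portable browser package, download-Win64-536395.zip, to your local machine; it unpacks to a Chromium browser. Expect to wait a while for this to finish.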
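
If a Chrome or Chromium binary is already installed, the first-run download can in principle be skipped by pointing PuppeteerSharp at that binary through LaunchOptions.ExecutablePath. The following is a minimal sketch, not the author's method: the class LaunchOptionsSketch, its two helpers, and the idea of passing the path in from the caller are assumptions made for illustration.

```csharp
using System.IO;
using PuppeteerSharp;

internal static class LaunchOptionsSketch
{
    // chromePath is a hypothetical, caller-supplied path to an existing Chrome/Chromium binary.
    // If the file does not exist, ExecutablePath is left unset and the downloaded build is used.
    public static LaunchOptions Build(string chromePath)
    {
        var options = new LaunchOptions { Headless = true };

        if (!string.IsNullOrEmpty(chromePath) && File.Exists(chromePath))
        {
            options.ExecutablePath = chromePath;
        }

        return options;
    }

    // True when no local binary was selected, so the BrowserFetcher download is still needed.
    public static bool NeedsDownload(LaunchOptions options)
    {
        return string.IsNullOrEmpty(options.ExecutablePath);
    }
}
```

A caller might build the options first, run the BrowserFetcher download only when NeedsDownload returns true, and then pass the options to Puppeteer.LaunchAsync. Whether an installed browser version works with a given PuppeteerSharp release is not guaranteed, so treat this as something to try rather than a drop-in replacement.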

Source

Permalink study/anglesharp/20250311-005/index.txt · Last modified: 2025/03/11 10:58 by jethro
