A first crawler in C#, based on the .NET HtmlAgilityPack library

While browsing some .NET open-source libraries on GitHub, I came across several crawler-related ones

Last updated 6/9/2022 10:54 PM
黑哥聊dotNet
Estimated reading time: 5 minutes
Category
.NET
Tags
.NET C# open source GitHub

background

A few days ago I had some free time and was browsing .NET open-source libraries on GitHub. I noticed the crawler-related ones, so I joined a QQ group, where the regulars were joking that the better your crawling, the sooner you go to jail. I wanted to build something crawler-related myself, but crawling is risky business and I don't dare scrape other people's websites casually, so I found a friend and practiced on his website!

practice

For .NET there are quite a few crawler-related libraries, so I chose HtmlAgilityPack for a crawler exercise!

**So, what exactly is a crawler?**

In short:

The basic process of a crawler is: download data (send an HTTP request and receive the response) -> parse the returned content (plain text, JSON, or HTML) -> store the parsed data
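These three steps can be sketched without any third-party library. Everything here (the URL, the regex-based "parsing", the output file name) is a placeholder for illustration; a real crawler would use a proper HTML parser for step 2:

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class CrawlerSketch
{
    static async Task Main()
    {
        // 1. Download: send an HTTP request and read the response body.
        using var client = new HttpClient();
        string html = await client.GetStringAsync("https://dotnet9.com/");

        // 2. Parse: here a simple regex pulls out the page <title>;
        //    a real crawler would use an HTML parser instead.
        var match = Regex.Match(html, @"<title>(.*?)</title>",
                                RegexOptions.IgnoreCase | RegexOptions.Singleline);
        string title = match.Success ? match.Groups[1].Value : "(no title)";

        // 3. Store: persist the parsed data somewhere, e.g. a text file.
        File.WriteAllText("result.txt", title);
        Console.WriteLine(title);
    }
}
```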

To learn a framework, we naturally start from its official documentation: https://html-agility-pack.net/

HTML parser

  • From File (load an HTML document from a file)
  • From String (load an HTML document from the specified string)
  • From the Web (get an HTML document from an Internet resource)
  • From Browser (load an HTML document through a browser, for pages that need JavaScript to render)
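For reference, here is a quick sketch of how each loading option maps onto the HtmlAgilityPack API (requires the HtmlAgilityPack NuGet package; the file path and URL are placeholders):

```csharp
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class LoadDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();

        // From File: load an HTML document from a local file (placeholder path).
        doc.Load("page.html");

        // From String: load an HTML document from a string in memory.
        doc.LoadHtml("<html><body><p>Hello</p></body></html>");

        // From the Web: download and load an HTML document from a URL.
        var web = new HtmlWeb();
        var webDoc = web.Load("https://dotnet9.com/");

        // From Browser: HtmlWeb also offers LoadFromBrowser for pages
        // that only render their content via JavaScript (Windows only).
    }
}
```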

So I chose From the Web to load our HTML document; the code is as follows:

// The page we want to crawl.
var url = @"https://dotnet9.com/";
HtmlWeb web = new HtmlWeb();
// Download the page and load it into an HtmlDocument.
var htmlDoc = web.Load(url);

Now that we have the HTML document, the next step is to parse its content.

HTML selectors

  • SelectNodes() (select the list of nodes matching the XPath expression)
  • SelectSingleNode(String) (select the first HtmlNode that matches the XPath expression)
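A tiny self-contained example (again assuming the HtmlAgilityPack NuGet package, run against an in-memory snippet) shows the difference between the two selectors:

```csharp
using System;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class SelectorDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li class='menu'>A</li><li class='menu'>B</li></ul>");

        // SelectNodes: every node matching the XPath (null if nothing matches).
        var all = doc.DocumentNode.SelectNodes("//li[@class='menu']");
        Console.WriteLine(all.Count);        // 2

        // SelectSingleNode: only the first match.
        var first = doc.DocumentNode.SelectSingleNode("//li[@class='menu']");
        Console.WriteLine(first.InnerText);  // A
    }
}
```

Note that `SelectNodes` returns `null` rather than an empty collection when nothing matches, so a null check is worthwhile before iterating.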

Open the website and find the content we want to crawl. Today we will crawl all the articles under the featured-albums section of this site.

Open the browser's developer tools and we can see that the featured-album entry is an a tag. Looking one level up, its parent element is an li, and the li's parent is a ul, so we can grab that node:

var allNodes = htmlDoc.DocumentNode.SelectNodes("//ul[@id='starlist']//li[@class='menu']");

Of course, we can also use an absolute XPath to get the node content:

var singNodes = htmlDoc.DocumentNode.SelectSingleNode("/html[1]/body[1]/header[1]/div[3]/nav[1]/ul[1]/li[3]//ul[1]");

Next we get the addresses of the sub-menus under this featured album. Since the href attribute of an a tag specifies the link target, the first step is to collect all the links under that sub-menu!

var liNodes = htmlDoc.DocumentNode
    .SelectSingleNode("/html[1]/body[1]/header[1]/div[3]/nav[1]/ul[1]/li[3]//ul[1]")
    .ChildNodes.Where(o => o.Name == "li");

// Collect the href of the first <a> inside each <li>.
List<string> lstUrl = new List<string>();
foreach (var item in liNodes)
{
    var aNode = item.ChildNodes.First(o => o.Name == "a");
    lstUrl.Add(aNode.Attributes["href"].Value);
}

Open any sub-menu and you can see the article titles, descriptions, and images. That is exactly what we want! The parsing approach is the same as before; the code is as follows:

foreach (var item in lstUrl)
{
    htmlDoc = web.Load("https://dotnet9.com" + item);
    var resultNodes = htmlDoc.DocumentNode
        .SelectSingleNode("//div[@class='pics-list-box whitebg']//ul")
        .ChildNodes.Where(o => o.Name == "li");
    foreach (var itemResultNodes in resultNodes)
    {
        // WebData is a simple POCO with Title, Desc, Img and Url string properties.
        WebData webData = new WebData();
        var aNode = itemResultNodes.ChildNodes.First(o => o.Name == "a");
        webData.Url = aNode.Attributes["href"].Value;
        webData.Title = aNode.ChildNodes["h2"].InnerText;
        webData.Desc = aNode.ChildNodes["p"].InnerText;
        webData.Img = aNode.ChildNodes["i"].ChildNodes["img"].Attributes["src"].Value;
        Console.WriteLine($"Title:{webData.Title}-Desc:{webData.Desc}-Img:{webData.Img}-{webData.Url}\r\n");
    }
}

And with that we get what we want! Run the code, and our first crawler works.
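The last step of the pipeline described earlier is storing the parsed data. As a sketch, the scraped items could be serialized to a JSON file with the built-in System.Text.Json; the WebData shape below is assumed from the properties used above, and the file name is arbitrary:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Assumed shape of WebData, matching the properties used above.
public class WebData
{
    public string Title { get; set; }
    public string Desc { get; set; }
    public string Img { get; set; }
    public string Url { get; set; }
}

public static class Storage
{
    // Serialize the scraped items to an indented JSON file.
    public static void Save(List<WebData> items, string path)
    {
        var json = JsonSerializer.Serialize(items,
            new JsonSerializerOptions { WriteIndented = true });
        File.WriteAllText(path, json);
    }
}
```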

summary

I improvised and wrote my first crawler. If you have a better approach, you are welcome to share it: joy shared beats joy kept to yourself. That's all for this article; I hope it is helpful to you.

Finally, a statement: technology itself is innocent, but if you use it to crawl other people's private or business data, you are flouting the law. Please hold on to your bottom line!
