Go Colly

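Last modified April 11, 2024

In this article we show how to do web scraping and crawling in Golang.

Colly is a fast web scraping and crawling framework for Golang. It can be used for tasks such as data mining, data processing, or archiving.

Colly handles cookies and sessions automatically. It supports synchronous, asynchronous, and parallel scraping, as well as caching, robots.txt, and distributed scraping.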

Colly Collector

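The Collector is Colly's main interface. It manages network communication and executes the attached callbacks while a collector job runs. A collector job is started with the Visit function.

Colly is an event-based framework. We place our code in the various event handlers.

Colly has the following event handlers:

OnRequest - called before a request is sent
OnError - called if an error occurs during the request
OnResponseHeaders - called after the response headers are received
OnResponse - called after the response is received
OnHTML - called right after OnResponse if the received content is HTML
OnXML - called right after OnHTML if the received content is HTML or XML
OnScraped - called after the OnXML callbacks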

Go Colly simple example

We start with a simple example.

title.go

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    c := colly.NewCollector()

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    c.Visit("http://webcode.me")
}

The program retrieves the title of a website.

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

First, we import the library.

c := colly.NewCollector()

We create a collector.

c.OnHTML("title", func(e *colly.HTMLElement) {
    fmt.Println(e.Text)
})

With OnHTML, we register an anonymous function that is called for each title tag. We print the text of the title.

c.Visit("http://webcode.me")

The Visit function starts the collector's scraping job by creating a request to the specified URL.

$ go run title.go
My html page

Go Colly event handlers

We can react to different events.

callbacks.go

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    c := colly.NewCollector()
    c.UserAgent = "Go program"

    c.OnRequest(func(r *colly.Request) {

        for key, value := range *r.Headers {
            fmt.Printf("%s: %s\n", key, value)
        }

        fmt.Println(r.Method)
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {

        fmt.Println("-----------------------------")

        fmt.Println(e.Text)
    })

    c.OnResponse(func(r *colly.Response) {

        fmt.Println("-----------------------------")

        fmt.Println(r.StatusCode)

        for key, value := range *r.Headers {
            fmt.Printf("%s: %s\n", key, value)
        }
    })

    c.Visit("http://webcode.me")
}

In the code example, we register handlers for the OnRequest, OnHTML, and OnResponse events.

c.OnRequest(func(r *colly.Request) {

    for key, value := range *r.Headers {
        fmt.Printf("%s: %s\n", key, value)
    }

    fmt.Println(r.Method)
})

In the OnRequest handler, we print the request headers and the request method.

c.OnHTML("title", func(e *colly.HTMLElement) {

    fmt.Println("-----------------------------")

    fmt.Println(e.Text)
})

We process the title tag in the OnHTML handler.

c.OnResponse(func(r *colly.Response) {

    fmt.Println("-----------------------------")

    fmt.Println(r.StatusCode)

    for key, value := range *r.Headers {
        fmt.Printf("%s: %s\n", key, value)
    }
})

Finally, in the OnResponse handler, we print the status code of the response and its headers.

$ go run callbacks.go
User-Agent: [Go program]
GET
-----------------------------
200
Connection: [keep-alive]
Access-Control-Allow-Origin: [*]
Server: [nginx/1.6.2]
Date: [Sun, 23 Jan 2022 14:13:04 GMT]
Content-Type: [text/html]
Last-Modified: [Sun, 23 Jan 2022 10:39:25 GMT]
-----------------------------
My html page

Go Colly scrape local files

We can scrape files on the local disk.

words.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document title</title>
</head>
<body>
<p>List of words</p>
<ul>
    <li>dark</li>
    <li>smart</li>
    <li>war</li>
    <li>cloud</li>
    <li>park</li>
    <li>cup</li>
    <li>worm</li>
    <li>water</li>
    <li>rock</li>
    <li>warm</li>
</ul>
<footer>footer for words</footer>
</body>
</html>

We have this HTML file.

local.go

package main

import (
    "fmt"
    "net/http"

    "github.com/gocolly/colly/v2"
)

func main() {

    t := &http.Transport{}
    t.RegisterProtocol("file", http.NewFileTransport(http.Dir(".")))

    c := colly.NewCollector()
    c.WithTransport(t)

    words := []string{}

    c.OnHTML("li", func(e *colly.HTMLElement) {
        words = append(words, e.Text)
    })

    c.Visit("file://./words.html")

    for _, p := range words {
        fmt.Printf("%s\n", p)
    }
}

To scrape a local file, we must register a handler for the file protocol on the transport. We scrape all the words from the list.

$ go run local.go
dark
smart
war
cloud
park
cup
worm
water
rock
warm
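The file protocol registration is a feature of the standard net/http package rather than of Colly. The following sketch shows the same mechanism with a plain http.Client, which may help when debugging the transport on its own. The file name list.html, its content, and the helper names are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
)

// readViaFileScheme fetches name from dir through an http.Client whose
// transport has a handler registered for the file scheme.
func readViaFileScheme(dir, name string) (string, error) {
	t := &http.Transport{}
	t.RegisterProtocol("file", http.NewFileTransport(http.Dir(dir)))
	client := &http.Client{Transport: t}

	// The path part of the URL is looked up inside dir.
	resp, err := client.Get("file:///" + name)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// demo writes a small hypothetical HTML file into a temporary
// directory and reads it back through the file scheme.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "file-scheme-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	page := "<ul><li>dark</li><li>smart</li></ul>"
	if err := os.WriteFile(filepath.Join(dir, "list.html"), []byte(page), 0o644); err != nil {
		return "", err
	}

	return readViaFileScheme(dir, "list.html")
}

func main() {
	body, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(body)
}
```

Passing the same transport to a collector with WithTransport, as local.go does, lets Colly issue requests with the file scheme in the same way.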

Top retailers

In the following example, we fetch the top retailers in the USA.

retail.go

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    c := colly.NewCollector()

    i := 0
    scan := true

    c.OnHTML("#stores-list--section-16266 td.data-cell-0,td.data-cell-1,td.data-cell-2,td.data-cell-3",
        func(e *colly.HTMLElement) {

            if scan {

                fmt.Printf("%s ", e.Text)
            }
            i++

            if i%4 == 0 && i < 40 {
                fmt.Println()
            }

            if i == 40 {
                scan = false
                fmt.Println()
            }
        })

    c.Visit("https://nrf.com/resources/top-retailers/top-100-retailers/top-100-retailers-2019")
}

The example prints the top ten US retailers for 2019.

c.OnHTML("#stores-list--section-16266 td.data-cell-0,td.data-cell-1,td.data-cell-2,td.data-cell-3",
    func(e *colly.HTMLElement) {

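We must look at the HTML source and determine which IDs or classes to look for. In our case, we pick four columns from the table. The counter i starts a new line after every fourth cell, and the scan flag stops the output after 40 cells, that is, after ten rows.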

$ go run retail.go
1 Walmart $387.66  Bentonville, AR
2 Amazon.com $120.93  Seattle, WA
3 The Kroger Co. $119.70  Cincinnati, OH
4 Costco $101.43  Issaquah, WA
5 Walgreens Boots Alliance $98.39  Deerfield, IL
6 The Home Depot $97.27  Atlanta, GA
7 CVS Health Corporation $83.79  Woonsocket, RI
8 Target $74.48  Minneapolis, MN
9 Lowe's Companies $64.09  Mooresville, NC
10 Albertsons Companies $59.71  Boise, ID
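The counter logic in retail.go mixes printing with row bookkeeping. As an alternative, the cell texts could first be collected into a slice and then grouped into rows. The sketch below shows only the grouping step, without any scraping; the helper name groupRows is hypothetical and the sample cells are copied from the output above.

```go
package main

import "fmt"

// groupRows splits a flat list of cell texts into rows of cols cells
// and keeps at most maxRows complete rows.
func groupRows(cells []string, cols, maxRows int) [][]string {
	rows := [][]string{}

	for start := 0; start+cols <= len(cells) && len(rows) < maxRows; start += cols {
		rows = append(rows, cells[start:start+cols])
	}

	return rows
}

func main() {
	// Sample cell texts in document order: rank, name, sales, headquarters.
	cells := []string{
		"1", "Walmart", "$387.66", "Bentonville, AR",
		"2", "Amazon.com", "$120.93", "Seattle, WA",
		"3", "The Kroger Co.", "$119.70", "Cincinnati, OH",
	}

	// Keep only the first two rows of four columns each.
	for _, row := range groupRows(cells, 4, 2) {
		fmt.Println(row[0], row[1], row[2], row[3])
	}
}
```

In a scraper, the slice would be filled inside the OnHTML handler and grouped in OnScraped or after Visit returns.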

Go Colly crawl links

In more complex tasks, we need to crawl the links found in documents.

links.go

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    c := colly.NewCollector()

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {

        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })

    c.Visit("http://webcode.me/small.html")
}

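For our example, we have created a small HTML document that contains three links. The linked documents contain no further links. We print the title of each document.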

c.OnHTML("a[href]", func(e *colly.HTMLElement) {

    link := e.Attr("href")
    c.Visit(e.Request.AbsoluteURL(link))
})

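In the OnHTML handler, we find all links, get their href attributes, and visit them. Because an href may be relative, we first convert it to an absolute URL with AbsoluteURL.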

$ go run links.go
Small
My html page
Something.
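The conversion of a relative href into an absolute URL can be illustrated with the standard net/url package. This is only a sketch of the general idea and not Colly's own code; the helper name resolve and the sample href values are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// resolve turns href into an absolute URL, interpreting it relative
// to the URL of the page on which it was found.
func resolve(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}

	// Parse resolves href in the context of base.
	abs, err := base.Parse(href)
	if err != nil {
		return "", err
	}

	return abs.String(), nil
}

func main() {
	page := "http://webcode.me/small.html"

	// Hypothetical href values: relative, root-relative, and absolute.
	for _, href := range []string{"news.html", "/about", "https://example.com/"} {
		abs, err := resolve(page, href)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s -> %s\n", href, abs)
	}
}
```

A relative href is resolved against the directory of the page, a root-relative one against the host, and an absolute one is returned unchanged.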

Stackoverflow questions

In the next example, we scrape Stackoverflow questions.

stack.go

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

type question struct {
    title   string
    excerpt string
}

func main() {

    c := colly.NewCollector()
    qs := []question{}

    c.OnHTML("div.summary", func(e *colly.HTMLElement) {

        q := question{}
        q.title = e.ChildText("a.question-hyperlink")
        q.excerpt = e.ChildText(".excerpt")

        qs = append(qs, q)

    })

    c.OnScraped(func(r *colly.Response) {
        for idx, q := range qs {

            fmt.Println("---------------------------------")
            fmt.Println(idx + 1)
            fmt.Printf("Q: %s \n\n", q.title)
            fmt.Println(q.excerpt)
        }
    })

    c.Visit("https://stackoverflow.com/questions/tagged/perl")
}

The program prints the current Perl questions from the main page.

type question struct {
    title   string
    excerpt string
}

We create a structure that holds the title of a question and its excerpt.

qs := []question{}

We create a slice to store the questions.

c.OnHTML("div.summary", func(e *colly.HTMLElement) {

    q := question{}
    q.title = e.ChildText("a.question-hyperlink")
    q.excerpt = e.ChildText(".excerpt")

    qs = append(qs, q)

})

For each question summary, we fill a structure with the scraped data and append it to the slice.

c.OnScraped(func(r *colly.Response) {
    for idx, q := range qs {

        fmt.Println("---------------------------------")
        fmt.Println(idx + 1)
        fmt.Printf("Q: %s \n\n", q.title)
        fmt.Println(q.excerpt)
    }
})

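OnScraped is called after all the other handlers for the page have finished. At this point, we can iterate over the slice and print all the scraped data.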

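Go Colly async mode

Colly works in synchronous mode by default. We enable asynchronous mode with the Async function. In asynchronous mode, we must call Wait to wait for the collector jobs to finish.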

async.go

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    urls := []string{
        "http://webcode.me",
        "https://example.com",
        "http://httpbin.org",
        "https://perl.net.cn",
        "https://php.ac.cn",
        "https://pythonlang.cn",
        "https://vscode.js.cn",
        "https://clojure.org",
    }

    c := colly.NewCollector(
        colly.Async(),
    )

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    for _, url := range urls {

        c.Visit(url)
    }

    c.Wait()

}

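Note that the order of the returned titles may differ each time we run the program, because the requests complete in a nondeterministic order.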

Source

Go Colly - scraping framework for Golang

In this article we have done web scraping and crawling in Golang with Colly.

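Author

My name is Jan Bodnar, and I am a passionate programmer with extensive programming experience. I have been writing programming articles since 2007. To date, I have written over 1,400 articles and 8 e-books. I have more than ten years of experience in teaching programming.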
