Reading a web page in Java

Last modified January 27, 2024

This tutorial presents several ways to read a web page in Java. It contains seven examples that download the HTML source of a small web page.

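Java tools for reading web pages

Java has built-in tools and third-party libraries for reading and downloading web pages. In the examples, we use HttpClient, URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.

In the following examples, we download the HTML source of the small webcode.me web page.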

Reading a web page with HttpClient

Java 11 introduced the HttpClient library.

com/zetcode/ReadWebPage.java

package com.zetcode;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReadWebPage {

    public static void main(String[] args) throws IOException, InterruptedException {

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://webcode.me"))
                .GET() // GET is default
                .build();

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());
    }
}

We use the Java HttpClient to download the web page.

HttpClient client = HttpClient.newHttpClient();

A new HttpClient is created with the newHttpClient factory method.

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("http://webcode.me"))
    .build();

We build a request to the web page; it is sent synchronously in the next step. The default method is GET.
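
The request builder accepts more than a URI. The following is a minimal sketch, not part of the original example, that adds a User-Agent header and a timeout to the request and then inspects the result. The class name BuildRequestEx, the header value, and the 10-second timeout are illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class BuildRequestEx {

    // Builds a GET request; the header value and the timeout are illustrative choices.
    static HttpRequest buildRequest(String url) {

        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("User-Agent", "Java HttpClient")
                .timeout(Duration.ofSeconds(10)) // fail if no response arrives in time
                .build(); // GET() is not called: GET is the default method
    }

    public static void main(String[] args) {

        HttpRequest request = buildRequest("http://webcode.me");

        // Building a request does not touch the network, so this runs offline.
        System.out.println(request.method());
        System.out.println(request.uri());
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```

Since nothing is sent until the request is passed to a client, the built request can be inspected or reused freely.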

HttpResponse<String> response = client.send(request,
    HttpResponse.BodyHandlers.ofString());

System.out.println(response.body());

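We send the request, retrieve the content of the response, and print it to the console. Since we expect an HTML response in the form of a string, we use HttpResponse.BodyHandlers.ofString.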
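
The example above prints the body regardless of the response status. As a hedged sketch of a more defensive variant, the following code sets timeouts, follows redirects, and accepts only 2xx status codes. The class name ReadWebPageChecked, the helper requireSuccess, and the timeout values are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ReadWebPageChecked {

    // Hypothetical helper: returns the body for 2xx responses, rejects all others.
    static String requireSuccess(int statusCode, String body) {

        if (statusCode < 200 || statusCode > 299) {
            throw new IllegalStateException("Unexpected HTTP status: " + statusCode);
        }

        return body;
    }

    public static void main(String[] args) {

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://webcode.me"))
                .timeout(Duration.ofSeconds(10))
                .build();

        try {
            HttpResponse<String> response = client.send(request,
                    HttpResponse.BodyHandlers.ofString());

            System.out.println(requireSuccess(response.statusCode(), response.body()));

        } catch (IOException | IllegalStateException e) {
            System.err.println("Download failed: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            System.err.println("Download interrupted");
        }
    }
}
```

The send method does not throw for 4xx or 5xx responses, which is why the status code is checked explicitly here.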

Reading a web page with URL

URL represents a Uniform Resource Locator, a pointer to a resource on the World Wide Web.

com/zetcode/ReadWebPageEx.java

package com.zetcode;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadWebPageEx {

    public static void main(String[] args) throws IOException {

        var url = new URL("http://webcode.me");
        try (var br = new BufferedReader(new InputStreamReader(url.openStream()))) {

            String line;

            var sb = new StringBuilder();

            while ((line = br.readLine()) != null) {

                sb.append(line);
                sb.append(System.lineSeparator());
            }

            System.out.println(sb);
        }
    }
}

The code example reads the content of a web page.

try (var br = new BufferedReader(new InputStreamReader(url.openStream()))) {

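The openStream method opens a connection to the specified URL and returns an InputStream for reading from that connection. InputStreamReader is a bridge from byte streams to character streams: it reads bytes and decodes them into characters using a charset. Because no charset is passed here, the JVM's default charset is used. The BufferedReader improves performance by buffering the input.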
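
To illustrate why the charset matters, here is a small sketch that decodes the same bytes with two different charsets. It reads from an in-memory stream instead of a URL so that it runs without a network; the class name DecodeStreamEx and the helper readAll are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class DecodeStreamEx {

    // Reads the whole stream line by line, decoding bytes with the given charset.
    // Lines are joined with "\n"; a trailing line break is not preserved.
    static String readAll(InputStream in, Charset charset) {

        try (var br = new BufferedReader(new InputStreamReader(in, charset))) {
            return br.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {

        // \u00e9 is the letter e with an acute accent; it takes two bytes in UTF-8.
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset the bytes were written in.
        System.out.println(readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));

        // Decoded with a mismatched charset: the accented letter becomes two characters.
        System.out.println(readAll(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1));
    }
}
```

With url.openStream() in place of the in-memory stream, passing the page's charset explicitly avoids depending on the platform default.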

var sb = new StringBuilder();

while ((line = br.readLine()) != null) {

    sb.append(line);
    sb.append(System.lineSeparator());
}

The HTML data is read line by line with the readLine method. The source is appended to the StringBuilder.

System.out.println(sb);

Finally, the content of the StringBuilder is printed to the terminal.
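
The openStream method offers no way to set timeouts. A possible variant, sketched below under assumptions, goes through URLConnection so that connect and read timeouts can be configured before the page is read. The class name ReadWithTimeoutEx, the helper open, the timeout values, and the choice of UTF-8 for decoding are all illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class ReadWithTimeoutEx {

    // Prepares a connection with timeouts; nothing is sent until the stream is requested.
    static URLConnection open(String address) {

        try {
            URLConnection con = URI.create(address).toURL().openConnection();
            con.setConnectTimeout(5000); // milliseconds
            con.setReadTimeout(10000);   // milliseconds
            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {

        URLConnection con = open("http://webcode.me");

        try (InputStream in = con.getInputStream()) {

            // readAllBytes (Java 9+) is adequate for a small page; UTF-8 is assumed.
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));

        } catch (IOException e) {
            System.err.println("Download failed: " + e.getMessage());
        }
    }
}
```

The URL is created with URI.create(...).toURL() here because the URL constructors are deprecated in recent JDK releases.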

Reading a web page with JSoup

JSoup is a popular Java HTML parser.

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

We use this Maven dependency.

com/zetcode/ReadWebPageEx2.java

package com.zetcode;

import org.jsoup.Jsoup;
import java.io.IOException;

public class ReadWebPageEx2 {

    public static void main(String[] args) throws IOException {

        String webPage = "http://webcode.me";

        String html = Jsoup.connect(webPage).get().html();

        System.out.println(html);
    }
}

The code example uses JSoup to download and print a small web page.

String html = Jsoup.connect(webPage).get().html();

The connect method connects to the specified web page. The get method issues a GET request. Finally, the html method retrieves the HTML source.

Reading a web page with HtmlCleaner

HtmlCleaner is an open source HTML parser written in Java.

<dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.16</version>
</dependency>

For this example, we use the htmlcleaner Maven dependency.

com/zetcode/ReadWebPageEx3.java

package com.zetcode;

import java.io.IOException;
import java.net.URL;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleHtmlSerializer;
import org.htmlcleaner.TagNode;

public class ReadWebPageEx3 {

    public static void main(String[] args) throws IOException {

        var url = new URL("http://webcode.me");

        var props = new CleanerProperties();
        props.setOmitXmlDeclaration(true);

        var cleaner = new HtmlCleaner(props);
        TagNode node = cleaner.clean(url);

        var htmlSerializer = new SimpleHtmlSerializer(props);
        htmlSerializer.writeToStream(node, System.out);
    }
}

The example uses HtmlCleaner to download a web page.

var props = new CleanerProperties();
props.setOmitXmlDeclaration(true);

In the properties, we choose to omit the XML declaration.

var htmlSerializer = new SimpleHtmlSerializer(props);
htmlSerializer.writeToStream(node, System.out);

SimpleHtmlSerializer creates the resulting HTML without any indenting or compacting.

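Reading a web page with Apache HttpClient

Apache HttpClient is an HTTP/1.1 compliant HTTP agent implementation. It can scrape a web page using the request and response process. An HTTP client implements the client side of the HTTP and HTTPS protocols.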

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.10</version>
</dependency>

We use this Maven dependency for the Apache HTTP client.

com/zetcode/ReadWebPageEx4.java

package com.zetcode;

import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;

public class ReadWebPageEx4 {

    public static void main(String[] args) throws IOException {

        HttpGet request = null;

        try {

            String url = "http://webcode.me";
            HttpClient client = HttpClientBuilder.create().build();
            request = new HttpGet(url);

            request.addHeader("User-Agent", "Apache HTTPClient");
            HttpResponse response = client.execute(request);

            HttpEntity entity = response.getEntity();
            String content = EntityUtils.toString(entity);
            System.out.println(content);

        } finally {

            if (request != null) {

                request.releaseConnection();
            }
        }
    }
}

In the code example, we send a GET HTTP request to the specified web page and receive an HTTP response. From the response, we read the HTML source.

HttpClient client = HttpClientBuilder.create().build();

An HttpClient is built.

request = new HttpGet(url);

HttpGet is the class for the HTTP GET method.

request.addHeader("User-Agent", "Apache HTTPClient");
HttpResponse response = client.execute(request);

We add a User-Agent header to the request, execute the GET method, and receive an HttpResponse.

HttpEntity entity = response.getEntity();
String content = EntityUtils.toString(entity);
System.out.println(content);

From the response, we get the content of the web page.

Reading a web page with Jetty HttpClient

The Jetty project has an HTTP client as well.

<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-client</artifactId>
    <version>9.4.25.v20191220</version>
</dependency>

This is the Maven dependency for the Jetty HTTP client.

com/zetcode/ReadWebPageEx5.java

package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {

            client = new HttpClient();
            client.start();

            String url = "http://webcode.me";

            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {

                client.stop();
            }
        }
    }
}

In the example, we get the HTML source of a web page with the Jetty HTTP client.

client = new HttpClient();
client.start();

An HttpClient is created and started.

ContentResponse res = client.GET(url);

A GET request is issued to the specified URL.

System.out.println(res.getContentAsString());

The content is retrieved from the response with the getContentAsString method.

Reading a web page with HtmlUnit

HtmlUnit is a GUI-less browser for Java programs, commonly used for testing web-based applications.

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.36.0</version>
</dependency>

We use this Maven dependency.

com/zetcode/ReadWebPageEx6.java

package com.zetcode;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;

public class ReadWebPageEx6 {

    public static void main(String[] args) throws IOException {

        try (var webClient = new WebClient()) {

            String url = "http://webcode.me";

            HtmlPage page = webClient.getPage(url);
            WebResponse response = page.getWebResponse();
            String content = response.getContentAsString();

            System.out.println(content);
        }
    }
}

The example downloads a web page and prints it using the HtmlUnit library.

Source

Java HttpClient

In this article we have read a web page in Java with various tools, including HttpClient, URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.

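Author

My name is Jan Bodnar and I am a passionate programmer with many years of programming experience. I have been writing programming articles since 2007. So far, I have written over 1,400 articles and 8 e-books. I have over ten years of experience in teaching programming.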
