Python Requests

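Last modified January 29, 2024

In this article we show how to work with the Python Requests module. We fetch data, post data, stream data, and connect to secure web pages. The examples use an online service, an Nginx server, a Python HTTP server, and a Flask application.

Hypertext Transfer Protocol (HTTP) is an application protocol for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web.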

Python requests

Requests is a simple and elegant Python HTTP library. It provides methods for accessing web resources via HTTP.

$ sudo service nginx start

We run the Nginx web server on localhost. Some of our examples use the nginx server.

Python requests version

The first program prints the version of the Requests library.

version.py

#!/usr/bin/python

import requests

print(requests.__version__)
print(requests.__copyright__)

The program prints the version and the copyright notice of Requests.

$ ./version.py
2.21.0
Copyright 2018 Kenneth Reitz

This is a sample output of the example.

Python requests reading a web page

The get method issues a GET request; it fetches the document identified by the given URL.

read_webpage.py

#!/usr/bin/python

import requests as req

resp = req.get("http://www.webcode.me")

print(resp.text)

The script grabs the content of the www.webcode.me web page.

resp = req.get("http://www.webcode.me")

The get method returns a response object.

print(resp.text)

The text attribute contains the content of the response, decoded to Unicode.

$ ./read_webpage.py
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My html page</title>
</head>
<body>

    <p>
        Today is a beautiful day. We go swimming and fishing.
    </p>

    <p>
         Hello there. How are you?
    </p>

</body>
</html>

This is the output of the read_webpage.py script.

The following program fetches a small web page and strips its HTML tags.

strip_tags.py

#!/usr/bin/python

import requests as req
import re

resp = req.get("http://www.webcode.me")

content = resp.text

stripped = re.sub('<[^<]+?>', '', content)
print(stripped)

The script strips the HTML tags of the www.webcode.me web page.

stripped = re.sub('<[^<]+?>', '', content)

A simple regular expression is used to strip the HTML tags.
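A regular expression is enough for a small, well-formed page, but it can be confused by comments or by quoted attribute values that contain angle brackets. As an alternative sketch (not part of the original script), the standard library's html.parser module can collect only the text nodes. The TextExtractor class and the strip_tags function are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of markup with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

In strip_tags.py, the call would presumably look like stripped = strip_tags(resp.text). Note that this sketch still keeps the text inside script and style elements.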

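HTTP request

An HTTP request is a message sent from the client to the server to retrieve some information or to perform some action.

The request method of Requests creates a new request. Note that the requests module has some higher-level methods, such as get, post, or put, which save us some typing.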

create_request.py

#!/usr/bin/python

import requests as req

resp = req.request(method='GET', url="http://www.webcode.me")
print(resp.text)

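The example creates a GET request and sends it to http://www.webcode.me.

Python requests getting status

The Response object contains the server's response to an HTTP request. Its status_code attribute returns the HTTP status code of the response, such as 200 or 404.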

get_status.py

#!/usr/bin/python

import requests as req

resp = req.get("http://www.webcode.me")
print(resp.status_code)

resp = req.get("http://www.webcode.me/news")
print(resp.status_code)

We perform two HTTP requests with the get method and check the returned status.

$ ./get_status.py
200
404

200 is the standard response for successful HTTP requests, and 404 indicates that the requested resource could not be found.
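To make the meaning of a status code visible in a script, we can map the number to its standard reason phrase. The following is a minimal sketch that uses only the standard library's http.HTTPStatus enumeration; describe_status and is_success are hypothetical helper names, not part of Requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '404 Not Found' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # The code is not registered in the standard library's enumeration.
        return f"{code} Unknown"


def is_success(code):
    """Codes in the 2xx range signal that the request succeeded."""
    return 200 <= code < 300


for code in (200, 404):
    print(describe_status(code), is_success(code))
```

For similar checks, Requests itself offers the ok attribute and the raise_for_status method on the response object.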

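Python requests head method

The head method retrieves the document headers. The headers consist of fields, including date, server, content type, or last modification time.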

head_request.py

#!/usr/bin/python

import requests as req

resp = req.head("http://www.webcode.me")

print("Server: " + resp.headers['server'])
print("Last modified: " + resp.headers['last-modified'])
print("Content type: " + resp.headers['content-type'])

The example prints the server, the last modification time, and the content type of the www.webcode.me web page.

$ ./head_request.py
Server: nginx/1.6.2
Last modified: Sat, 20 Jul 2019 11:49:25 GMT
Content type: text/html

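This is the output of the head_request.py program.

Python requests get method

The get method issues a GET request to the server. The GET method requests a representation of the specified resource.

The httpbin.org is a freely available HTTP request and response service.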

mget.py

#!/usr/bin/python

import requests as req

resp = req.get("https://httpbin.org/get?name=Peter")
print(resp.text)

The script sends a variable with a value to the httpbin.org server. The variable is specified directly in the URL.

$ ./mget.py
{
  "args": {
    "name": "Peter"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  ...
}

mget2.py

#!/usr/bin/python

import requests as req

payload = {'name': 'Peter', 'age': 23}
resp = req.get("https://httpbin.org/get", params=payload)

print(resp.url)
print(resp.text)

The get method takes a params parameter where we can specify the query parameters.

payload = {'name': 'Peter', 'age': 23}

The data is sent in a Python dictionary.

resp = req.get("https://httpbin.org/get", params=payload)

We send a GET request to the httpbin.org site and pass the data, which is specified in the params parameter.

print(resp.url)
print(resp.text)

We print the URL and the response content to the console.

$ ./mget2.py
https://httpbin.org/get?name=Peter&age=23
{
  "args": {
    "age": "23",
    "name": "Peter"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  ...
}
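The params dictionary ends up as a URL-encoded query string appended to the address. The sketch below reproduces that encoding with the standard library's urllib.parse.urlencode, without sending any request; build_url is a hypothetical helper, and the real library handles more cases (such as URLs that already carry a query string).

```python
from urllib.parse import urlencode

payload = {'name': 'Peter', 'age': 23}


def build_url(base, params):
    """Append params to base as a URL-encoded query string."""
    return base + "?" + urlencode(params)


url = build_url("https://httpbin.org/get", payload)
print(url)
```

Values that contain spaces or other reserved characters are escaped by urlencode, which is the main reason to prefer a dictionary over writing the query string by hand.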

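Python requests redirection

Redirection is the process of forwarding one URL to a different URL. The HTTP response status code 301 Moved Permanently is used for permanent URL redirection; 302 Found is used for temporary redirection.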

redirect.py

#!/usr/bin/python

import requests as req

resp = req.get("https://httpbin.org/redirect-to?url=/")

print(resp.status_code)
print(resp.history)
print(resp.url)

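In the example, we issue a GET request to the https://httpbin.org/redirect-to page. This page redirects to another page; the redirect responses are stored in the history attribute of the response.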

$ ./redirect.py
200
[<Response [302]>]
https://httpbin.org/

The GET request to https://httpbin.org/redirect-to was redirected with a 302 to https://httpbin.org/.

In the second example, we do not follow the redirect.

redirect2.py

#!/usr/bin/python

import requests as req

resp = req.get("https://httpbin.org/redirect-to?url=/", allow_redirects=False)

print(resp.status_code)
print(resp.url)

The allow_redirects parameter specifies whether the redirect is followed; redirects are followed by default.

$ ./redirect2.py
302
https://httpbin.org/redirect-to?url=/

Redirection with nginx

In the next example, we show how to set up a page redirect in the nginx server.

location = /oldpage.html {

        return 301 /newpage.html;
}

Add these lines to the nginx configuration file, which is located at /etc/nginx/sites-available/default on Debian.

$ sudo service nginx restart

After the file has been edited, we must restart nginx to apply the changes.

oldpage.html

<!DOCTYPE html>
<html>
<head>
<title>Old page</title>
</head>
<body>
<p>
This is old page
</p>
</body>
</html>

This is the oldpage.html file located in the nginx document root.

newpage.html

<!DOCTYPE html>
<html>
<head>
<title>New page</title>
</head>
<body>
<p>
This is a new page
</p>
</body>
</html>

This is the newpage.html file.

redirect3.py

#!/usr/bin/python

import requests as req

resp = req.get("http://localhost/oldpage.html")

print(resp.status_code)
print(resp.history)
print(resp.url)

print(resp.text)

The script accesses the old page and follows the redirect. As we already mentioned, Requests follows redirects by default.

$ ./redirect3.py
200
(<Response [301]>,)
http://localhost/newpage.html
<!DOCTYPE html>
<html>
<head>
<title>New page</title>
</head>
<body>
<p>
This is a new page
</p>
</body>
</html>

$ sudo tail -2 /var/log/nginx/access.log
127.0.0.1 - - [21/Jul/2019:07:41:27 -0400] "GET /oldpage.html HTTP/1.1" 301 184
"-" "python-requests/2.4.3 CPython/3.4.2 Linux/3.16.0-4-amd64"
127.0.0.1 - - [21/Jul/2019:07:41:27 -0400] "GET /newpage.html HTTP/1.1" 200 109
"-" "python-requests/2.4.3 CPython/3.4.2 Linux/3.16.0-4-amd64"

From the access.log file we can see that the request was redirected to a new file name. The communication consisted of two GET requests.
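Between the two GET requests, the client has to work out the address of the second one from the Location header of the 301 response. That header may carry an absolute URL or only a path; in the latter case it is resolved against the URL that was requested. A minimal sketch of this step with the standard library's urllib.parse.urljoin follows; redirect_target is a hypothetical helper, and the URLs are the ones assumed for the local nginx setup.

```python
from urllib.parse import urljoin


def redirect_target(request_url, location):
    """Resolve a Location header value against the URL that was requested."""
    return urljoin(request_url, location)


# A path-only Location is resolved relative to the requested URL.
print(redirect_target("http://localhost/oldpage.html", "/newpage.html"))

# An absolute Location replaces the requested URL entirely.
print(redirect_target("http://localhost/oldpage.html", "http://example.com/b"))
```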

User agent

In this section, we specify the name of the user agent. We create our own Python HTTP server.

http_server.py

#!/usr/bin/python

from http.server import BaseHTTPRequestHandler, HTTPServer

class MyHandler(BaseHTTPRequestHandler):

    def do_GET(self):

        message = "Hello there"

        self.send_response(200)

        if self.path == '/agent':

            message = self.headers['user-agent']

        self.send_header('Content-type', 'text/html')
        self.end_headers()

        self.wfile.write(bytes(message, "utf8"))

        return


def main():

    print('starting server on port 8081...')

    server_address = ('127.0.0.1', 8081)
    httpd = HTTPServer(server_address, MyHandler)
    httpd.serve_forever()

main()

We have a simple Python HTTP server.

if self.path == '/agent':

    message = self.headers['user-agent']

If the path equals '/agent', we return the user agent that the client sent.
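The lookup self.headers['user-agent'] works regardless of how the client capitalized the header name, because the headers object of the request handler is an http.client.HTTPMessage, whose lookups are case-insensitive. The sketch below shows this behavior in isolation by parsing a hand-written header block; the raw bytes are made up for the illustration.

```python
import io
from http.client import parse_headers

# A made-up header block, terminated by an empty line as in a real request.
raw = b"Host: localhost:8081\r\nUser-Agent: Python script\r\n\r\n"
headers = parse_headers(io.BytesIO(raw))

print(headers['user-agent'])   # lookup ignores the case of the name
print(headers['USER-AGENT'])
print(headers['x-missing'])    # a missing header yields None, not an error
```

Because a missing header yields None, the server above would fail in bytes(message, "utf8") if a client sent no User-Agent header at all; a fallback value would avoid that.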

user_agent.py

#!/usr/bin/python

import requests as req

headers = {'user-agent': 'Python script'}

resp = req.get("http://localhost:8081/agent", headers=headers)
print(resp.text)

The script creates a simple GET request to our Python HTTP server. To add HTTP headers to a request, we pass a dictionary to the headers parameter.

headers = {'user-agent': 'Python script'}

The header values are placed in a Python dictionary.

resp = req.get("http://localhost:8081/agent", headers=headers)

The values are passed to the headers parameter.

$ ./http_server.py
starting server on port 8081...

First, we start the server.

$ ./user_agent.py
Python script

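Then we run the script. The server responded with the name of the agent that we sent with the request.

Python requests post value

The post method dispatches a POST request on the given URL, providing the key/value pairs for the fill-in form content.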

post_value.py

#!/usr/bin/python

import requests as req

data = {'name': 'Peter'}

resp = req.post("https://httpbin.org/post", data)
print(resp.text)

The script sends a request with a name key having the value Peter. The POST request is issued with the post method.

$ ./post_value.py
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Peter"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "json": null,
  ...
}

This is the output of the post_value.py script.
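The Content-Type and Content-Length headers in the output reflect how the dictionary was encoded: as a URL-encoded form body. The following sketch rebuilds that body with the standard library to show where the length of 10 comes from; it sends nothing over the network and only approximates what the library does internally.

```python
from urllib.parse import urlencode, parse_qs

data = {'name': 'Peter'}

# Form data is serialized as key=value pairs joined by '&'.
body = urlencode(data)

print(body)
print(len(body))

# The receiving side decodes the body back into keys and lists of values.
print(parse_qs(body))
```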

Python requests upload image

In the following example, we upload an image. We create a web application with Flask.

app.py

#!/usr/bin/python

import os
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def home():
    return 'This is home page'

@app.route("/upload", methods=['POST'])
def handleFileUpload():

    msg = 'failed to upload image'

    if 'image' in request.files:

        photo = request.files['image']

        if photo.filename != '':

            photo.save(os.path.join('.', photo.filename))
            msg = 'image uploaded successfully'

    return msg

if __name__ == '__main__':
    app.run()

This is a simple application with two endpoints. The /upload endpoint checks whether an image was sent and saves it to the current directory.

upload_file.py

#!/usr/bin/python

import requests as req

url = 'http://localhost:5000/upload'

with open('sid.jpg', 'rb') as f:

    files = {'image': f}

    r = req.post(url, files=files)
    print(r.text)

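We send the image to the Flask application. The file is specified in the files parameter of the post method.

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write and for machines to parse and generate.

JSON data is a collection of key/value pairs; in Python, it is represented by a dictionary.

Read JSON

In the first example, we read JSON data from a PHP script.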

send_json.php

<?php

$data = [ 'name' => 'Jane', 'age' => 17 ];
header('Content-Type: application/json');

echo json_encode($data);

This PHP script sends JSON data. It uses the json_encode function to do the job.

read_json.py

#!/usr/bin/python

import requests as req

resp = req.get("http://localhost/send_json.php")
print(resp.json())

The read_json.py script reads the JSON data sent by the PHP script.

print(resp.json())

The json method decodes the JSON content of the response and returns it as a Python object.

$ ./read_json.py
{'age': 17, 'name': 'Jane'}
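Decoding fails with an exception when the body is not valid JSON, for instance when the server returns an HTML error page instead. The sketch below illustrates a decode-and-guard pattern with the standard json module on hard-coded strings; parse_json is a hypothetical helper, and the strings stand in for response bodies.

```python
import json


def parse_json(text):
    """Decode text as JSON; return None when it is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return None


print(parse_json('{"name": "Jane", "age": 17}'))
print(parse_json('<html>not json</html>'))
```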

Send JSON

Next, we send JSON data to a PHP script from a Python script.

parse_json.php

<?php

$data = file_get_contents("php://input");

$json = json_decode($data , true);

foreach ($json as $key => $value) {

    if (!is_array($value)) {
        echo "The $key is $value\n";
    } else {
        foreach ($value as $key => $val) {
            echo "The $key is $val\n";
        }
    }
}

This PHP script reads the JSON data and sends back a message with the parsed values.

send_json.py

#!/usr/bin/python

import requests as req

data = {'name': 'Jane', 'age': 17}

resp = req.post("http://localhost/parse_json.php", json=data)
print(resp.text)

The script sends JSON data to the PHP application and reads its response.

data = {'name': 'Jane', 'age': 17}

This is the data to be sent.

resp = req.post("http://localhost/parse_json.php", json=data)

The dictionary holding the JSON data is passed to the json parameter.

$ ./send_json.py
The name is Jane
The age is 17

This is the example output.
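Passing a dictionary to the json parameter is roughly equivalent to serializing it ourselves and declaring the content type. The following sketch performs that serialization with the standard json module; it only builds the body and headers and does not send a request.

```python
import json

data = {'name': 'Jane', 'age': 17}

# Serialize the dictionary to JSON text and encode it for the wire.
body = json.dumps(data).encode('utf-8')

# The receiver learns from this header how to interpret the body.
headers = {'Content-Type': 'application/json'}

print(body)
print(headers)
```

With Requests, such a hand-built body could be sent with req.post(url, data=body, headers=headers); the json parameter simply spares us these steps.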

Retrieving definitions from a dictionary

In the following example, we look up the definition of a term on www.dictionary.com. To parse the HTML, we use the lxml module.

$ pip install lxml

We install the lxml module with the pip tool.

get_term.py

#!/usr/bin/python

import requests as req
from lxml import html
import textwrap

term = "dog"

resp = req.get("http://www.dictionary.com/browse/" + term)
root = html.fromstring(resp.content)

for sel in root.xpath("//span[contains(@class, 'one-click-content')]"):

    if sel.text:

        s = sel.text.strip()

        if (len(s) > 3):

            print(textwrap.fill(s, width=50))

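In this script, we look up the definition of the term dog on www.dictionary.com. The lxml module is used to parse the HTML code.

Note: The tags that contain the definitions may change at any time. In that case we would need to adapt the script.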

from lxml import html

The lxml module can be used to parse HTML.

import textwrap

The textwrap module is used to wrap text to a given width.

resp = req.get("http://www.dictionary.com/browse/" + term)

To perform the search, we append the term to the end of the URL.

root = html.fromstring(resp.content)

We need to use resp.content rather than resp.text because html.fromstring implicitly expects bytes as input. (resp.content returns the content as bytes, whereas resp.text returns it as Unicode text.)

for sel in root.xpath("//span[contains(@class, 'one-click-content')]"):

    if sel.text:

        s = sel.text.strip()

        if (len(s) > 3):

            print(textwrap.fill(s, width=50))

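We parse the content. The main definitions are located inside span tags whose class attribute contains one-click-content. We improve the formatting by stripping surrounding whitespace and skipping stray fragments of three characters or fewer. Each output line is at most 50 characters wide. Note that such parsing is subject to change.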
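The clean-up step can be tried without contacting the website. The sketch below applies the same strip, length check, and textwrap.fill call to a hard-coded sample string; format_definition is a hypothetical helper, and the sample text merely imitates a definition.

```python
import textwrap


def format_definition(text, width=50):
    """Strip text and wrap it to width; return None for stray fragments."""
    s = text.strip()
    if len(s) <= 3:
        # Fragments such as a lone comma are not worth printing.
        return None
    return textwrap.fill(s, width=width)


sample = ("   any carnivore of the dog family Canidae, having prominent "
          "canine teeth and, in the wild state, a long and slender muzzle   ")

wrapped = format_definition(sample)
print(wrapped)
```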

$ ./get_term.py
a domesticated canid,
any carnivore of the dog family Canidae, having
prominent canine teeth and, in the wild state, a
long and slender muzzle, a deep-chested muscular
body, a bushy tail, and large, erect ears.
...

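This is a partial list of the definitions.

Python requests streaming requests

Streaming is transmitting a continuous flow of audio and/or video data while earlier parts are being used. The Response.iter_lines method iterates over the response data one line at a time. Setting stream=True on the request avoids reading the whole content of a large response into memory at once.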

streaming.py

#!/usr/bin/python

import requests as req

url = "https://docs.oracle.com/javase/specs/jls/se8/jls8.pdf"

local_filename = url.split('/')[-1]

r = req.get(url, stream=True)

with open(local_filename, 'wb') as f:

    for chunk in r.iter_content(chunk_size=1024):

        f.write(chunk)

The example streams a PDF file and writes it to disk.

r = req.get(url, stream=True)

When stream is set to True for a request, Requests cannot release the connection back to the pool unless we consume all the data or call Response.close.

with open(local_filename, 'wb') as f:

    for chunk in r.iter_content(chunk_size=1024):

        f.write(chunk)

We read the resource in 1 KB chunks and write them to a local file.
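The chunked copy can be illustrated without a network connection. In the sketch below, in-memory byte streams stand in for the response body and the local file, and iter_chunks is a hypothetical generator that mimics the role of iter_content; only one chunk is held in memory at a time.

```python
import io


def iter_chunks(stream, chunk_size=1024):
    """Yield successive blocks of at most chunk_size bytes from stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # an empty read signals the end of the data
        yield chunk


source = io.BytesIO(b"x" * 2500)  # stands in for the response body
target = io.BytesIO()             # stands in for the local file

sizes = []
for chunk in iter_chunks(source):
    target.write(chunk)
    sizes.append(len(chunk))

print(sizes)
```

The last chunk is usually shorter than the others, since the size of the resource is rarely a multiple of the chunk size.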

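Python requests credentials

The auth parameter provides basic HTTP authentication; it takes a tuple of a user name and a password to be used for a realm. A security realm is a mechanism used for protecting web application resources.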

$ sudo apt-get install apache2-utils
$ sudo htpasswd -c /etc/nginx/.htpasswd user7
New password:
Re-type new password:
Adding password for user user7

We use the htpasswd tool to create a user name and a password for basic HTTP authentication.

location /secure {

        auth_basic "Restricted Area";
        auth_basic_user_file /etc/nginx/.htpasswd;
}

In the nginx /etc/nginx/sites-available/default configuration file, we create a secured page. The name of the realm is "Restricted Area".

index.html

<!DOCTYPE html>
<html lang="en">
<head>
<title>Secure page</title>
</head>

<body>

<p>
This is a secure page.
</p>

</body>

</html>

In the /usr/share/nginx/html/secure directory, we have this HTML file.

credentials.py

#!/usr/bin/python

import requests as req

user = 'user7'
passwd = '7user'

resp = req.get("http://localhost/secure/", auth=(user, passwd))
print(resp.text)

The script connects to the secured web page; it provides the user name and the password necessary to access the page.

$ ./credentials.py
<!DOCTYPE html>
<html lang="en">
<head>
<title>Secure page</title>
</head>

<body>

<p>
This is a secure page.
</p>

</body>

</html>

With the right credentials, the credentials.py script returns the secured page.
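With basic authentication, the user name and password travel in an Authorization header as a Base64-encoded user:password pair. The sketch below builds and decodes such a header value with the standard library; basic_auth_header and parse_basic_auth are hypothetical helpers, and the UTF-8 encoding is an assumption made for the illustration.

```python
import base64


def basic_auth_header(user, passwd):
    """Build the value of the Authorization header for HTTP Basic auth."""
    pair = f"{user}:{passwd}".encode("utf-8")  # encoding is an assumption
    return "Basic " + base64.b64encode(pair).decode("ascii")


def parse_basic_auth(value):
    """Recover (user, passwd) from a Basic Authorization header value."""
    scheme, token = value.split(" ", 1)
    if scheme != "Basic":
        raise ValueError("not a Basic authorization value")
    user, _, passwd = base64.b64decode(token).decode("utf-8").partition(":")
    return user, passwd


header = basic_auth_header('user7', '7user')
print(header)
print(parse_basic_auth(header))
```

Since the credentials are only encoded, not encrypted, anyone who can read the traffic can recover them; basic authentication is therefore best combined with HTTPS.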

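Source

Python requests documentation

In this article we have worked with the Python Requests module.

Author

My name is Jan Bodnar, and I am a passionate programmer with extensive programming experience. I have been writing programming articles since 2007. To date, I have authored over 1,400 articles and 8 e-books. I have more than ten years of experience in teaching programming.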
