HTTP for Humans: A Deep Dive into Requests

These are informal notes written while studying Requests.

Python Review

bytearray is an array of bytes whose contents are mutable; each element is an integer in [0, 256). Its signature is bytearray([source[, encoding[, errors]]]), where the source argument is handled as follows:
- If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().
- If it is an integer, the array will have that size and will be initialized with null bytes.
- If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.
- If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.
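bytearray is a mutable sequence, so it supports most list operations.
(Note: images go in source/images and are copied to public/images; in Markdown they must be referenced with the absolute path /images/xxx.jpg to be displayed.)
bytes is the immutable counterpart of bytearray. A bytes object can be created directly from a literal: b'' or B''.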
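The four kinds of source and the mutable/immutable split can be sketched as follows. This is a minimal illustration using only built-ins; the variable names are made up for the example.

```python
# bytearray is mutable; bytes is its immutable counterpart.
buf = bytearray("hello", encoding="utf-8")  # from a string: encoding is required
zeros = bytearray(3)                         # from an integer: three null bytes
from_ints = bytearray([72, 105])             # from an iterable of ints in [0, 256)

buf[0] = ord("H")      # item assignment works, as with a list
buf.extend(b"!")       # so do list-style methods such as extend/append
from_ints.append(33)

frozen = bytes(from_ints)  # immutable copy of the current contents
try:
    frozen[0] = 0
except TypeError as exc:
    print("bytes is immutable:", exc)

print(buf, zeros, from_ints, frozen)
```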

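Getting Started

The requests.get(), requests.post() and similar methods we use most often are convenience APIs; the arguments passed to them are used to construct a Request object.

A request exists in two forms. One is the plain Request object; the other is the PreparedRequest produced by prepare(), which is the object that can actually be sent. Its body, headers, cookies and so on have already been converted into HTTP-compliant form, ready to go over the network.

A PreparedRequest is only the content to be sent; it cannot send itself. Sending is provided by the session object: session.send() calls adapter.send(), which actually sends the request.

Once the request has been sent and the server has responded, the adapter parses the response into a Response object, which is returned from requests.get(), requests.post() and the other methods. The caller reads whatever it needs from the Response.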

Common Usage

The ordinary approach

import requests

resp = requests.get('http://www.baidu.com')
print(resp.content)

The reckless approach

s = requests.Session()
s.headers.update({'laozizuidiao': 'laozizuidiao'})
s.get('http://www.gov.cn')
s.get('http://www.beijing.gov.cn')

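The sophisticated approach

from requests import Request, Session

# url, data, header and the send() options below are placeholders supplied by the caller
s = Session()
req = Request('GET', url,
    data=data,
    headers=header
)
prepped = req.prepare()
# prepped = s.prepare_request(req) would also merge the session's existing state into this request
resp = s.send(prepped,
    stream=stream,
    verify=verify,
    proxies=proxies,
    cert=cert,
    timeout=timeout
)

print(resp.content)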
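The difference between req.prepare() and s.prepare_request(req) can be checked without sending anything. The sketch below assumes a made-up header name (X-Session) and an example.com URL purely for illustration.

```python
from requests import Request, Session

s = Session()
s.headers.update({"X-Session": "yes"})  # hypothetical session-level header

req = Request("GET", "http://example.com/")

plain = req.prepare()             # knows nothing about the session
merged = s.prepare_request(req)   # session headers and cookies are merged in

print("X-Session" in plain.headers)
print(merged.headers["X-Session"])
print("User-Agent" in merged.headers)  # the session's default headers come along too
```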

Below we go through the components of Requests one by one.

Request

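Request is really a class that does very little by itself, and I still don't quite understand why it was designed this way.

Request inherits from RequestHooksMixin and has only one method of its own, prepare(). That method creates a new PreparedRequest object and passes its own parameters to PreparedRequest.prepare(), producing an object that is ready to be sent.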
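A small offline sketch of what prepare() hands back. The URL, form field and header are invented for the example; nothing is sent over the network.

```python
from requests import Request

req = Request(
    "POST",
    "http://example.com/login",      # placeholder URL
    data={"user": "alice"},          # placeholder form data
    headers={"X-Demo": "1"},
)
prepped = req.prepare()

print(type(prepped).__name__)            # the prepared form of the request
print(prepped.method, prepped.url)
print(prepped.body)                      # form data encoded for the wire
print(prepped.headers["Content-Type"])   # added while preparing the body
```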

The APIs we call directly all begin by constructing this object from the arguments we pass in.

Below is a look at several commonly used parameters: TODO

params

data

headers

cookies

allow_redirects

auth

json

stream

PreparedRequest

PreparedRequest is the most important class in the request process, and the most important parts of the class are its prepare_xxx methods.
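As a rough illustration, a few of the prepare_xxx steps can be called by hand in the same order prepare() uses. The URL, query parameter and JSON payload are assumptions for the example, and normally prepare() runs these steps for you.

```python
import json
from requests import PreparedRequest

p = PreparedRequest()
p.prepare_method("post")                                      # normalised to upper case
p.prepare_url("http://example.com/search", {"q": "requests"}) # params become the query string
p.prepare_headers({"X-Demo": "1"})                            # must run before prepare_body
p.prepare_body(data=None, files=None, json={"page": 1})       # serialises JSON, sets Content-Type

print(p.method, p.url)
print(p.headers["Content-Type"])
print(json.loads(p.body))
```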

TODO

Response

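❓ A Response is actually produced by the adapter. What is an adapter?

Below are the commonly used attributes of Response:

.status_code is the status code of the server's response

.reason is the reason phrase

.url TODO

.encoding

.headers is a CaseInsensitiveDict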
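A quick sketch of what case-insensitive means in practice, using requests.structures.CaseInsensitiveDict directly. The header values are placeholders.

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({"Content-Type": "text/html"})
headers["X-Trace"] = "1"

print(headers["content-type"])      # lookup ignores case
print("CONTENT-TYPE" in headers)    # so does membership testing
print(list(headers.keys()))         # the original spelling of each key is kept
```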

.cookies is initialized to cookiejar_from_dict({}); that function is defined as:

def cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True):
    """Returns a CookieJar from a key/value dictionary.

    :param cookie_dict: Dict of key/values to insert into CookieJar.
    :param cookiejar: (optional) A cookiejar to add the cookies to.
    :param overwrite: (optional) If False, will not replace cookies
        already in the jar with new ones.
    """

In other words, it adds a dict to a cookiejar; if cookiejar is None, a new empty RequestsCookieJar is created.
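A minimal sketch of both behaviours, creating a fresh jar and adding to an existing one with overwrite=False. The cookie names and values are invented.

```python
from requests.cookies import cookiejar_from_dict

# No jar given: a new RequestsCookieJar is created.
jar = cookiejar_from_dict({"sid": "abc"})
print(type(jar).__name__)

# Existing jar, overwrite=False: "sid" is kept, only the new name is added.
jar = cookiejar_from_dict({"sid": "new", "lang": "en"}, cookiejar=jar, overwrite=False)
print(jar.get_dict())
```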

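.request is the PreparedRequest object that was sent

.elapsed TODO

.content is the raw, undecoded byte string, whereas .text decodes content into a str using the encoding stored in .encoding. If you change .encoding by hand, reading .text again decodes the content again with the new encoding.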
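The relationship can be tried offline by building a Response by hand. Note the assumption: the sketch sets the private _content attribute to stand in for a downloaded body, which is a testing trick rather than something normal code should do.

```python
import requests

raw_bytes = "你好".encode("gbk")

resp = requests.Response()
resp._content = raw_bytes   # assumption: fake an already-downloaded body via a private attribute

resp.encoding = "gbk"
decoded_ok = resp.text      # decoded with the right codec

resp.encoding = "latin-1"
decoded_wrong = resp.text   # same bytes, decoded again with the new codec

print(decoded_ok, decoded_wrong, resp.content == raw_bytes)
```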

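By default the response body is downloaded immediately after the request. With stream=True, the body is not downloaded until Response.content is accessed. However, with stream=True Requests cannot automatically release the connection back to the pool unless all of the data has been consumed or Response.close() is called explicitly.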
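One way to make sure the connection is released is to use the response as a context manager. The helper below is a sketch: save_stream, its parameters and the default chunk size are names chosen for this example, not part of Requests.

```python
import requests


def save_stream(url, path, chunk_size=8192, timeout=10):
    """Download url to path in chunks; hypothetical helper for illustration."""
    # stream=True defers the body download; leaving the with-block closes
    # the response, so the connection is released even on an early exit.
    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        written = 0
        with open(path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
                written += len(chunk)
    return written
```

Calling save_stream(some_url, "out.bin") returns the number of bytes written.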

.history is a list TODO

The core of Response is .raw. iter_content() reads data from raw piece by piece; iter_lines(), .content and __iter__() all depend directly on iter_content(); .text depends on .content, and .json() depends on .text.
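This dependency chain can be observed offline by giving a hand-built Response a file-like .raw. The sketch assumes an io.BytesIO object is an acceptable stand-in for urllib3's HTTPResponse, and make_response is a helper invented for the example.

```python
import io
import requests

BODY = b'{"name": "requests", "ok": true}'


def make_response(body):
    """Build a Response by hand; BytesIO stands in for the real raw stream."""
    resp = requests.Response()
    resp.status_code = 200
    resp.encoding = "utf-8"
    resp.raw = io.BytesIO(body)
    return resp


# iter_content() pulls fixed-size chunks out of .raw
chunks = list(make_response(BODY).iter_content(chunk_size=8))

# .content joins those chunks, .text decodes .content, .json() parses .text
resp = make_response(BODY)
content, text, data = resp.content, resp.text, resp.json()

# iter_lines() is built on iter_content() as well
lines = list(make_response(b"a\nb\n").iter_lines())

print(chunks[0], content == BODY, text, data, lines)
```

A fresh Response is built for each step because the raw stream can only be read once.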

Session

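This is the main event and deserves a thorough treatment. TODO

Parameters given at the method level (cookies, for example) take precedence over session-level ones, but parameters given at the method level are not kept for later requests.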
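Both halves of that rule can be checked without a network call by preparing a request through the session. The header name, cookie name and URL below are made up for the sketch.

```python
import requests

s = requests.Session()
s.headers["X-Token"] = "session"   # session-level header
s.cookies.set("sid", "session")    # session-level cookie

req = requests.Request(
    "GET",
    "http://example.com/",
    headers={"X-Token": "request"},   # method-level values for the same names
    cookies={"sid": "request"},
)
prepped = s.prepare_request(req)

print(prepped.headers["X-Token"])                   # method-level value wins
print(prepped.headers["Cookie"])
print(s.headers["X-Token"], s.cookies.get("sid"))   # the session itself is unchanged
```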

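The RequestsCookieJar that Requests uses by default inherits from cookielib.CookieJar; in Python 3, cookielib has been moved to http.cookiejar.

To fit the interface of http.cookiejar.CookieJar, requests.cookies also uses MockRequest and MockResponse. The http.cookiejar code is quite complex, and I really can't follow it.

RequestsCookieJar gives http.cookiejar.CookieJar a dict-style access interface. All Requests code works seamlessly with subclasses of cookielib.CookieJar, such as LWPCookieJar and FileCookieJar.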
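A short sketch of the dict-style interface on top of a regular CookieJar. Cookie names, values and the domain are placeholders.

```python
from http.cookiejar import CookieJar
from requests.cookies import RequestsCookieJar

jar = RequestsCookieJar()
jar.set("sid", "abc", domain="example.com", path="/")  # explicit domain and path
jar["theme"] = "dark"                                   # plain dict-style assignment

print(jar["sid"])                    # dict-style lookup
print(jar.get("missing", "n/a"))     # default for unknown names
print(sorted(jar.keys()))
print(isinstance(jar, CookieJar))    # still a real http.cookiejar.CookieJar
```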

CookiePolicy

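This class mainly governs which cookies are accepted and which are sent, i.e. it makes sure the right cookies go to the matching domain, and the same for the reverse direction.

DefaultCookiePolicy

Implements the CookiePolicy interface.

Cookie

Can be thought of as a single cookie record. Are there several versions of cookies? Netscape and RFC 2965? The corresponding headers are Set-Cookie and Set-Cookie2 respectively.

CookieJar

A collection of many cookie records. It is the main object we work with and has a series of methods that support finer-grained operations. Cookie handling in Requests relies mainly on this class.

FileCookieJar

This class inherits from CookieJar. A CookieJar lives its whole life in memory, whereas subclasses of FileCookieJar can persist data; the class defines three interfaces: save, load and revert.

LWPCookieJar

Response.cookies is a RequestsCookieJar by default. Since a Response is rarely constructed by hand and is normally built by HTTPAdapter.build_response(), Response.cookies is practically always a RequestsCookieJar, which is convenient for users. Session.cookies, on the other hand, can be replaced freely by the user; a jar such as LWPCookieJar, which already implements load(), save() and so on, is a convenient choice.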
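Swapping in a persistent jar might look like the sketch below. The file location, cookie name and domain are assumptions; ignore_discard=True is passed because cookies without an expiry are otherwise skipped when saving and loading.

```python
import os
import tempfile
from http.cookiejar import LWPCookieJar

import requests
from requests.cookies import create_cookie

path = os.path.join(tempfile.mkdtemp(), "cookies.txt")  # placeholder location

s = requests.Session()
s.cookies = LWPCookieJar(path)   # replace the default RequestsCookieJar
s.cookies.set_cookie(create_cookie("sid", "abc", domain="example.com"))
s.cookies.save(ignore_discard=True)

# Later, e.g. in another process: load the saved cookies back.
restored = LWPCookieJar(path)
restored.load(ignore_discard=True)
print([(c.name, c.value, c.domain) for c in restored])
```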

MozillaCookieJar

Auth

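Every API has an auth parameter, which can be either a (user, pass) 2-tuple or a callable.

The callable takes a PreparedRequest object as its argument and can modify the request before it is sent (for example, by adding an Authorization header). requests.auth provides two commonly used authentication handlers, HTTPBasicAuth and HTTPDigestAuth. Other custom authentication mechanisms usually inherit from the requests.auth.AuthBase class; that class does nothing by itself, but it makes the intent clearer.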
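Both forms can be exercised offline through prepare(). TokenAuth and its Bearer scheme are invented for this sketch; only AuthBase and the tuple shortcut come from Requests.

```python
import base64

from requests import Request
from requests.auth import AuthBase


class TokenAuth(AuthBase):
    """Hypothetical auth handler that attaches a bearer token."""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        # r is the PreparedRequest; modify it and hand it back.
        r.headers["Authorization"] = "Bearer " + self.token
        return r


# A 2-tuple is shorthand for HTTP Basic authentication.
basic = Request("GET", "http://example.com/", auth=("user", "pass")).prepare()
custom = Request("GET", "http://example.com/", auth=TokenAuth("secret")).prepare()

expected_basic = "Basic " + base64.b64encode(b"user:pass").decode("ascii")
print(basic.headers["Authorization"] == expected_basic)
print(custom.headers["Authorization"])
```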

Proxy

I haven't figured out how to use this myself yet.

Adapter

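A very important base class that uses urllib3 to provide the sending functionality. There are many concepts here that I don't understand yet, so more on this later. 😄

In Python 3, urllib and urllib2 were merged into urllib. urllib3 is a third-party library that wraps httplib (http.client in Python 3): when sending, it assembles the HTTP request; when a reply arrives, it parses the response; timeouts also go through httplib's timeout parameter. Requests uses urllib3 internally (older releases bundled their own copy).

urllib3 implementation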

{% plantuml %}
title hello
"requests.get()" -> "requests.request()":
"requests.request()" -> "Session.request()":
"Session.request()" -> "Session.send()":
"Session.send()" -> "HTTPAdapter.send()":
"HTTPAdapter.send()" -> "HTTPAdapter.build_response()": HTTPResponse
{% endplantuml %}

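Session.send() calls HTTPAdapter.send(). HTTPAdapter.send() obtains a connection from urllib3 and then calls its urlopen or HTTPResponse.from_httplib to get an HTTPResponse object. It then calls HTTPAdapter.build_response(), which pulls the various pieces of data out of the HTTPResponse to construct a requests.Response, sets requests.Response.raw to that HTTPResponse object, and adds the cookies taken from the HTTPResponse to Response.cookies. After that, Session.send() extracts the cookies from the HTTPResponse once more and adds them to its own Session.cookies.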
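To see where the adapter sits in this chain without touching the network, a toy adapter can be mounted on a session. CannedAdapter, the fake.test host and the fixed body are all inventions for this sketch; a real HTTPAdapter builds the Response from urllib3's HTTPResponse instead.

```python
import io

import requests
from requests.adapters import BaseAdapter
from requests.structures import CaseInsensitiveDict


class CannedAdapter(BaseAdapter):
    """Hypothetical adapter that answers every request with a fixed body."""

    def send(self, request, stream=False, timeout=None, verify=True,
             cert=None, proxies=None):
        # Roughly the job of build_response(): fill in a Response object.
        resp = requests.Response()
        resp.status_code = 200
        resp.reason = "OK"
        resp.url = request.url
        resp.request = request
        resp.headers = CaseInsensitiveDict({"Content-Type": "text/plain"})
        resp.encoding = "utf-8"
        resp.raw = io.BytesIO(b"hello from the adapter")
        return resp

    def close(self):
        pass


s = requests.Session()
s.mount("http://fake.test/", CannedAdapter())  # longest matching prefix wins

resp = s.get("http://fake.test/hello")         # Session.send() -> CannedAdapter.send()
print(resp.status_code, resp.text)
print(type(s.get_adapter(resp.url)).__name__)
```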

Encoding Issues

Issue #2266 says that methods such as utils.get_encodings_from_content will be moved to requests-toolbelt, so that Requests can focus on HTTP rather than HTML.
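For context, the two helpers involved can be compared directly. This sketch assumes a Requests 2.x release in which requests.utils.get_encodings_from_content is still present (it is deprecated and emits a DeprecationWarning); the header and HTML snippets are made up.

```python
import warnings

from requests.utils import get_encoding_from_headers, get_encodings_from_content

# HTTP level: the charset comes from the Content-Type header.
from_header = get_encoding_from_headers({"content-type": "text/html; charset=gbk"})
# With no charset, text/* falls back to ISO-8859-1.
fallback = get_encoding_from_headers({"content-type": "text/html"})

# HTML level: the charset is sniffed from the markup itself.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    from_html = get_encodings_from_content('<meta charset="utf-8">')

print(from_header, fallback, from_html)
```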
