
Using urllib3 and requests in Python | Study Notes

Updated: 2022-09-07 23:23:02

These are study notes for the Developer Academy course "Python Crawler in Action: Using urllib3 and requests in Python"; they follow the course closely so learners can pick up the material quickly.

Course address: https://developer.aliyun.com/learning/course/555/detail/7644


Using urllib3 and requests in Python


Overview:

1. The urllib3 library

2. The requests library


1. The urllib3 library

(1) Introduction

https://urllib3.readthedocs.io/en/latest/

The standard-library urllib lacks some key features that the third-party urllib3 library provides, such as connection-pool management.

(2) Installation

$ pip install urllib3

import urllib3

# Open a URL and get back a response object
url = 'https://movie.douban.com/'
ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"

# Connection-pool manager
with urllib3.PoolManager() as http:
    response = http.request('GET', url, headers={
        'User-Agent': ua
    })
    print(type(response))
    print(response.status, response.reason)
    print(response.headers)
    print(response.data)
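Besides basic requests, urllib3 also exposes Retry and Timeout policy objects (both appear in its __all__, shown further below). A minimal sketch of constructing them offline; the parameter values here are arbitrary examples, not from the course:

```python
import urllib3
from urllib3.util import Retry, Timeout

# Retry/Timeout are plain configuration objects; building them makes no network call.
retry = Retry(total=3, redirect=2, backoff_factor=0.5)
timeout = Timeout(connect=2.0, read=5.0)
print(retry.total, timeout.connect_timeout)
```

They would then be passed as urllib3.PoolManager(retries=retry, timeout=timeout) so that every request made through the pool inherits the policy.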

Run the following in the console pane below to confirm the installation:

pip install urllib3

Ctrl-clicking urllib3 brings up the following source (its __all__ lists the names we need):

urllib3 - Thread-safe connection pooling and re-using.

import ...

try:  # Python 2.7+
    from logging import NullHandler
except ImportError:
    class NullHandler(logging.Handler):
        def emit(self, record):
            pass

__author__ = 'Andrey Petrov (andrey.petrov@shazow.net)'
__license__ = 'MIT'
__version__ = '1.23'
__all__ = (
    'HTTPConnectionPool',
    'HTTPSConnectionPool',
    'PoolManager',
    'ProxyManager',
    'HTTPResponse',
    'Retry',
    'Timeout',
    'add_stderr_logger',
    'connection_from_url',
    'disable_warnings',
    'encode_multipart_formdata',
    'get_host',
    'make_headers',
    'proxy_from_url',
)

Run an example:

import urllib3

with urllib3.PoolManager() as http:
    http.request()

Ctrl-clicking request in the code above jumps to the following definition:

def request(self, method, url, fields=None, headers=None, **urlopen_kw):
    """
    Make a request using :meth:`urlopen` with the appropriate encoding of
    ``fields`` based on the ``method`` used.

    This is a convenience method that requires the least amount of manual
    effort. It can be used in most situations, while still having the
    option to drop down to more specific methods when necessary, such as
    :meth:`request_encode_url`, :meth:`request_encode_body`,
    or even the lowest level :meth:`urlopen`.
    """
    method = method.upper()

    urlopen_kw['request_url'] = url

    if method in self._encode_url_methods:
        return self.request_encode_url(method, url, fields=fields,
                                       headers=headers,
                                       **urlopen_kw)
    else:
        return self.request_encode_body(method, url, fields=fields,
                                        headers=headers,
                                        **urlopen_kw)

Run an example of another method:

import urllib3

with urllib3.PoolManager() as http:
    http.urlopen()

Ctrl-clicking urlopen in the code above jumps to the following definition:

def urlopen(self, method, url, redirect=True, **kw):
    """
    Same as :meth:`urllib3.connectionpool.HTTPConnectionPool.urlopen`
    with custom cross-host redirect logic and only sends the request-uri
    portion of the ``url``.

    The given ``url`` parameter must be absolute, such that an appropriate
    :class:`urllib3.connectionpool.ConnectionPool` can be chosen for it.
    """
    u = parse_url(url)
    conn = self.connection_from_host(u.host, port=u.port, scheme=u.scheme)

    kw['assert_same_host'] = False
    kw['redirect'] = False

    if 'headers' not in kw:
        kw['headers'] = self.headers.copy()

    if self.proxy is not None and u.scheme == "http":
        response = conn.urlopen(method, url, **kw)
    else:
        response = conn.urlopen(method, u.request_uri, **kw)

    redirect_location = redirect and response.get_redirect_location()
    if not redirect_location:
        return response

    # Support relative URLs for redirecting.
    redirect_location = urljoin(url, redirect_location)

    # RFC 7231, Section 6.4.4
    if response.status == 303:
        method = 'GET'

    retries = kw.get('retries')
    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect)

    # Strip headers marked as unsafe to forward to the redirected location.
    # Check remove_headers_on_redirect to avoid a potential network call within
    # conn.is_same_host() which may use socket.gethostbyname() in the future.
    if (retries.remove_headers_on_redirect
            and not conn.is_same_host(redirect_location)):
        for header in retries.remove_headers_on_redirect:
            kw['headers'].pop(header, None)

    try:
        retries = retries.increment(method, url, response=response, _pool=conn)
    except MaxRetryError:
        if retries.raise_on_redirect:
            raise
        return response

    kw['retries'] = retries
    kw['redirect'] = redirect

    log.info("Redirecting %s -> %s", url, redirect_location)
    return self.urlopen(method, redirect_location, **kw)
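The relative-redirect branch above relies on urljoin from the standard library; a small standalone sketch (the URLs are made-up examples):

```python
from urllib.parse import urljoin

base = 'https://example.com/a/b'
print(urljoin(base, '/login'))  # root-relative: replaces the whole path -> https://example.com/login
print(urljoin(base, 'c'))       # relative: resolved against the base's directory -> https://example.com/a/c
```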

(3) Example:

import urllib3
from urllib.parse import urlencode
from urllib3.response import HTTPResponse

jurl = 'https://movie.douban.com/j/search_subjects'
d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

with urllib3.PoolManager() as http:
    response = http.request('GET', '{}?{}'.format(jurl, urlencode(d)), headers={
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
    })
    print(type(response))

# Annotate the type so the IDE can complete the attributes
response: HTTPResponse = HTTPResponse()
response.status
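The '{}?{}'.format(jurl, urlencode(d)) pattern used above can be checked on its own; urlencode joins the keys with '&' and percent-encodes non-ASCII values such as the tag:

```python
from urllib.parse import urlencode

d = {'type': 'movie', 'tag': '热门', 'page_limit': 10, 'page_start': 10}
qs = urlencode(d)  # '热门' becomes its percent-encoded UTF-8 bytes
print(qs)
```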

Clicking status jumps into the following code. Of these attributes, reason and status are the ones meant for us to use; many others, such as the pool and connection attributes, are internals not intended for our use.

if isinstance(headers, HTTPHeaderDict):
    self.headers = headers
else:
    self.headers = HTTPHeaderDict(headers)
self.status = status
self.version = version
self.reason = reason
self.strict = strict
self.decode_content = decode_content
self.retries = retries
self.enforce_content_length = enforce_content_length
self._decoder = None
self._body = None
self._fp = None
self._original_response = original_response
self._fp_bytes_read = 0
self.msg = msg
self.request_url = request_url

The following code prints status and data:

import urllib3
from urllib.parse import urlencode

jurl = 'https://movie.douban.com/j/search_subjects'
d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

with urllib3.PoolManager() as http:
    response = http.request('GET', '{}?{}'.format(jurl, urlencode(d)), headers={
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
    })
    print(type(response))
    # response: HTTPResponse = HTTPResponse()  # variable-annotation syntax allowed since Python 3.6
    print(response.status)
    print(response.data)

Different response objects carry different attributes, so the best approach is to write out response followed by "." and let the IDE list the attributes it actually has.

In the result, because data was used, bytes are returned. During the visit a connection is taken from the pool, and what comes back over that connection is packed into the response; we only need to care about what the response is and operate on it.
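Since data is bytes, turning the Douban JSON into Python objects takes an explicit decode plus json.loads. A sketch using a stand-in byte string in place of a live response:

```python
import json

raw = b'{"subjects": [{"title": "\xe7\x83\xad\xe9\x97\xa8"}]}'  # stand-in for response.data
obj = json.loads(raw.decode('utf-8'))  # bytes -> str -> dict
print(obj['subjects'][0]['title'])
```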

With urllib a lot of wrapping had to be done by hand, and even the connection-pool manager's methods and attributes are fairly primitive, so for more convenience we turn next to the requests library.


2. The requests library

(1) Introduction

requests is very well encapsulated: it uses urllib3 underneath, but its API is much friendlier, and it is the recommended choice.

import requests

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
url = 'https://movie.douban.com/'

response = requests.request('GET', url, headers={'User-Agent': ua})
with response:
    print(type(response))
    print(response.url)
    print(response.status_code)
    print(response.request.headers)  # request headers
    print(response.headers)  # response headers
    print(response.text[:200])  # HTML content
    with open('o:/movie.html', 'w', encoding='utf-8') as f:
        f.write(response.text)  # save the file for later use

By default requests works through a Session object, so that session state, such as cookies, is preserved across multiple exchanges with the server.

# Use a Session directly
import requests

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
urls = ['https://www.baidu.com/s?wd=maged', 'https://www.baidu.com/s?wd=maged']
session = requests.Session()
with session:
    for url in urls:
        response = session.get(url, headers={'User-Agent': ua})
        with response:
            print(type(response))
            print(response.url)
            print(response.status_code)
            print(response.request.headers)  # request headers
            print(response.cookies)  # cookies in the response
            print(response.text[:20])  # HTML content

(2) Installation and examples

First verify in the output pane below that the library installs successfully, with: pip install requests. requests depends on idna, certifi, urllib3, and chardet.

import requests
from urllib.parse import urlencode

jurl = 'https://movie.douban.com/j/search_subjects'
d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

url = '{}?{}'.format(jurl, urlencode(d))
response = requests.request('GET', url, headers={
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
})

with response:
    print(response.text)  # text is a property whose value is the content already decoded to str (Unicode); the encoding is determined for us, so it is more convenient than raw bytes
    print(response.status_code)
    print(response.url)  # the final location: if the request was redirected, this is the post-redirect URL
    print(response.request)  # the prepared request, defined in the models file; it carries the method, url, headers, _cookies, body, and so on. Preparing a request means assembling it, and all these attributes are available to us

Ctrl-clicking into request shows its implementation; the key lines are:

with sessions.Session() as session:
    return session.request(method=method, url=url, **kwargs)

That is, by default a session mechanism manages the conversation: an id issued by the server within the session can travel back and forth between the two ends across requests.
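The "prepared request" mentioned above can be inspected without touching the network; a minimal sketch using requests' public Request type (the URL and header value are just placeholders):

```python
from requests import Request

req = Request('GET', 'https://movie.douban.com/', headers={'User-Agent': 'test-ua'})
prepared = req.prepare()  # the same assembly step session.request() performs internally
print(prepared.method, prepared.url)
print(prepared.headers['User-Agent'])
```

prepare() returns a PreparedRequest, the object you get back later as response.request.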

Extend the code above as follows and run it:

import requests
from urllib.parse import urlencode

jurl = 'https://movie.douban.com/j/search_subjects'
d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

url = '{}?{}'.format(jurl, urlencode(d))
response = requests.request('GET', url, headers={
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
})

with response:
    print(response.text)
    print(response.status_code)
    print(response.url)
    print(response.headers, '~~~~~')
    print(response.request.headers)

Improving the code one step further:

import requests

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
urls = ['https://www.baidu.com/s?wd=maged', 'https://www.baidu.com/s?wd=maged']
session = requests.Session()
with session:
    for url in urls:
        response = session.get(url, headers={'User-Agent': ua})
        with response:
            print(response.text[:50])
            print('-' * 30)
            print(response.cookies)
            print('-' * 30)
            print(response.headers, '~~~~~')
            print(response.request.headers)

Note: the information in the result changes from run to run; sometimes a Set-Cookie header appears.