
Using urllib3 and requests in Python | Study Notes

Updated: 2022-09-07 23:23:02

These are study notes for the Developer Academy course "Python Crawler in Action: Using urllib3 and requests in Python"; they follow the course closely so learners can pick up the material quickly.

Course address: https://developer.aliyun.com/learning/course/555/detail/7644


Using urllib3 and requests in Python


Overview:

1. The urllib3 library

2. The requests library


1. The urllib3 library

(1) Introduction

https://urllib3.readthedocs.io/en/latest/

The standard-library urllib lacks some key features that the third-party urllib3 library provides, such as connection-pool management.

(2) Installation

$ pip install urllib3

import urllib3

# Open a URL and get back a response object
url = 'https://movie.douban.com/'
ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"

# Connection-pool manager
with urllib3.PoolManager() as http:
    response = http.request('GET', url, headers={
        'User-Agent': ua
    })
    print(type(response))
    print(response.status, response.reason)
    print(response.headers)
    print(response.data)
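Besides basic requests, urllib3 also exposes Retry and Timeout policy objects (both appear in its __all__, shown further below). A minimal sketch of constructing them offline; the parameter values here are arbitrary examples, not from the course:

```python
import urllib3
from urllib3.util import Retry, Timeout

# Retry/Timeout are plain configuration objects; building them makes no network call.
retry = Retry(total=3, redirect=2, backoff_factor=0.5)
timeout = Timeout(connect=2.0, read=5.0)
print(retry.total, timeout.connect_timeout)
```

They would then be passed as urllib3.PoolManager(retries=retry, timeout=timeout) so that every request made through the pool inherits the policy.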

Run the following in the console pane below to confirm the installation:

pip install urllib3

Ctrl-clicking urllib3 brings up the following source (its __all__ lists the names we need):

urllib3 - Thread-safe connection pooling and re-using.

import ...

try:  # Python 2.7+
    from logging import NullHandler
except ImportError:
    class NullHandler(logging.Handler):
        def emit(self, record):
            pass

__author__ = 'Andrey Petrov (andrey.petrov@shazow.net)'
__license__ = 'MIT'
__version__ = '1.23'
__all__ = (
    'HTTPConnectionPool',
    'HTTPSConnectionPool',
    'PoolManager',
    'ProxyManager',
    'HTTPResponse',
    'Retry',
    'Timeout',
    'add_stderr_logger',
    'connection_from_url',
    'disable_warnings',
    'encode_multipart_formdata',
    'get_host',
    'make_headers',
    'proxy_from_url',
)

Run an example:

import urllib3

with urllib3.PoolManager() as http:
    http.request()

Ctrl-clicking request in the code above jumps to the following definition:

def request(self, method, url, fields=None, headers=None, **urlopen_kw):
    """
    Make a request using :meth:`urlopen` with the appropriate encoding of
    ``fields`` based on the ``method`` used.

    This is a convenience method that requires the least amount of manual
    effort. It can be used in most situations, while still having the
    option to drop down to more specific methods when necessary, such as
    :meth:`request_encode_url`, :meth:`request_encode_body`,
    or even the lowest level :meth:`urlopen`.
    """
    method = method.upper()

    urlopen_kw['request_url'] = url

    if method in self._encode_url_methods:
        return self.request_encode_url(method, url, fields=fields,
                                       headers=headers,
                                       **urlopen_kw)
    else:
        return self.request_encode_body(method, url, fields=fields,
                                        headers=headers,
                                        **urlopen_kw)

Run an example of another method:

import urllib3

with urllib3.PoolManager() as http:
    http.urlopen()

Ctrl-clicking urlopen in the code above jumps to the following definition:

def urlopen(self, method, url, redirect=True, **kw):
    """
    Same as :meth:`urllib3.connectionpool.HTTPConnectionPool.urlopen`
    with custom cross-host redirect logic and only sends the request-uri
    portion of the ``url``.

    The given ``url`` parameter must be absolute, such that an appropriate
    :class:`urllib3.connectionpool.ConnectionPool` can be chosen for it.
    """
    u = parse_url(url)
    conn = self.connection_from_host(u.host, port=u.port, scheme=u.scheme)

    kw['assert_same_host'] = False
    kw['redirect'] = False

    if 'headers' not in kw:
        kw['headers'] = self.headers.copy()

    if self.proxy is not None and u.scheme == "http":
        response = conn.urlopen(method, url, **kw)
    else:
        response = conn.urlopen(method, u.request_uri, **kw)

    redirect_location = redirect and response.get_redirect_location()
    if not redirect_location:
        return response

    # Support relative URLs for redirecting.
    redirect_location = urljoin(url, redirect_location)

    # RFC 7231, Section 6.4.4
    if response.status == 303:
        method = 'GET'

    retries = kw.get('retries')
    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect)

    # Strip headers marked as unsafe to forward to the redirected location.
    # Check remove_headers_on_redirect to avoid a potential network call within
    # conn.is_same_host() which may use socket.gethostbyname() in the future.
    if (retries.remove_headers_on_redirect
            and not conn.is_same_host(redirect_location)):
        for header in retries.remove_headers_on_redirect:
            kw['headers'].pop(header, None)

    try:
        retries = retries.increment(method, url, response=response, _pool=conn)
    except MaxRetryError:
        if retries.raise_on_redirect:
            raise
        return response

    kw['retries'] = retries
    kw['redirect'] = redirect

    log.info("Redirecting %s -> %s", url, redirect_location)
    return self.urlopen(method, redirect_location, **kw)
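The relative-redirect branch above relies on urljoin from the standard library; a small standalone sketch (the URLs are made-up examples):

```python
from urllib.parse import urljoin

base = 'https://example.com/a/b'
print(urljoin(base, '/login'))  # root-relative: replaces the whole path -> https://example.com/login
print(urljoin(base, 'c'))       # relative: resolved against the base's directory -> https://example.com/a/c
```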

(3) Example:

import urllib3
from urllib.parse import urlencode
from urllib3.response import HTTPResponse

jurl = 'https://movie.douban.com/j/search_subjects'
d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

with urllib3.PoolManager() as http:
    response = http.request('GET', '{}?{}'.format(jurl, urlencode(d)), headers={
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
    })
    print(type(response))

# Annotate the type so the IDE can complete the attributes
response: HTTPResponse = HTTPResponse()
response.status
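The '{}?{}'.format(jurl, urlencode(d)) pattern used above can be checked on its own; urlencode joins the keys with '&' and percent-encodes non-ASCII values such as the tag:

```python
from urllib.parse import urlencode

d = {'type': 'movie', 'tag': '热门', 'page_limit': 10, 'page_start': 10}
qs = urlencode(d)  # '热门' becomes its percent-encoded UTF-8 bytes
print(qs)
```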

Clicking status jumps into the following code. Of these attributes, reason and status are the ones meant for us to use; many others, such as the pool and connection attributes, are internals not intended for our use.

if isinstance(headers, HTTPHeaderDict):
    self.headers = headers
else:
    self.headers = HTTPHeaderDict(headers)
self.status = status
self.version = version
self.reason = reason
self.strict = strict
self.decode_content = decode_content
self.retries = retries
self.enforce_content_length = enforce_content_length
self._decoder = None
self._body = None
self._fp = None
self._original_response = original_response
self._fp_bytes_read = 0
self.msg = msg
self.request_url = request_url

The following code prints status and data:

import urllib3
from urllib.parse import urlencode

jurl = 'https://movie.douban.com/j/search_subjects'
d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

with urllib3.PoolManager() as http:
    response = http.request('GET', '{}?{}'.format(jurl, urlencode(d)), headers={
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
    })
    print(type(response))
    # response: HTTPResponse = HTTPResponse()  # variable-annotation syntax allowed since Python 3.6
    print(response.status)
    print(response.data)

Different response objects carry different attributes, so the best approach is to write out response followed by "." and let the IDE list the attributes it actually has.

In the result, because data was used, bytes are returned. During the visit a connection is taken from the pool, and what comes back over that connection is packed into the response; we only need to care about what the response is and operate on it.
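Since data is bytes, turning the Douban JSON into Python objects takes an explicit decode plus json.loads. A sketch using a stand-in byte string in place of a live response:

```python
import json

raw = b'{"subjects": [{"title": "\xe7\x83\xad\xe9\x97\xa8"}]}'  # stand-in for response.data
obj = json.loads(raw.decode('utf-8'))  # bytes -> str -> dict
print(obj['subjects'][0]['title'])
```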

With urllib a lot of wrapping had to be done by hand, and even the connection-pool manager's methods and attributes are fairly primitive, so for more convenience we turn next to the requests library.


2. The requests library

(1) Introduction

requests is very well encapsulated: it uses urllib3 underneath, but its API is much friendlier, and it is the recommended choice.

import requests

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
url = 'https://movie.douban.com/'

response = requests.request('GET', url, headers={'User-Agent': ua})
with response:
    print(type(response))
    print(response.url)
    print(response.status_code)
    print(response.request.headers)  # request headers
    print(response.headers)  # response headers
    print(response.text[:200])  # HTML content
    with open('o:/movie.html', 'w', encoding='utf-8') as f:
        f.write(response.text)  # save the file for later use

By default requests works through a Session object, so that session state, such as cookies, is preserved across multiple exchanges with the server.

# Use a Session directly
import requests

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
urls = ['https://www.baidu.com/s?wd=maged', 'https://www.baidu.com/s?wd=maged']
session = requests.Session()
with session:
    for url in urls:
        response = session.get(url, headers={'User-Agent': ua})
        with response:
            print(type(response))
            print(response.url)
            print(response.status_code)
            print(response.request.headers)  # request headers
            print(response.cookies)  # cookies in the response
            print(response.text[:20])  # HTML content

(2) Installation and examples

First verify in the output pane below that the library installs successfully, with: pip install requests. requests depends on idna, certifi, urllib3, and chardet.

import requests
from urllib.parse import urlencode

jurl = 'https://movie.douban.com/j/search_subjects'
d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

url = '{}?{}'.format(jurl, urlencode(d))
response = requests.request('GET', url, headers={
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
})

with response:
    print(response.text)  # text is a property whose value is the content already decoded to str (Unicode); the encoding is determined for us, so it is more convenient than raw bytes
    print(response.status_code)
    print(response.url)  # the final location: if the request was redirected, this is the post-redirect URL
    print(response.request)  # the prepared request, defined in the models file; it carries the method, url, headers, _cookies, body, and so on. Preparing a request means assembling it, and all these attributes are available to us

Ctrl-clicking into request shows its implementation; the key lines are:

with sessions.Session() as session:
    return session.request(method=method, url=url, **kwargs)

That is, by default a session mechanism manages the conversation: an id issued by the server within the session can travel back and forth between the two ends across requests.
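The "prepared request" mentioned above can be inspected without touching the network; a minimal sketch using requests' public Request type (the URL and header value are just placeholders):

```python
from requests import Request

req = Request('GET', 'https://movie.douban.com/', headers={'User-Agent': 'test-ua'})
prepared = req.prepare()  # the same assembly step session.request() performs internally
print(prepared.method, prepared.url)
print(prepared.headers['User-Agent'])
```

prepare() returns a PreparedRequest, the object you get back later as response.request.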

Extend the code above as follows and run it:

import requests
from urllib.parse import urlencode

jurl = 'https://movie.douban.com/j/search_subjects'
d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

url = '{}?{}'.format(jurl, urlencode(d))
response = requests.request('GET', url, headers={
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
})

with response:
    print(response.text)
    print(response.status_code)
    print(response.url)
    print(response.headers, '~~~~~')
    print(response.request.headers)

Improving the code one step further:

import requests

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
urls = ['https://www.baidu.com/s?wd=maged', 'https://www.baidu.com/s?wd=maged']
session = requests.Session()
with session:
    for url in urls:
        response = session.get(url, headers={'User-Agent': ua})
        with response:
            print(response.text[:50])
            print('-' * 30)
            print(response.cookies)
            print('-' * 30)
            print(response.headers, '~~~~~')
            print(response.request.headers)

Note: the information in the result changes from run to run; sometimes a Set-Cookie header appears.