且构网

Scraping HTML with lxml and requests gives a Unicode error

Updated: 2023-02-25 23:39:19

I'm trying to use an HTML scraper like the one provided here. It works fine for the example they provided. However, when I use it with my webpage, I get this error: "Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration." I've tried googling but couldn't find a solution. I'd truly appreciate any help. I'd like to know if there's a way to copy the page as HTML using Python.

Edit:

from lxml import html
import requests

# Fetch the page; page.text is the body already decoded to a str
page = requests.get('http://cancer.sanger.ac.uk/cosmic/gene/analysis?ln=PTEN&ln1=PTEN&start=130&end=140&coords=bp%3AAA&sn=&ss=&hn=&sh=&id=15#')
# This call raises the error quoted above
tree = html.fromstring(page.text)

Thank you.

Short answer: use page.content, not page.text.

From http://lxml.de/parsing.html#python-unicode-strings :

the parsers in lxml.etree can handle unicode strings straight away ... This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding

From http://docs.python-requests.org/en/latest/user/quickstart/#response-content :

Requests will automatically decode content from the server [as r.text]. ... You can also access the response body as bytes [as r.content].

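So you see, both page.text (from requests) and lxml.etree want to decode the UTF-8 bytes to Unicode. But if we let page.text do the decoding, the encoding declaration inside the XML file becomes a lie: it still announces UTF-8-encoded bytes, while what lxml actually receives is an already-decoded string.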
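
To make that concrete, here is a minimal sketch that needs no network access. The sample document is made up for illustration and stands in for the downloaded page; it uses lxml.etree directly, since that is the parser the lxml documentation quoted above talks about.

```python
from lxml import etree

# Hypothetical sample standing in for the downloaded page body (UTF-8 bytes).
raw = '<?xml version="1.0" encoding="utf-8"?><root><p>café</p></root>'.encode("utf-8")

# Roughly what page.text holds: the same document, already decoded to a str.
decoded = raw.decode("utf-8")

try:
    # A str that still carries an encoding declaration is rejected.
    etree.fromstring(decoded)
    error = None
except ValueError as exc:
    error = str(exc)

# The undecoded bytes parse fine; lxml decodes them per the declaration.
root = etree.fromstring(raw)
text = root.findtext("p")

print(error)
print(text)
```

The first call should fail with the same message as in the question, while the second one returns the paragraph text correctly decoded.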

So use page.content, which does no decoding. That way lxml always receives the raw, undecoded bytes and can decode them itself according to the declaration.
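
Applied to the code from the question, the fix might look like the sketch below. The function name fetch_tree, the timeout value, and the raise_for_status() call are my own additions for illustration, not part of the original code.

```python
from lxml import html
import requests


def fetch_tree(url):
    """Download url and parse the undecoded response body with lxml."""
    # The 30-second timeout is an arbitrary choice for this sketch.
    page = requests.get(url, timeout=30)
    # Fail early on HTTP errors instead of parsing an error page.
    page.raise_for_status()
    # page.content is bytes, so lxml does the decoding itself.
    return html.fromstring(page.content)
```

Calling fetch_tree() with the URL from the question should then give a tree you can query as usual, for example with tree.xpath(...).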