且构网

How to parse HTML with entities such as &nbsp; using the built-in library ElementTree in Python 2 & Python 3?

Updated: 2023-08-25 13:40:28



There are times when you want to parse some reasonably well-formed HTML pages but are reluctant to introduce an extra library dependency such as BeautifulSoup or lxml. So you will probably want to try the built-in ElementTree first, because it is part of the standard library, it is fast (implemented in C), and it offers a much better interface (such as limited XPath support) than the basic HTMLParser. Not to mention, HTMLParser has its own limitations.

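ElementTree works until it encounters an entity such as &nbsp;, which is not handled by default. XML itself predefines only five entities (&lt;, &gt;, &amp;, &quot;, and &apos;); every other named entity must be declared in a DTD.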

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''
et = ET.fromstring(html)

Run it on Python 2 or Python 3 and you will see this error:

xml.etree.ElementTree.ParseError: undefined entity: line 7, column 38
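
To see which entities trip the parser, a small check like the following can help. It is only a sketch: try_parse is a hypothetical helper name, not part of ElementTree, and it assumes Python 2.7 or later, where ET.ParseError exists.

```python
import xml.etree.ElementTree as ET


def try_parse(markup):
    """Hypothetical helper: return "ok" or the parser's error message."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as exc:
        return "error: %s" % exc


# &amp; and &lt; are predefined by XML itself, so this parses.
print(try_parse("<p>a &amp; b &lt; c</p>"))

# &nbsp; is defined by HTML, not by XML, so expat rejects it.
print(try_parse("<p>a&nbsp;b</p>"))
```

The first call succeeds and the second reports an undefined entity, which is the same failure as in the traceback above.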

There are some Q&As out there, such as this one and that one. They suggest using ElementTree.XMLParser().parser.UseForeignDTD(True), but I cannot get it to work in Python 3.3 or Python 3.4.

$ python3.3
Python 3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 01:12:57) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> ET.XMLParser().parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'xml.etree.ElementTree.XMLParser' object has no attribute 'parser'
>>> 

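If code has to run on interpreters both with and without the parser attribute, one option is to detect it at runtime. The following is a sketch under assumptions, not a tested recipe for every version: parse_with_nbsp and DOCTYPE are hypothetical names, the first branch assumes a pure-Python XMLParser that exposes both parser and entity, and the fallback declares the entity inside the document itself.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring the one entity this example needs.
DOCTYPE = '<!DOCTYPE html [<!ENTITY nbsp "&#160;">]>'


def parse_with_nbsp(markup):
    """Hypothetical helper: parse markup that may contain &nbsp;."""
    parser = ET.XMLParser()
    expat = getattr(parser, "parser", None)  # absent on the C-accelerated parser
    if expat is not None:
        # Pure-Python XMLParser: make expat tolerate undeclared entities and
        # resolve them through the parser's own entity mapping.
        expat.UseForeignDTD(True)
        parser.entity["nbsp"] = u"\xa0"
        parser.feed(markup)
        return parser.close()
    # C-accelerated XMLParser: declare the entity in the document instead.
    return ET.fromstring(DOCTYPE + markup)


root = parse_with_nbsp("<html><div>a&nbsp;b</div></html>")
print(repr(root.find("div").text))
```

Either branch should leave a non-breaking space (U+00A0) in the element text where &nbsp; appeared.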
Inspired by this post, we can simply prepend a DOCTYPE declaration that defines the missing entities to the incoming raw HTML content, and then ElementTree works out of the box.

This works on Python 2.6, 2.7, 3.3, and 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)
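
If a page uses many different entities, listing them by hand gets tedious. One possible extension, sketched here for Python 3, generates the declarations from the standard library's html.entities.name2codepoint table (htmlentitydefs.name2codepoint on Python 2). The names build_doctype and XML_PREDEFINED are hypothetical, and the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint  # Python 2: htmlentitydefs

# XML already predefines these five, so they are left out of the generated DTD.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def build_doctype(root_tag="html"):
    """Hypothetical helper: DOCTYPE whose internal subset declares HTML 4 entities."""
    declarations = "".join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )
    return "<!DOCTYPE %s [%s]>" % (root_tag, declarations)


html = "<html><div>&copy; 2014&nbsp;Example &mdash; caf&eacute;</div></html>"
root = ET.fromstring(build_doctype() + html)
print(repr(root.find("div").text))
```

Each entity is declared as a numeric character reference, so the parsed text contains the actual Unicode characters rather than the entity names. The generated DOCTYPE is a few kilobytes long, so it may be worth building once and reusing.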