HTML Parsing Made Easy | The Awesome Beautiful Soup!

Updated: 2022-03-02 10:37:07

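1. Foreword

Today let's talk about Beautiful Soup, a powerful HTML parsing library that makes parsing HTML painless. Just how powerful is it? Let's walk through it step by step.

2. What is Beautiful Soup?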

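“Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.”

That is the official description. As I see it, Beautiful Soup is simply a library that helps us extract data from a web page's HTML: it parses the HTML and hands the parsed data back to us. Compared with regular expressions, it tends to be simpler and easier to use.

Beautiful Soup actually has two versions. The one covered here is version 4; there is also a version 3. Why not cover 3? Here is what the official documentation says: “Beautiful Soup 3 is no longer being developed. We recommend using Beautiful Soup 4 in current projects and porting existing code to BS4.” That's right, development has stopped, so there is little reason to learn version 3.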

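3. Installing Beautiful Soup

If you are on a recent version of Debian or Ubuntu, you can install it with the system package manager (for Python 3 the package is named python3-bs4):

$ apt-get install python-bs4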

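Beautiful Soup 4 is published on PyPI, so if you can't use the system package manager you can install it with easy_install or pip. The package is named beautifulsoup4, and the same package works on both Python 2 and Python 3.

$ easy_install beautifulsoup4
$ pip install beautifulsoup4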

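(PyPI also has a package named BeautifulSoup, but that is probably not what you want: it is the Beautiful Soup 3 release. Many projects still use BS3, so the BeautifulSoup package remains available, but for a new project you should install beautifulsoup4.)

If you have neither easy_install nor pip, you can download the BS4 source and install it with setup.py:

$ python setup.py install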

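If none of these installation methods work, Beautiful Soup's license allows you to bundle the BS4 code with your project, so you can use it without installing anything.

According to the official documentation, the author develops Beautiful Soup under Python 2.7 and Python 3.2, and in theory it should work on all current Python versions.

After installing Beautiful Soup, we will usually also want to install a parser.

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. One of them is lxml. Depending on your operating system, you can install lxml with one of the following:

$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml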

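Another option is html5lib, a pure-Python parser that parses HTML the same way a web browser does. You can install it with one of the following:

$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib

lxml is the recommended parser because it is faster. On Python 2 versions earlier than 2.7.3 and Python 3 versions earlier than 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not reliable enough.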
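To make the parser choice concrete, here is a minimal sketch that tries lxml first, then html5lib, and finally falls back to the standard library's html.parser. The helper name make_soup and the preference order are assumptions made for illustration; FeatureNotFound is the exception Beautiful Soup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Preference order is an assumption: third-party parsers first, built-in last.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def make_soup(markup):
    """Hypothetical helper: parse markup with the first parser that is installed."""
    for parser in PREFERRED_PARSERS:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no HTML parser available")


soup, parser_used = make_soup("<p>hello</p>")
print(parser_used, soup.p.string)
```

Because html.parser ships with Python, the loop should always find at least one usable parser on a standard installation.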

4. Hands-on practice

With Beautiful Soup installed, let's take it for a quick spin!

Quick start

First we import the class with from bs4 import BeautifulSoup, then define a string that contains some HTML source.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

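All of the operations that follow work on this string. Parsing it with BeautifulSoup gives us a BeautifulSoup object, which can print the document as a neatly indented structure:

# Parse the HTML and get back a BeautifulSoup object
soup = BeautifulSoup(html_doc, "html.parser")
# Print the document with standard indentation
print(soup.prettify())

Output: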

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Here are a few common ways to navigate the parsed structure:

print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])
print(soup.a)
# find_all() returns a list
print(soup.find_all('a'))
print(soup.find(id="link3"))

Output:

<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

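1. The Tag object

Beautiful Soup turns a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds:

Tag, NavigableString, BeautifulSoup, and Comment.

Let's start with Tag. A Tag object corresponds to a tag in the original XML or HTML document; it is simply a piece of markup. For example: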

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

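The a element above, together with its contents, is a Tag object. How do we extract such objects? The quick start section already showed it: soup.title, soup.p, and soup.a each return a Tag. Every Tag has a name, which you can read through .name.

A Tag's name can be changed as well as read. For example:

# Rename the title tag to mytitle
soup.title.name = "mytitle"
print(soup.title)
print(soup.mytitle)

Output:

None
<mytitle>The Dormouse's story</mytitle>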

Now let's look at a Tag's attributes. Take this snippet:

<p class="title"><b>The Dormouse's story</b></p>

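This comes from the HTML above. It has a class attribute whose value is title. A Tag's attributes are accessed the same way as a dictionary; the value comes back as a list because class is a multi-valued attribute.

print(soup.p['class'])
print(soup.p.get('class'))

Output:

['title']
['title']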

We can also use dot access: .attrs returns all of a Tag's attributes at once as a dictionary.

print(soup.p.attrs)

Output:

{'class': ['title']}
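Since attributes behave like a dictionary, they can be added, changed, and deleted as well as read. The sketch below works on a one-line copy of the p tag from the example; the id value "intro" and the extra class "lead" are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")
p = soup.p

p["id"] = "intro"               # add a new attribute (value is illustrative)
p["class"] = ["title", "lead"]  # replace a multi-valued attribute with a list
print(p.attrs)

del p["id"]                     # remove an attribute, like deleting a dict key
print(p.get("id"))              # .get() returns None when the attribute is missing
print(p)
```

Using .get() instead of square brackets avoids a KeyError when an attribute may not be present.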

2. NavigableString

Sometimes we need the text inside a tag. How do we get it? That is what .string is for. Here's the code:

print(soup.p.string)

Output:

The Dormouse's story

Beautiful Soup wraps the text inside a Tag in the NavigableString class. A NavigableString behaves like a Unicode string, and you can convert it to a plain Unicode string with str() in Python 3 (unicode() in Python 2).
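A short sketch of how .string behaves, using a shortened version of the example document (the second paragraph's wording is abbreviated for illustration). It shows the conversion with str(), and also that .string is None when a tag has more than one child, in which case get_text() collects all of the text.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time there were <a id="link1">Elsie</a> and others.</p>'
)
soup = BeautifulSoup(html, "html.parser")

title_text = soup.p.string
print(isinstance(title_text, NavigableString))  # the text is wrapped in NavigableString
plain = str(title_text)                         # convert to an ordinary string
print(plain)

story = soup.find("p", class_="story")
print(story.string)      # None: this tag has several children, so .string is ambiguous
print(story.get_text())  # get_text() joins the text of all descendants
```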

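3. Searching the tree

Beautiful Soup defines many search methods. The most commonly used is find_all(), so we'll focus on it; the others work in a similar way, and you can generalize from this one.

Here is the method's signature:

find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

name: finds every tag whose name matches name; string objects are skipped automatically. The value can be a string, a regular expression, a list, True, or a function.

For example:

a = soup.find_all("a")
print(a)

Output: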

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As you can see, it returns a list (strictly speaking a ResultSet, which is a subclass of list).
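The other kinds of value accepted by name can be sketched as follows, on a cut-down version of the example document. The helper tag_names and the filter function has_class_but_no_id are names chosen for this illustration.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><p class="title"><b>The Dormouse\'s story</b></p>'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a></body></html>'
)
soup = BeautifulSoup(html, "html.parser")


def tag_names(tags):
    """Illustrative helper: reduce a result list to tag names for easy printing."""
    return [tag.name for tag in tags]


def has_class_but_no_id(tag):
    """A function filter receives each Tag and returns True to keep it."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_regex = tag_names(soup.find_all(re.compile("^b")))     # names starting with "b"
by_list = tag_names(soup.find_all(["a", "b"]))            # any name in the list
by_true = tag_names(soup.find_all(True))                  # every tag, no strings
by_function = tag_names(soup.find_all(has_class_but_no_id))

print(by_regex)
print(by_list)
print(by_true)
print(by_function)
```

Results come back in document order, regardless of the order of names in the list.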

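kwargs: in Python, kwargs stands for keyword arguments. If a keyword argument's name is not one of find_all()'s own parameter names, it is treated as a filter on the tag attribute with that name. The value can be a string, a regular expression, a list, or True.

For example:

a = soup.find_all(id='link2')
print(a)

Output: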

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Here's another one:

a = soup.find_all(id=True)
print(a)

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

When the value is True, find_all() returns every Tag that has that attribute, whatever its value.
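Attribute filters also accept regular expressions and lists. The sketch below uses the three links from the example; the data-role attribute on the last link is an invented addition, included to show the attrs dictionary, which is needed when an attribute name is not a valid Python keyword.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    # data-role is a made-up attribute for this illustration
    '<a href="http://example.com/tillie" class="sister" id="link3" data-role="last">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Illustrative helper: reduce a result list to id values."""
    return [tag["id"] for tag in tags]


by_regex = ids(soup.find_all(href=re.compile("elsie")))   # regex against the href value
by_list = ids(soup.find_all(id=["link1", "link3"]))       # id equal to any list item
by_attrs = ids(soup.find_all(attrs={"data-role": "last"}))  # name that cannot be a keyword

print(by_regex)
print(by_list)
print(by_attrs)
```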

One more example:

a = soup.find_all("a", class_="sister")
print(a)

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

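This one should be easy to follow: it finds the a tags whose class attribute is sister. Note that the keyword is class_ with a trailing underscore, because class is a reserved word in Python.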
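A few variations on searching by class, as a rough sketch. The markup is invented for illustration: the second link carries an extra class, favorite, and a p tag shares the sister class.

```python
from bs4 import BeautifulSoup

# Illustrative markup: "favorite" and the <p> tag are not part of the original example.
html = (
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister favorite" id="link2">Lacie</a>'
    '<p class="sister" id="note">not a link</p>'
)
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Illustrative helper: reduce a result list to id values."""
    return [tag["id"] for tag in tags]


links = ids(soup.find_all("a", class_="sister"))             # matches any one of a tag's classes
same_links = ids(soup.find_all("a", attrs={"class": "sister"}))  # equivalent, without the underscore
any_tag = ids(soup.find_all(class_="sister"))                # no tag name: every tag with the class
favorites = ids(soup.find_all("a", class_="favorite"))

print(links, same_links, any_tag, favorites)
```

A tag with several classes matches a search for any single one of them, which is why link2 appears in both the sister and the favorite results.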

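text: the text parameter searches the string content of the document. It accepts the same kinds of values as name. (Since Beautiful Soup 4.4.0 this parameter is also available under the name string.)

For example:

a = soup.find_all("a", text="Lacie")
print(a)

Output: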

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

As you can see, text combines cleanly with the other parameters.
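Used without a tag name, a text search returns the matching strings themselves rather than tags. A minimal sketch on a cut-down version of the example document, assuming Beautiful Soup 4.4.0 or later so that the string spelling of the argument is available:

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><p class="title"><b>The Dormouse\'s story</b></p>'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# No tag name given: the results are NavigableString objects, not tags.
story_strings = soup.find_all(string=re.compile("Dormouse"))
sister_names = soup.find_all(string=["Elsie", "Lacie"])

# Tag name plus string: the results are tags whose .string matches.
la_links = soup.find_all("a", string=re.compile("^La"))

print(story_strings)
print(sister_names)
print(la_links)
```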

limit: the limit parameter caps the number of results returned, much like LIMIT in a SQL statement. The search stops as soon as that many matches have been found.
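A quick sketch of limit on the three links from the example document. It also shows that find() returns the same tag as the first element of find_all() with limit=1.

```python
from bs4 import BeautifulSoup

html = (
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("a", limit=2)   # stop after two matches
print([tag["id"] for tag in first_two])

only_one = soup.find_all("a", limit=1)    # a one-element list
first = soup.find("a")                    # the tag itself, or None if nothing matches
print(only_one[0] is first)
```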

recursive: when you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. If you only want to search the tag's direct children, pass recursive=False.
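The difference is easiest to see on a small document. In this sketch the title tag is a grandchild of html, so it is found by the default search but not when recursive=False.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

deep = soup.html.find_all("title")                      # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # direct children only
children = soup.html.find_all(True, recursive=False)    # every direct child tag

print(deep)
print(shallow)
print([tag.name for tag in children])
```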

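That's it for find_all(). The other search methods are covered in the official Beautiful Soup documentation.

5. END

That's all for Beautiful Soup 4 today. I've only covered the basics and the most commonly used methods; for more on how to use Beautiful Soup 4, see the official documentation. This article was typed entirely by hand, so if you spot any mistakes, send me a message through the official account.

JAVAandPython君: an official account dedicated to original technical articles.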
