且构网

Regular expression for links in HTML text

Updated: 2023-02-23 12:54:25

As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:

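# Python 3 with BeautifulSoup 4 (pip install beautifulsoup4)
import urllib.request
from bs4 import BeautifulSoup

# Download the page, then parse it with the standard-library HTML parser backend
html = urllib.request.urlopen("http://www.google.com").read()
soup = BeautifulSoup(html, "html.parser")
# Every <a> element in the document, as a list of Tag objects
all_links = soup.find_all("a")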

As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.

If you are certain that you will only be working with well-formed, standard XHTML, you can use a (much) faster XML parser such as expat.
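
As a rough illustration of the expat route, here is a minimal sketch using Python's standard-library `xml.parsers.expat` binding. The function name `extract_links` and the sample markup are made up for this example, and it assumes the input really is well-formed XML; anything else raises `ExpatError` instead of being repaired.

```python
import xml.parsers.expat


def extract_links(xhtml):
    """Return the href of every <a> element in a well-formed XHTML string.

    Hypothetical helper for illustration; not part of any library.
    """
    links = []

    def start_element(name, attrs):
        # expat calls this for every opening tag; attrs is a plain dict
        if name == "a" and "href" in attrs:
            links.append(attrs["href"])

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xhtml, True)  # True marks this as the final chunk of input
    return links


# Made-up sample document: two real links and one anchor without an href
sample = (
    "<html><body><p>"
    '<a href="http://example.com/a">A</a> '
    '<a name="x">no href</a>'
    '<a href="/b">B</a>'
    "</p></body></html>"
)
found = extract_links(sample)

# Malformed markup is rejected rather than guessed at
try:
    extract_links('<p><a href="x">unclosed</p>')
    rejected = False
except xml.parsers.expat.ExpatError:
    rejected = True
```

The strictness is the trade-off: the parser is fast because it never tries to recover from errors, so this approach only fits input you control or have validated.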

Regex, for the reasons above (a parser must maintain state, and a regular expression can't do that), will never be a general solution.
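
To make the point concrete, here is a small sketch comparing a deliberately naive pattern with a stateful parser. The pattern `NAIVE_LINK_RE`, the class `LinkCollector`, and the sample markup are all invented for this illustration, and the standard-library `html.parser` stands in for BeautifulSoup only so that the snippet has no third-party dependency.

```python
import re
from html.parser import HTMLParser

# A deliberately naive pattern of the kind people often reach for first
NAIVE_LINK_RE = re.compile(r'<a\s+href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; comments never get here
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up input: a commented-out link, a link whose href is not the first
# attribute and uses single quotes, and an uppercase tag with an unquoted value
html = """<!-- <a href="http://old.example/">commented out</a> -->
<a class="x" href='http://example.com/one'>one</a>
<A HREF=http://example.com/two>two</A>"""

regex_links = NAIVE_LINK_RE.findall(html)
parser_links = collect_links(html)
```

On this input the pattern reports only the commented-out link and misses both live ones, while the parser skips the comment and finds the two real links. The pattern could be patched for each case, but every patch is another piece of context that a parser already tracks as state.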