Regular expressions for links in HTML text

Updated: 2023-02-23 12:49:41

As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:

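# Python 3 with the bs4 package (pip install beautifulsoup4);
# the original answer used Python 2's urllib2 and BeautifulSoup 3.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.google.com").read()
# Name the parser explicitly so results do not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
all_links = soup.find_all("a")  # list of <a> Tag objects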

As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.

If you are certain to be working on standard XHTML, you can use (much) faster XML parsers like expat.
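As a rough illustration of the expat route, here is a minimal sketch using Python's standard-library `xml.parsers.expat` binding. The helper name `extract_links` and the sample document are made up for this example, and it assumes the input is well-formed XML; anything malformed raises `ExpatError` instead of being repaired.

```python
import xml.parsers.expat


def extract_links(xhtml):
    """Return the href values of all <a> elements in a well-formed XHTML string."""
    links = []

    def start_element(name, attrs):
        # By default expat passes attributes as a dict of name -> value.
        if name == "a" and "href" in attrs:
            links.append(attrs["href"])

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xhtml, True)  # True marks this as the final chunk of input
    return links


# Hypothetical sample input, not taken from the original answer.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><a href="http://example.com/a">A</a> and '
    '<a href="/b">B</a></p></body></html>'
)
print(extract_links(sample))
```

Unlike BeautifulSoup, this does no error recovery, which is part of why it can be faster and why it only suits input you know to be valid.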

Regex will never be a general solution, for the reasons given above: extracting links reliably requires a parser that maintains state, and a regular expression can't do that.
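One concrete case where the lack of state shows up is a link inside an HTML comment. The sketch below compares a deliberately naive pattern (`NAIVE_LINK_RE`, invented for this example) with the standard-library `html.parser.HTMLParser`; the `LinkCollector` class and the sample markup are likewise assumptions for illustration, not part of the original answer.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags, ignoring comments."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# A naive pattern of the kind often suggested for this task.
NAIVE_LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]*)"', re.IGNORECASE)

# Hypothetical input: one link is commented out, one is live.
html_text = (
    '<!-- <a href="/old">commented out</a> -->'
    '<a href="/real">real</a>'
)

regex_links = NAIVE_LINK_RE.findall(html_text)

collector = LinkCollector()
collector.feed(html_text)
collector.close()
parser_links = collector.links

print(regex_links)   # includes the commented-out link
print(parser_links)  # only the live link
```

The regex has no notion of "currently inside a comment", so it reports the dead link; the parser tracks that state and skips it. The pattern could be patched for this one case, but each patch tends to expose another (scripts, CDATA, unquoted attributes).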
