且构网


How do I get the opening and closing tags from an HTML string in Beautiful Soup?

Updated: 2023-11-03 13:17:34

There is a way to do this with BeautifulSoup and a simple regular expression:

  • Put the paragraph in a BeautifulSoup object, e.g., soupParagraph.

  • Move the contents between the opening (<p>) and closing (</p>) tags into another BeautifulSoup object, e.g., soupInnerParagraph. (Because the contents are moved, they are not deleted.)

  • soupParagraph then holds only the opening and closing tags.

  • Convert soupParagraph to HTML text and store it in a string variable.

  • To get the opening tag, use a regular expression to remove the closing tag from that string variable.

In general, parsing HTML with a regular expression is problematic and usually best avoided. However, it may be reasonable here.

A closing tag is simple: it has no attributes defined for it, and a comment is not allowed within it.
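To illustrate why a regular expression is adequate for this one step, here is a minimal sketch that uses only Python's re module. The sample string and the helper name strip_closing_tag are assumptions made for the illustration; they are not part of the original answer.

```python
import re

def strip_closing_tag(tag_text, tag_name):
    """Remove a trailing closing tag such as </p>; return (remaining_text, count)."""
    # re.escape guards against tag names that contain regex metacharacters.
    pattern = r"\s*</" + re.escape(tag_name) + r"\s*>\s*$"
    return re.subn(pattern, "", tag_text, count=1)

opening, count = strip_closing_tag('<p class="intro" id="first"></p>', "p")
print(opening)  # <p class="intro" id="first">
print(count)    # 1
```

A count of 0 means no closing tag was found at the end of the string, which is worth treating as an error.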

Can I have attributes on closing tags?

HTML comment inside the opening tag of an element

This code gets the opening tag from a <body...> ... </body> section. The code has been tested.

import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
# .append moves each HTML element from body to bodyInnerHtml, so body.contents
# shrinks as we go; iterate over a copy of the list.
for child in list(body.contents):
    bodyInnerHtml.append(child)

# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')
# Extract the opening tag, by removing the closing tag
regex = r"(\s*</body\s*>\s*$)"
substitution = ""
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, count=0, flags=re.M)
if substitutionCount != 1:
    print("")
    print("ERROR.  The expected HTML </body> tag was not found.")
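The snippet above assumes that a variable named body already exists. As a self-contained sketch of the same steps, the following parses a small sample document, empties the <body> element, and derives both tags. The sample HTML and the snake_case variable names are assumptions for illustration. Rebuilding the closing tag from the tag name is also an assumption of this sketch; it relies on the point made earlier that a closing tag carries no attributes.

```python
import re
from bs4 import BeautifulSoup

# Sample input, made up for this illustration.
html = '<html><body class="main" id="top"><p>Hello</p><p>World</p></body></html>'
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Move every child out of <body>; appending to another tree detaches it from this one.
inner = BeautifulSoup("", "html.parser")
for child in list(body.contents):
    inner.append(child)

# body now serializes to just its opening and closing tags.
body_tags = body.decode(formatter="html")
opening_tag, n = re.subn(r"\s*</body\s*>\s*$", "", body_tags, count=1)
closing_tag = "</" + body.name + ">"

if n != 1:
    print("ERROR. The expected HTML </body> tag was not found.")
print(opening_tag)
print(closing_tag)
print(str(inner))
```

Note that the original soup is modified in place: after the loop, the <body> element is empty and its former children live in inner.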