Updated: 2023-11-03 13:17:34
There is a way to do this with BeautifulSoup and a simple regular expression:
Put the paragraph in a BeautifulSoup object, e.g., soupParagraph.
For the contents between the opening (<p>) and closing (</p>) tags, move the contents to another BeautifulSoup object, e.g., soupInnerParagraph. (Because the contents are moved, they are not deleted.)
Then, soupParagraph will just have the opening and closing tags.
Convert soupParagraph to HTML text and store it in a string variable.
To get the opening tag, use a regular expression to remove the closing tag from the string variable.
In general, parsing HTML with a regular expression is problematic and usually best avoided. However, it may be reasonable here.
A closing tag is simple. It does not have attributes defined for it, and a comment is not allowed within it.
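Because a closing tag has so little structure, a short pattern anchored at the end of the string is enough to remove it. The following is a minimal sketch using only the standard library; the helper name strip_closing_tag and the sample string are assumptions made for illustration, not part of the original answer.

```python
import re

# Matches a closing </body> tag at the end of the text, allowing optional
# whitespace before the tag, between the tag name and ">", and after the tag.
CLOSING_BODY = re.compile(r"\s*</body\s*>\s*$")


def strip_closing_tag(html_text):
    """Hypothetical helper: remove a trailing </body> tag.

    Returns the remaining text and the number of substitutions made.
    """
    return CLOSING_BODY.subn("", html_text)


# Sample input: an emptied <body> element serialized as text.
text = '<body class="home" data-x="1">\n</body >\n'
opening, count = strip_closing_tag(text)
print(opening)
print(count)
```

Checking the returned count guards against input that does not end with the expected closing tag.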
The following code gets the opening tag from a <body...> ... </body> section. The code has been tested.
import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
bodyContentsList = body.contents
for i in range(0, len(bodyContentsList)):
    # .append moves the HTML element from body to bodyInnerHtml,
    # so index 0 always refers to the next element still left in body
    bodyInnerHtml.append(bodyContentsList[0])
# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')
# Extract the opening tag, by removing the closing tag
regex = r"(\s*</body\s*>\s*$)"
substitution = ""
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, count=0, flags=re.M)
if substitutionCount != 1:
    print("")
    print("ERROR. The expected HTML </body> tag was not found.")
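The same steps can be applied to the paragraph case described at the start. The sketch below wraps them in a function and assumes the bs4 package is installed. The function name opening_tag, the holder variable, and the sample markup are hypothetical, chosen only for this example. Note that the function empties the tag it is given, since the children are moved rather than copied.

```python
import re
from bs4 import BeautifulSoup


def opening_tag(tag):
    """Hypothetical helper: return the opening tag of a bs4 Tag as text.

    Side effect: the children of `tag` are moved into a separate
    BeautifulSoup object, so `tag` is left empty afterwards.
    """
    holder = BeautifulSoup("", "html.parser")
    # Iterate over a copy of the list, because append() re-parents each
    # child and therefore shrinks tag.contents as the loop runs.
    for child in list(tag.contents):
        holder.append(child)
    # Only the opening and closing tags remain in the serialized text.
    html = tag.decode(formatter="html")
    # Remove the closing tag, whatever the element name is.
    pattern = r"\s*</" + re.escape(tag.name) + r"\s*>\s*$"
    result, count = re.subn(pattern, "", html)
    if count != 1:
        raise ValueError("The expected closing tag was not found.")
    return result


soup = BeautifulSoup('<p class="intro" id="first">Hello <b>world</b></p>', "html.parser")
paragraph = soup.p
tag_text = opening_tag(paragraph)
print(tag_text)
```

If the original tree must stay intact, one option is to run the function on a copy of the tag (for example via copy.copy) instead of the tag itself.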