Remove a tag using BeautifulSoup but keep its contents

Updated: 2023-12-05 20:53:16

The strategy I used is to replace each unwanted tag with its contents. Children of type NavigableString are kept as they are; any child that is itself a tag is processed recursively first, so that unwanted tags nested inside it are stripped as well. Try this:

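# Written for Python 2 and BeautifulSoup 3 (the old "BeautifulSoup" module).
from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    """Remove the tags named in invalid_tags but keep their contents."""
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):  # True matches every tag
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                # A child that is itself a tag is stripped recursively first.
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            # Swap the tag for the text built from its children.
            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)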

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.
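
For readers on Python 3 with BeautifulSoup 4 (the `bs4` package), a shorter route is probably `Tag.unwrap()`, which replaces a tag with its own children. The sketch below is not part of the original answer: it assumes `bs4` is installed, it uses the standard-library `html.parser` backend, and the function name `strip_tags_bs4` is made up for illustration.

```python
from bs4 import BeautifulSoup


def strip_tags_bs4(html, invalid_tags):
    """Remove the tags named in invalid_tags but keep their contents."""
    # Assumption: the stdlib "html.parser" backend, which leaves a
    # fragment as it is instead of wrapping it in <html>/<body>.
    soup = BeautifulSoup(html, "html.parser")

    # find_all() returns a list built up front, so unwrapping tags
    # while looping over it does not disturb the iteration.
    for tag in soup.find_all(invalid_tags):
        tag.unwrap()  # replace the tag with its children

    return soup


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
print(strip_tags_bs4(html, ["b", "i", "u"]))
```

For the sample input this should print the same `<p>Good, bad, and ugly</p>` as above. Because `unwrap()` moves the child nodes themselves rather than rebuilding them as a string, no explicit recursion is needed, and any allowed tags nested inside a removed one stay in the tree as real tags.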