且构网

Remove all styles, scripts, and HTML tags from an HTML page

Updated: 2023-12-05 10:40:34

It looks like you almost have it. You also need to remove the HTML tags and the CSS styling code. Here is my solution (I updated the function):

from bs4 import BeautifulSoup

def cleanMe(html):
    # parse the HTML; naming the parser explicitly avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(html, "html.parser")
    # remove all JavaScript and stylesheet elements, including their contents
    for element in soup(["script", "style"]):
        element.extract()
    # get the remaining text
    text = soup.get_text()
    # break into lines and remove leading and trailing whitespace on each
    lines = (line.strip() for line in text.splitlines())
    # break phrases separated by a double space onto a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
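
If installing BeautifulSoup is not an option, the same idea can be roughly sketched with only the standard library's html.parser module. This is an alternative technique, not the function from the answer above: the names `_TextExtractor`, `clean_me_stdlib`, and `sample` are hypothetical and introduced here for illustration, and the sketch assumes reasonably well-formed HTML where `<script>` and `<style>` elements are properly closed.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside a script/style element
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def clean_me_stdlib(html):
    """Hypothetical stdlib-only counterpart of cleanMe (an assumption, not the answer's code)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # same post-processing as cleanMe: strip lines, split on double spaces, drop blanks
    lines = (line.strip() for line in parser.text().splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


sample = """<html><head><style>p { color: red; }</style>
<script>alert("hi");</script></head>
<body><h1>Title</h1>
<p>Hello world</p></body></html>"""

print(clean_me_stdlib(sample))
```

For the sample page this prints the heading and the paragraph text on separate lines, with the stylesheet and script contents gone. BeautifulSoup is generally more forgiving of malformed markup, so the bs4 version above is likely the better choice for pages scraped from the wild.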