Removing new line '\n' from the output of python BeautifulSoup

Updated: 2023-12-04 10:48:10

I am using Python's Beautiful Soup to get the contents of:

<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>

My code is as follows:

html_doc="""<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

path = soup.find('div',attrs={'class':'path'})
breadcrum = path.findAll(text=True)

print breadcrum

The output is as follows:

[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi',u'\n']

How can I get the result as a single string in the form abc,def,ghi?

I would also like to understand why the output looks the way it does.

You could do this:

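breadcrum = [item.strip() for item in breadcrum if item.strip()]

The if item.strip() condition drops the items that are empty once the surrounding whitespace, including the newline characters, has been stripped. Testing str(item) instead would not work, because a string holding only a newline is still truthy and would survive as an empty item.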

If you want to join the strings, then do:

','.join(breadcrum)

This will give you abc,def,ghi
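The strip-and-filter step can be tried without Beautiful Soup at all. The sketch below starts from the list shown in the question; the helper name join_nonblank is hypothetical and only used for illustration.

```python
# Text nodes as printed in the question, whitespace-only entries included.
breadcrum = ['\n', 'abc', '\n', 'def', '\n', 'ghi', '\n']


def join_nonblank(items, sep=','):
    """Strip each item, drop those that end up empty, and join the rest."""
    stripped = (item.strip() for item in items)
    return sep.join(text for text in stripped if text)


result = join_nonblank(breadcrum)
print(result)  # abc,def,ghi
```

Filtering on the stripped value is what removes the newline-only entries; filtering on the raw item would keep them.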

EDIT

Although the above gives you what you want, as others in the thread have pointed out, this is not the right way to use BS to extract anchor texts. Once you have the div you are interested in, you should use it to get its children and then read the text of each anchor. For example:

path = soup.find('div',attrs={'class':'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
    # strip() removes the leading space inside each anchor, e.g. ' abc'
    data.append(ele.text.strip())

Then call ','.join(data) to get the single string.
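As for why the newlines appear in the first place: the whitespace between the tags inside the div is parsed as text nodes of its own, so asking for every text node returns them too. The following is a minimal Python 3 sketch, assuming bs4 is installed and the built-in "html.parser" is used (the question does not name a parser). It shows two ways to avoid the manual cleanup, using get_text(strip=True) on each anchor and the stripped_strings generator on the div.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", attrs={"class": "path"})

# Option 1: visit only the anchors and strip each anchor's text.
from_anchors = ",".join(a.get_text(strip=True) for a in path.find_all("a"))

# Option 2: stripped_strings yields every text node with surrounding
# whitespace removed and skips nodes that are whitespace only.
from_strings = ",".join(path.stripped_strings)

print(from_anchors)  # abc,def,ghi
print(from_strings)  # abc,def,ghi
```

Option 1 is the safer choice if the div may later contain text outside the anchors, since it reads the anchors only.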
