Updated: 2023-12-04 10:48:10
I am using Python's Beautiful Soup to get the contents of:
<div class="path">
<a href="#"> abc</a>
<a href="#"> def</a>
<a href="#"> ghi</a>
</div>
My code is as follows:
html_doc="""<div class="path">
<a href="#"> abc</a>
<a href="#"> def</a>
<a href="#"> ghi</a>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
path = soup.find('div',attrs={'class':'path'})
breadcrum = path.findAll(text=True)
print breadcrum
The output is as follows:
[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi', u'\n']
How can I get the result as a single string in this form: abc,def,ghi
I would also like to understand why the output looks the way it does.
You could do this:
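breadcrum = [item.strip() for item in breadcrum if item.strip()]
The if item.strip() condition drops the items that are empty once the newline characters are stripped. Those newline-only items are the whitespace text nodes between the tags, which findAll(text=True) returns along with the anchor texts. (A plain if str(item) would not remove them, because a string containing only a newline is still non-empty.)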
If you want to join the strings, then do:
','.join(breadcrum)
This will give you abc,def,ghi
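As a minimal sketch of the strip, filter, and join steps together, the snippet below uses the list from the question's output as a literal, so it runs without Beautiful Soup. The names cleaned and result are introduced here for illustration only.

```python
# Text nodes as reported in the question's output (whitespace nodes included).
breadcrum = [u'\n', u'abc', u'\n', u'def', u'\n', u'ghi', u'\n']

# Strip each node and keep only the ones that still contain text.
# `cleaned` and `result` are illustrative names, not from the original post.
cleaned = [item.strip() for item in breadcrum if item.strip()]

result = ','.join(cleaned)
print(result)  # abc,def,ghi
```

The filter tests the stripped value because a newline-only string is truthy until it has been stripped.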
EDIT
Although the above gives you what you want, as others in the thread have pointed out, the way you are using BS to extract the anchor texts is not correct. Once you have the div you are interested in, you should use it to get its children and then get the anchor text, like this:
path = soup.find('div', attrs={'class': 'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
    # strip the leading space inside each anchor
    data.append(ele.text.strip())
Then do ','.join(data)
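For reference, here is one way the anchor-based approach might look end to end. It is a sketch under a few assumptions: Python 3, the built-in "html.parser" passed explicitly, and get_text(strip=True) used to trim the leading space in each anchor. The name result is illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="path">
<a href="#"> abc</a>
<a href="#"> def</a>
<a href="#"> ghi</a>
</div>"""

# Naming the parser explicitly is an assumption; the question omitted it.
soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", attrs={"class": "path"})

# Collect the text of each anchor, trimming surrounding whitespace.
data = [a.get_text(strip=True) for a in path.find_all("a")]

result = ",".join(data)  # `result` is an illustrative name
print(result)  # abc,def,ghi
```

Because only the a elements are visited, the whitespace text nodes between them never enter the list, so no filtering step is needed.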