且构网

Web scraping with Beautiful Soup - how to get all categories

Updated: 2023-02-14 10:38:26

If you just want the links from the results you already posted, you can get them like this:

import requests
from bs4 import BeautifulSoup

page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page, timeout=30)
information.raise_for_status()  # fail early on an HTTP error
soup = BeautifulSoup(information.content, 'html.parser')

# Collect every <a class="plink"> element and print its href.
links = soup.find_all('a', class_='plink')
for link in links:
    print(link['href'])

Output:

../info/{{permalink}}
http://www.sfma.org.sg/about/singapore-food-manufacturers-association
http://www.sfma.org.sg/about/council-members
http://www.sfma.org.sg/about/history-and-milestones
http://www.sfma.org.sg/membership/
http://www.sfma.org.sg/member/
http://www.sfma.org.sg/member/alphabet/
http://www.sfma.org.sg/member/category/
http://www.sfma.org.sg/resources/sme-portal
http://www.sfma.org.sg/resources/setting-up-food-establishments-in-singapore
http://www.sfma.org.sg/resources/import-export-requirements-and-procedures
http://www.sfma.org.sg/resources/labelling-guidelines
http://www.sfma.org.sg/resources/wsq-continuing-education-modular-programmes
http://www.sfma.org.sg/resources/holistic-industry-productivity-scorecard
http://www.sfma.org.sg/resources/p-max
http://www.sfma.org.sg/event/
http://www.sfma.org.sg/news/
http://www.fipa.com.sg/
http://www.sfma.org.sg/stp
http://www.sgfoodgifts.sg/
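
The first line of that output, ../info/{{permalink}}, still contains an unfilled {{...}} placeholder, so it is not a link you can follow. As a minimal sketch, assuming you simply want to skip such entries, the filter below drops them. The name is_template_href is a hypothetical helper, and the sample list holds only a few hrefs copied from the output above.

```python
def is_template_href(href):
    """Return True if the href still contains an unfilled {{...}} placeholder."""
    return "{{" in href and "}}" in href


# A few hrefs copied from the output above (not the full list).
hrefs = [
    "../info/{{permalink}}",
    "http://www.sfma.org.sg/member/",
    "http://www.sfma.org.sg/member/category/",
]

# Keep only the hrefs that are real links.
usable = [h for h in hrefs if not is_template_href(h)]
for h in usable:
    print(h)
```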

However, if you want a link to each entry on the website, you need to join the permalink values with the base URL. I've extended the answer from nag to get the data you want from the site you are looking at. Some permalink values also appear in a second list and don't work (they are food/beverage types rather than companies), so I'm removing them.

import re
from collections import Counter
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page, timeout=30)
information.raise_for_status()  # fail early on an HTTP error
soup = BeautifulSoup(information.content, 'html.parser')

# The entries are embedded in inline <script> blocks as permalink:'...' values.
pattern = re.compile(r"permalink:'(.*?)'")

permalinks = []
for script in soup.find_all('script'):
    # script.string is None for empty or external (src=...) scripts.
    permalinks.extend(pattern.findall(script.string or ''))

# Permalinks that occur more than once are the ones repeated in the second
# list (food/beverage types, not companies), so keep only those seen once.
counts = Counter(permalinks)
url_list = [urljoin(page, '../info/' + p) for p in permalinks if counts[p] == 1]

for url in url_list:
    print(url)

Output (truncated):

https://www.sfma.org.sg/member/info/1a-catering-pte-ltd
https://www.sfma.org.sg/member/info/a-linkz-marketing-pte-ltd
https://www.sfma.org.sg/member/info/aalst-chocolate-pte-ltd
https://www.sfma.org.sg/member/info/abb-pte-ltd
https://www.sfma.org.sg/member/info/ace-synergy-international-pte-ltd
https://www.sfma.org.sg/member/info/acez-instruments-pte-ltd
https://www.sfma.org.sg/member/info/acorn-investments-holding-pte-ltd
https://www.sfma.org.sg/member/info/ad-wright-communications-pte-ltd
https://www.sfma.org.sg/member/info/added-international-s-pte-ltd
https://www.sfma.org.sg/member/info/advance-carton-pte-ltd
https://www.sfma.org.sg/member/info/agroegg-pte-ltd
https://www.sfma.org.sg/member/info/airverclean-pte-ltd
...
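
To try the permalink extraction without hitting the network, the same steps can be run against a string. This is only a sketch under stated assumptions: member_urls is a hypothetical helper name, and sample_script is made-up text that imitates permalink:'...' entries (the real page's script layout may differ). It uses only the standard library and shows how urljoin resolves ../info/<permalink> against the page URL.

```python
import re
from collections import Counter
from urllib.parse import urljoin

PERMALINK_RE = re.compile(r"permalink:'(.*?)'")


def member_urls(script_text, page_url):
    """Return full URLs for permalinks that occur exactly once in script_text."""
    permalinks = PERMALINK_RE.findall(script_text)
    counts = Counter(permalinks)
    # Repeated permalinks are assumed to be category entries and are dropped.
    return [urljoin(page_url, "../info/" + p) for p in permalinks if counts[p] == 1]


# Made-up sample data: 'bakery' appears in both lists, so it is filtered out.
sample_script = """
var members = [{name:'ABB Pte Ltd', permalink:'abb-pte-ltd'},
               {name:'Agroegg Pte Ltd', permalink:'agroegg-pte-ltd'},
               {name:'Bakery', permalink:'bakery'}];
var types = [{name:'Bakery', permalink:'bakery'}];
"""

page = "https://www.sfma.org.sg/member/category/manufacturer"
for url in member_urls(sample_script, page):
    print(url)
```

Because the base URL ends in /member/category/manufacturer, the relative path ../info/abb-pte-ltd resolves to https://www.sfma.org.sg/member/info/abb-pte-ltd, which matches the form of the URLs in the output above.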