I've recently taken a look at the python-requests module and I'd like to write a simple web crawler with it. Given a collection of start urls, I want to write a Python function that searches the page content of the start urls for other urls, then calls itself as a callback with the new urls as input, and so on. At first I thought event hooks would be the right tool for this, but the documentation on them is quite sparse. On another page I read that functions used as event hooks have to return the same object that was passed to them, so event hooks are apparently not feasible for this kind of task. Or I simply didn't get it right...
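For what it's worth, a response hook in current requests receives the finished Response object; if the hook returns None, the response is passed through unchanged, so side effects like collecting URLs are fine. A minimal sketch I put together against a throwaway local server (the server and all names here are my own, not from the requests docs):

```python
import http.server
import threading
import requests

seen = []

def record_url(response, *args, **kwargs):
    # A hook that returns None leaves the Response untouched;
    # we only use it for its side effect.
    seen.append(response.url)

class OnePageHandler(http.server.BaseHTTPRequestHandler):
    # Tiny stand-in page so the example runs without touching the network.
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b'<a href="/next">next</a>')

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), OnePageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

r = requests.get(f"http://127.0.0.1:{server.server_port}/",
                 hooks={"response": record_url})
server.shutdown()
```

After the call, `seen` holds the fetched URL and `r` is the ordinary Response, so a hook can feed a crawl queue without replacing anything.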
Here is some pseudocode of what I want to do (borrowed from a pseudo Scrapy spider):
import lxml.html

def parse(response):
    for url in lxml.html.parse(response.url).xpath('//@href'):
        return Request(url=url, callback=parse)
Can someone give me some insight into how to do this with python-requests? Are event hooks the right tool for the job, or do I need something different? (Note: Scrapy is not an option for me, for various reasons.) Thanks a lot!
Here is how I would do it:
import grequests
from bs4 import BeautifulSoup

def get_urls_from_response(r):
    soup = BeautifulSoup(r.text, 'html.parser')  # explicit parser avoids a bs4 warning
    urls = [link.get('href') for link in soup.find_all('a')]
    return urls

def print_url(r, *args, **kwargs):
    # Response hooks receive the Response object itself.
    print(r.url)

def recursive_urls(urls):
    """Given a list of starting urls, recursively find all descendant urls."""
    if len(urls) == 0:
        return
    rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
    # grequests.map yields None for requests that failed, so filter those out
    responses = [r for r in grequests.map(rs) if r is not None]
    url_lists = [get_urls_from_response(response) for response in responses]
    urls = sum(url_lists, [])  # flatten the list of lists into a single list
    recursive_urls(urls)
I haven't tested the code but the general idea is there.
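One caveat I'd add (my note, not the answerer's): each level of links costs a stack frame, so a deep enough chain of pages will hit Python's recursion limit, which is one reason to prefer the loop version given in the edit below. A standalone illustration of the failure mode, with `crawl_level` as a hypothetical stand-in for the recursive crawl:

```python
import sys

def crawl_level(depth):
    # Stand-in for recursive_urls: one recursive call per "level" of links.
    if depth == 0:
        return 0
    return 1 + crawl_level(depth - 1)

print(crawl_level(500))  # shallow chains are fine

try:
    crawl_level(sys.getrecursionlimit() + 100)  # deep chains are not
except RecursionError:
    print("hit the recursion limit")
```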
Note that I am using grequests instead of requests for a performance boost. grequests is basically gevent + requests, and in my experience it is much faster for this sort of task because the links are retrieved asynchronously with gevent.

Edit: here is the same algorithm without using recursion:
import grequests
from bs4 import BeautifulSoup

def get_urls_from_response(r):
    soup = BeautifulSoup(r.text, 'html.parser')  # explicit parser avoids a bs4 warning
    urls = [link.get('href') for link in soup.find_all('a')]
    return urls

def print_url(r, *args, **kwargs):
    # Response hooks receive the Response object itself.
    print(r.url)

def recursive_urls(urls):
    """Given a list of starting urls, iteratively find all descendant urls."""
    while True:
        if len(urls) == 0:
            break
        rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
        # grequests.map yields None for requests that failed, so filter those out
        responses = [r for r in grequests.map(rs) if r is not None]
        url_lists = [get_urls_from_response(response) for response in responses]
        urls = sum(url_lists, [])  # flatten the list of lists into a single list

if __name__ == "__main__":
    recursive_urls(["INITIAL_URLS"])
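As written, neither version resolves relative hrefs or remembers what it has already crawled, so any cycle of links loops forever. A sketch of the missing bookkeeping (the helper's name and shape are my own, not part of the original answer):

```python
from urllib.parse import urljoin

def normalize_and_filter(base_url, hrefs, seen):
    """Resolve hrefs against base_url; keep http(s) urls not crawled yet.

    `seen` is mutated in place so repeated calls share one crawl history.
    """
    out = []
    for href in hrefs:
        if not href:  # <a> tags without an href attribute give None
            continue
        absolute = urljoin(base_url, href)
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            out.append(absolute)
    return out

seen = set()
print(normalize_and_filter("http://example.com/a/",
                           ["b.html", "/c", "b.html", None], seen))
# → ['http://example.com/a/b.html', 'http://example.com/c']
print(normalize_and_filter("http://example.com/a/", ["b.html"], seen))
# → []  (nothing new on a second pass)
```

In the crawl loop you would replace the raw `urls = sum(url_lists, [])` with a call to this helper per response, so the queue only ever contains absolute, unvisited URLs.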