How to write a web crawler with callbacks using python-requests and event hooks?



I've recently taken a look at the python-requests module and I'd like to write a simple web crawler with it. Given a collection of start urls, I want to write a Python function that searches the webpage content of the start urls for other urls and then calls the same function again as a callback, with the new urls as input, and so on. At first, I thought event hooks would be the right tool for this purpose, but their documentation is quite sparse. On another page I read that functions used for event hooks have to return the same object that was passed to them, so event hooks are apparently not feasible for this kind of task. Or I simply didn't get it right...
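For context, the hook interface in current versions of requests is small: "response" is the only hook event, the callback receives the Response object, and returning None keeps the original response (returning a Response would replace it downstream). A minimal sketch, with example.com as a placeholder url:

import requests

def log_url(response, *args, **kwargs):
    # The hook receives the Response object; returning None leaves the
    # response unchanged for the caller.
    print(response.url)

requests.get("http://example.com", hooks={"response": log_url})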

Here is some pseudocode of what I want to do (borrowed from a pseudo Scrapy spider):

import lxml.html    

def parse(response):
    for url in lxml.html.parse(response.url).xpath('//@href'):
        return Request(url=url, callback=parse)

Can someone give me some insight into how to do it with python-requests? Are event hooks the right tool for that, or do I need something different? (Note: Scrapy is not an option for me for various reasons.) Thanks a lot!

Here is how I would do it:

import grequests
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    soup = BeautifulSoup(r.text, "html.parser")
    urls = [link.get('href') for link in soup.find_all('a')]
    return urls


def print_url(r, *args, **kwargs):
    # "response" is the only hook event supported by current versions of
    # requests; the hook receives the Response object.
    print(r.url)


def recursive_urls(urls):
    """
    Given a list of starting urls, recursively finds all descendant urls.
    """
    if len(urls) == 0:
        return
    rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
    responses = grequests.map(rs)
    # grequests.map() yields None for requests that failed, so skip those.
    url_lists = [get_urls_from_response(r) for r in responses if r is not None]
    urls = sum(url_lists, [])  # flatten the list of lists into a single list
    recursive_urls(urls)

I haven't tested the code but the general idea is there.
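One caveat worth flagging: nothing above remembers which urls have already been fetched, so any cycle in the link graph makes the recursion run forever. A hypothetical seen set (not part of the original answer) is enough to guarantee termination; this variant reuses print_url and get_urls_from_response from above:

import grequests

def recursive_urls(urls, seen=None):
    # 'seen' is a hypothetical addition: urls already crawled are skipped,
    # so cycles in the link graph no longer cause infinite recursion.
    seen = seen if seen is not None else set()
    urls = [u for u in urls if u not in seen]
    if not urls:
        return
    seen.update(urls)
    rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
    responses = grequests.map(rs)
    url_lists = [get_urls_from_response(r) for r in responses if r is not None]
    recursive_urls(sum(url_lists, []), seen)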

Note that I am using grequests instead of requests for a performance boost. grequests is basically gevent + requests, and in my experience it is much faster for this sort of task because the links are retrieved asynchronously with gevent.
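If you need to keep the number of simultaneous connections in check, grequests.map also accepts a size argument that caps the gevent pool. A minimal sketch, with placeholder urls:

import grequests

urls = ["http://example.com/a", "http://example.com/b"]
rs = (grequests.get(u) for u in urls)
# size caps how many requests run concurrently in the gevent pool.
responses = grequests.map(rs, size=10)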


Edit: here is the same algorithm without using recursion:

import grequests
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    soup = BeautifulSoup(r.text, "html.parser")
    urls = [link.get('href') for link in soup.find_all('a')]
    return urls


def print_url(r, *args, **kwargs):
    # "response" is the only hook event in current versions of requests.
    print(r.url)


def recursive_urls(urls):
    """
    Given a list of starting urls, iteratively finds all descendant urls
    (same algorithm as above, without recursion).
    """
    while True:
        if len(urls) == 0:
            break
        rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
        responses = grequests.map(rs)
        # grequests.map() yields None for requests that failed, so skip those.
        url_lists = [get_urls_from_response(r) for r in responses if r is not None]
        urls = sum(url_lists, [])  # flatten the list of lists into a single list

if __name__ == "__main__":
    recursive_urls(["INITIAL_URLS"])
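One more practical note: the href values BeautifulSoup extracts are often relative paths (or missing entirely), which grequests cannot fetch directly. A sketch of normalizing them with urllib.parse.urljoin, as a drop-in replacement for get_urls_from_response above:

from urllib.parse import urljoin
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    soup = BeautifulSoup(r.text, "html.parser")
    # Skip <a> tags without an href, and resolve relative links against
    # the url the response actually came from.
    return [urljoin(r.url, link['href'])
            for link in soup.find_all('a') if link.get('href')]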