
How to download a large number of URLs in parallel in pyspark?

Updated: 2023-02-26 13:13:31


If you're using concurrent.futures, you don't need asyncio at all (it will bring you no benefits since you are running in multiple threads anyway). You can use concurrent.futures.wait() to wait for multiple futures in parallel.
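For instance, waiting on several futures at once might look like this — a minimal sketch in which a stand-in `fetch` function replaces the real HTTP call:

```python
import concurrent.futures

def fetch(url):
    # stand-in for a real HTTP request (e.g. requests.get)
    return 'body of ' + url

urls = ['http://example.com/a', 'http://example.com/b']

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    # map each future back to its URL so results can be identified
    futures = {executor.submit(fetch, u): u for u in urls}
    # block until every future has finished
    done, not_done = concurrent.futures.wait(futures)

for fut in done:
    print(futures[fut], '->', fut.result())
```

`concurrent.futures.wait()` returns two sets, `done` and `not_done`; with the default `return_when=ALL_COMPLETED`, `not_done` is empty once the call returns.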


I can't test your data, but it should work with code like this:

import concurrent.futures, requests

def get_one(url):
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text

def get_all():
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        futures = [executor.submit(get_one, url)
                   for url in urls.toLocalIterator()]
    # the end of the "with" block will automatically wait
    # for all of the executor's tasks to complete

    for fut in futures:
        if fut.exception() is not None:
            print('{}: {}'.format(fut.exception(), 'ERR'))
        else:
            print('{}: {}'.format(fut.result(), 'OK'))


To do the same thing with asyncio, you should use aiohttp instead.
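An aiohttp version of the same fetch-all pattern might be sketched like this (assuming aiohttp is installed; the function names mirror the thread-pool example above but are otherwise illustrative):

```python
import asyncio
import aiohttp

async def get_one(session, url):
    # one GET request, reusing a shared session
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()

async def get_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [get_one(session, u) for u in urls]
        # return_exceptions=True collects errors instead of
        # cancelling the remaining downloads
        return await asyncio.gather(*tasks, return_exceptions=True)

# results = asyncio.run(get_all(urls))
```

Unlike the thread-pool version, concurrency here comes from a single event loop rather than multiple threads, so there is no `max_workers` cap; aiohttp's default connection limit per session bounds the parallelism instead.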