Building a RESTful Flask API for Scrapy

Updated: 2023-11-30 13:08:16

The API should allow arbitrary HTTP GET requests containing URLs the user wants scraped, and then Flask should return the results of the scrape.

The following code works for the first HTTP request, but after the Twisted reactor stops, it won't restart. I may not even be going about this the right way, but I just want to put a RESTful Scrapy API up on Heroku, and what I have so far is all I can think of.

Is there a better way to architect this solution? Or how can I allow scrape_it to return without stopping the Twisted reactor (which can't be started again)?

from flask import Flask
import os
import sys
import json

from n_grams.spiders.n_gram_spider import NGramsSpider

# scrapy api
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

app = Flask(__name__)


def scrape_it(url):
    items = []
    def add_item(item):
        items.append(item)

    runner = CrawlerRunner()

    d = runner.crawl(NGramsSpider, [url])
    d.addBoth(lambda _: reactor.stop()) # <<< TROUBLES HERE ???

    dispatcher.connect(add_item, signal=signals.item_passed)

    reactor.run(installSignalHandlers=0) # the script will block here until the crawling is finished


    return items

@app.route('/scrape/<path:url>')
def scrape(url):

    ret = scrape_it(url)

    return json.dumps(ret, ensure_ascii=False, encoding='utf8')


if __name__ == '__main__':
    PORT = os.environ['PORT'] if 'PORT' in os.environ else 8080

    app.run(debug=True, host='0.0.0.0', port=int(PORT))

I think there is no good way to create a Flask-based API for Scrapy. Flask is not the right tool for this because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started/stopped more than once in a single thread.

Let's assume there is no problem with the Twisted reactor and that you can start and stop it. It won't make things much better, because your scrape_it function may block for an extended period of time, so you will need many threads/processes.

I think the way to go is to create the API using an asynchronous framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.

Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. But Tornado is not hard either, because you can make it use the Twisted event loop.
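For example, a minimal klein-based sketch (reusing the NGramsSpider import from your code; the url keyword argument and the item_scraped handler are illustrative, and how the spider actually reads its start URL depends on its constructor) could look like this:

import json

from klein import Klein
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from n_grams.spiders.n_gram_spider import NGramsSpider  # spider from the question

app = Klein()

@app.route('/scrape/<path:url>')
def scrape(request, url):
    items = []
    runner = CrawlerRunner()
    crawler = runner.create_crawler(NGramsSpider)
    # collect each scraped item; Scrapy only passes the arguments the receiver accepts
    crawler.signals.connect(lambda item: items.append(dict(item)),
                            signal=signals.item_scraped)
    d = runner.crawl(crawler, url=url)  # assumption: the spider takes the URL as a kwarg
    # the reactor keeps running; klein writes the response when the Deferred fires
    d.addCallback(lambda _: json.dumps(items, ensure_ascii=False))
    return d

if __name__ == '__main__':
    app.run('0.0.0.0', 8080)  # klein starts the Twisted reactor once and never stops it

Because each crawl is just another Deferred on the already-running reactor, several scrape requests can be in flight at the same time.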

There is a project called ScrapyRT which does something very similar to what you want to implement: it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
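ScrapyRT is started from inside a Scrapy project (the scrapyrt command) and exposes a /crawl.json endpoint. A rough client-side sketch, assuming your spider is registered under the name n_grams and ScrapyRT is listening on its default port 9080:

import requests

response = requests.get(
    'http://localhost:9080/crawl.json',
    params={'spider_name': 'n_grams', 'url': 'http://example.com'},
)
print(response.json())  # scraped items plus crawl status/stats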

As an example of Scrapy-Tornado integration, check Arachnado; it shows how to integrate Scrapy's CrawlerProcess with Tornado's Application.

If you really want a Flask-based API then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of Scrapy's efficiency; if you go this route, you could just as well use requests + BeautifulSoup.
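For completeness, a minimal sketch of the separate-process variant with Flask: every request spawns a child process, and because each process gets its own fresh Twisted reactor, the restart problem from your code disappears. The run_spider helper and the url keyword argument are illustrative, not part of your code.

import json
from multiprocessing import Process, Queue

from flask import Flask
from scrapy import signals
from scrapy.crawler import CrawlerProcess

from n_grams.spiders.n_gram_spider import NGramsSpider  # spider from the question

app = Flask(__name__)

def run_spider(url, result_queue):
    items = []
    process = CrawlerProcess()
    crawler = process.create_crawler(NGramsSpider)
    crawler.signals.connect(lambda item: items.append(dict(item)),
                            signal=signals.item_scraped)
    process.crawl(crawler, url=url)  # assumption: the spider takes the URL as a kwarg
    process.start()                  # blocks until the crawl finishes, then stops this process's reactor
    result_queue.put(items)

@app.route('/scrape/<path:url>')
def scrape(url):
    result_queue = Queue()
    worker = Process(target=run_spider, args=(url, result_queue))
    worker.start()
    items = result_queue.get()       # wait for the child process to finish crawling
    worker.join()
    return json.dumps(items, ensure_ascii=False)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

The Flask worker still blocks while the child process crawls, which is exactly the inefficiency described above; a queue like Celery would move that wait out of the request/response cycle.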