且构网

Accessing Django models with Scrapy: defining the path to the Django project

Updated: 2023-12-01 22:06:22


I'm very new to Python and Django. I'm currently exploring using Scrapy to scrape sites and save data to the Django database. My goal is to run a spider based on a domain given by a user.

I've written a spider that extracts the data I need and stores it correctly in a json file when calling

scrapy crawl spider -o items.json -t json

As described in the scrapy tutorial.

My goal is now to get the spider to successfully save data to the Django database, and then work on getting the spider to run based on user input.

I'm aware that various posts exist on this subject, such as these: link 1, link 2, link 3

But having spent more than 8 hours trying to get this to work, I'm assuming I'm not the only one still facing issues with this. I'll therefore try to gather all the knowledge I've gotten so far in this post, as well as hopefully posting a working solution at a later point. Because of this, this post is rather long.

It appears to me that there are two different solutions for saving data to the Django database from Scrapy. One is to use DjangoItem, another is to import the models directly (as done here).

I'm not completely aware of the advantages and disadvantages of these two, but it seems like the difference is simply that using DjangoItem is more convenient and shorter.

What I've done:

I've added:

def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ

    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)       

    setup_environ(project)

setup_django_env('/Users/Anders/DjangoTraining/wsgi/')

The error I'm getting is:

ImportError: No module named settings

I suspect I'm defining the path to my Django project incorrectly?

I've also tried the following:

setup_django_env('../../') 

How do I define the path to my Django project correctly? (if that is the issue)

I think the main misconception is the package path vs. the settings module path. In order to use Django's models from an external script you need to set the DJANGO_SETTINGS_MODULE environment variable. This module then has to be importable (i.e. if the settings path is myproject.settings, then the statement from myproject import settings should work in a Python shell).

As most projects in django are created in a path outside the default PYTHONPATH, you must add the project's path to the PYTHONPATH environment variable.
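Both conditions can be checked before Scrapy is involved at all. The sketch below is not part of Django or Scrapy: the function name settings_importable is made up for illustration, and it relies only on the standard library's importlib. It assumes the path you pass is the directory that contains the project package, not the package directory itself.

```python
import importlib
import importlib.util
import os
import sys


def settings_importable(settings_module, project_path=None):
    """Return True if `settings_module` (e.g. 'myweb.settings') can be found.

    `project_path` is the directory that CONTAINS the project package
    (e.g. /home/rolando/projects/myweb), not the package directory itself.
    """
    if project_path is not None:
        project_path = os.path.abspath(project_path)
        if project_path not in sys.path:
            sys.path.insert(0, project_path)
    # Make sure freshly created directories are noticed by the import system.
    importlib.invalidate_caches()
    try:
        return importlib.util.find_spec(settings_module) is not None
    except ModuleNotFoundError:
        # Raised when the parent package (e.g. 'myweb') is not importable.
        return False
```

If settings_importable('myweb.settings') is False but settings_importable('myweb.settings', '/home/rolando/projects/myweb') is True, the only missing piece is the PYTHONPATH entry.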

Here is a step-by-step guide to create a fully working (and minimal) Django models integration into a Scrapy project:

Note: These instructions worked as of the date of the last edit. If they don't work for you, please add a comment and describe your issue and scrapy/django versions.

  1. The projects will be created within /home/rolando/projects directory.

  2. Start the django project.

    $ cd ~/projects
    $ django-admin startproject myweb
    $ cd myweb
    $ ./manage.py startapp myapp
    

  3. Create a model in myapp/models.py.

    from django.db import models
    
    
    class Person(models.Model):
        name = models.CharField(max_length=32)
    

  4. Add myapp to INSTALLED_APPS in myweb/settings.py.

    # at the end of settings.py
    INSTALLED_APPS += ('myapp',)
    

  5. Set the database settings in myweb/settings.py.

    # at the end of settings.py
    DATABASES['default']['ENGINE'] = 'django.db.backends.sqlite3'
    DATABASES['default']['NAME'] = '/tmp/myweb.db'
    

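  6. Create the database. Note that syncdb was removed in Django 1.9; on newer versions run ./manage.py makemigrations myapp followed by ./manage.py migrate instead.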

    $ ./manage.py syncdb --noinput
    Creating tables ...
    Installing custom SQL ...
    Installing indexes ...
    Installed 0 object(s) from 0 fixture(s)
    

  7. Create the scrapy project.

    $ cd ~/projects
    $ scrapy startproject mybot
    $ cd mybot
    

  8. Create an item in mybot/items.py.

    Note: In newer versions of Scrapy, DjangoItem lives in the separate scrapy-djangoitem package; install it and use from scrapy_djangoitem import DjangoItem.

    from scrapy.contrib.djangoitem import DjangoItem
    from scrapy.item import Field

    from myapp.models import Person


    class PersonItem(DjangoItem):
        # fields for this item are automatically created from the django model
        django_model = Person
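To make "automatically created from the django model" more concrete, here is a rough sketch of the idea. It is not the DjangoItem source. It assumes only that a Django model exposes its fields through _meta.fields and that each field has name and auto_created attributes. The helper item_field_names and the FakePerson stand-in are hypothetical, used so the snippet runs without Django installed.

```python
from types import SimpleNamespace


def item_field_names(model):
    """List the model field names an item would expose.

    Auto-created fields (such as the implicit `id` primary key) are skipped.
    """
    return [f.name for f in model._meta.fields if not f.auto_created]


# Stand-in for the Person model so the sketch runs without Django installed.
# A real model class exposes the same `_meta.fields` attribute.
FakePerson = SimpleNamespace(
    _meta=SimpleNamespace(
        fields=[
            SimpleNamespace(name="id", auto_created=True),
            SimpleNamespace(name="name", auto_created=False),
        ]
    )
)

print(item_field_names(FakePerson))  # prints ['name']
```

This is why PersonItem(name='rolando') works without declaring a name field on the item.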

The final directory structure is this:

/home/rolando/projects
├── mybot
│   ├── mybot
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       └── __init__.py
│   └── scrapy.cfg
└── myweb
    ├── manage.py
    ├── myapp
    │   ├── __init__.py
    │   ├── models.py
    │   ├── tests.py
    │   └── views.py
    └── myweb
        ├── __init__.py
        ├── settings.py
        ├── urls.py
        └── wsgi.py

From here, we are basically done with the code required to use the Django models in a Scrapy project. We can test it right away with the scrapy shell command, but note the required environment variables:

$ cd ~/projects/mybot
$ PYTHONPATH=~/projects/myweb DJANGO_SETTINGS_MODULE=myweb.settings scrapy shell

# ... scrapy banner, debug messages, python banner, etc.

In [1]: from mybot.items import PersonItem

In [2]: i = PersonItem(name='rolando')

In [3]: i.save()
Out[3]: <Person: Person object>

In [4]: PersonItem.django_model.objects.get(name='rolando')
Out[4]: <Person: Person object>

So, it is working as intended.
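The same two variables can also be set from a small Python launcher instead of being typed on the command line. This is only a sketch under the assumptions of this guide (paths and module name as above). The helper names scrapy_env and run_scrapy are invented for illustration and are not part of Scrapy.

```python
import os
import subprocess


def scrapy_env(project_path, settings_module, base_env=None):
    """Return a copy of the environment with the two variables set."""
    env = dict(os.environ if base_env is None else base_env)
    project_path = os.path.expanduser(project_path)
    existing = env.get("PYTHONPATH")
    # Prepend the Django project so it wins over other entries.
    env["PYTHONPATH"] = (
        project_path + os.pathsep + existing if existing else project_path
    )
    env["DJANGO_SETTINGS_MODULE"] = settings_module
    return env


def run_scrapy(args, cwd, project_path, settings_module):
    """Run `scrapy <args>` inside `cwd` with the Django variables set."""
    env = scrapy_env(project_path, settings_module)
    return subprocess.run(
        ["scrapy", *args], cwd=os.path.expanduser(cwd), env=env, check=True
    )
```

For example, run_scrapy(["shell"], "~/projects/mybot", "~/projects/myweb", "myweb.settings") would be the equivalent of the shell command above.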

Finally, you might not want to set the environment variables each time you run your bot. There are many alternatives to address this issue, although the best one is to have the projects' packages actually installed in a path that is on PYTHONPATH.

This is one of the simplest solutions: add these lines to your mybot/settings.py file to set up the environment variables.

# Setting up django's project full path.
import sys
sys.path.insert(0, '/home/rolando/projects/myweb')

# Setting up django's settings module name.
# This module is located at /home/rolando/projects/myweb/myweb/settings.py.
import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'myweb.settings'

# Since Django 1.7, setup() call is required to populate the apps registry.
import django; django.setup()
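If you would rather not hard-code an absolute path, the same setup can be wrapped in a small helper. This is a sketch, not an official recipe: the function name point_at_django_project is hypothetical, and the commented usage assumes the directory layout shown earlier, with mybot and myweb as siblings.

```python
import os
import sys
from pathlib import Path


def point_at_django_project(project_dir, settings_module):
    """Make `settings_module` importable and select it for Django.

    Returns the absolute project directory that was put on sys.path.
    """
    project_dir = str(Path(project_dir).expanduser().resolve())
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    # setdefault keeps a value that was already exported in the shell.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)
    return project_dir


# Hypothetical use at the top of mybot/settings.py, assuming mybot and
# myweb are sibling directories:
#
#     projects_dir = Path(__file__).resolve().parents[2]
#     point_at_django_project(projects_dir / "myweb", "myweb.settings")
#     import django
#     django.setup()
```

Using setdefault rather than plain assignment means a DJANGO_SETTINGS_MODULE exported in the shell still takes precedence, which can be handy for pointing the bot at a test settings module.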

Note: A better approach than the path hacking is to have setuptools-based setup.py files in both projects and run python setup.py develop (or pip install -e .), which links your project path into Python's path (I'm assuming you use virtualenv).

That is enough. For completeness, here is a basic spider and pipeline for a fully working project:

  1. Create the spider.

    $ cd ~/projects/mybot
    $ scrapy genspider -t basic example example.com
    

    The spider code:

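    # file: mybot/spiders/example.py
    # Note: newer Scrapy releases dropped BaseSpider; there, use
    # "import scrapy" and subclass scrapy.Spider instead.
    from scrapy.spider import BaseSpider
    from mybot.items import PersonItem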
    
    
    class ExampleSpider(BaseSpider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ['http://www.example.com/']
    
        def parse(self, response):
            # do stuff
            return PersonItem(name='rolando')
    

  2. Create a pipeline in mybot/pipelines.py to save the item.

    class MybotPipeline(object):
        def process_item(self, item, spider):
            item.save()
            return item
    

    Here you can either use item.save() if you are using the DjangoItem class, or import the Django model directly and create the object manually. Either way, the main issue is to define the environment variables so you can use the Django models.

  3. Add the pipeline setting to your mybot/settings.py file.

    ITEM_PIPELINES = {
        'mybot.pipelines.MybotPipeline': 1000,
    }
    

  4. Run the spider.

    $ scrapy crawl example
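For the second option mentioned in step 2, importing the Django model directly instead of calling item.save(), a pipeline could look roughly like the sketch below. The class name PersonModelPipeline and the model argument are hypothetical additions. The model is imported lazily on the assumption that the Django setup in mybot/settings.py has already run by then, and get_or_create is used in place of a plain create to avoid duplicate rows.

```python
class PersonModelPipeline(object):
    """Save scraped people with the Django model instead of item.save()."""

    def __init__(self, model=None):
        # `model` is only a hook for testing; when the class defines no
        # from_crawler method, Scrapy builds the pipeline without arguments,
        # so the real model is imported lazily below.
        self._model = model

    def _get_model(self):
        if self._model is None:
            # Imported here so that the Django setup in mybot/settings.py
            # has already run by the time the model is needed.
            from myapp.models import Person
            self._model = Person
        return self._model

    def process_item(self, item, spider):
        # get_or_create avoids inserting a duplicate row on repeated crawls.
        self._get_model().objects.get_or_create(name=item["name"])
        return item
```

To use it, register 'mybot.pipelines.PersonModelPipeline' in ITEM_PIPELINES in place of MybotPipeline. With this variant the spider could yield plain dicts or ordinary Scrapy items, since nothing depends on DjangoItem.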