Passing a Selenium driver to Scrapy

Updated: 2023-11-22 23:50:58

As you said, Scrapy opens your initial URL, not the page modified by Selenium.

If you want the page as Selenium sees it, use driver.page_source.encode('utf-8') (the encoding step is not compulsory). You can also pass it to a Scrapy Selector:

response = Selector(text=driver.page_source.encode('utf-8'))

After that, work with the response as you normally would.

I would try something like this (note that I haven't tested the code):

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep

count = 0

class ContractSpider(scrapy.Spider):

    name = "contracts"

    def start_requests(self):
        urls = [
            'https://www.contractsfinder.service.gov.uk/Search/Results',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Keep the driver on the instance so the other methods can reach it.
        self.driver = webdriver.Firefox()
        # An implicit wait tells WebDriver to poll the DOM for a certain amount of time when trying to find any element
        # (or elements) not immediately available.
        self.driver.implicitly_wait(5)

    def get_selenium_response(self, url):
        # Load the page in the browser, then fill in and submit the search form.
        # find_element(By.NAME, ...) replaces the find_element_by_name helpers removed in Selenium 4.
        self.driver.get(url)
        elem2 = self.driver.find_element(By.NAME, "open")
        elem2.click()
        elem = self.driver.find_element(By.NAME, "awarded")
        elem.click()
        elem3 = self.driver.find_element(By.ID, "awarded_date")
        elem3.click()
        elem4 = self.driver.find_element(By.NAME, "awarded_from")
        elem4.send_keys("01/03/2018")
        elem4.send_keys(Keys.RETURN)
        elem5 = self.driver.find_element(By.NAME, "awarded_to")
        elem5.send_keys("16/03/2018")
        elem5.send_keys(Keys.RETURN)
        elem6 = self.driver.find_element(By.NAME, "adv_search")
        self.driver.execute_script("arguments[0].scrollIntoView(true);", elem6)
        elem6.send_keys(Keys.RETURN)
        # Return the HTML as the browser currently sees it.
        return self.driver.page_source.encode('utf-8')

    def parse(self, response):
        global count
        count += 1
        strcount = str(count)
        # The HTML below comes from the webdriver, not from Scrapy's own download;
        # you can use selectors to extract data from it.
        selenium_response = Selector(text=self.get_selenium_response(response.url))
    ...
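
One thing the spider above does not do is close the browser it opens in __init__. A minimal sketch of one way to handle that, assuming you want the driver released when the crawl ends: Scrapy calls a spider's closed(reason) method, if one is defined, when the spider finishes. SeleniumDriverMixin is a hypothetical name introduced here for illustration, not part of Scrapy or Selenium, and like the code above it has not been tested against the real site.

```python
class SeleniumDriverMixin:
    """Hypothetical helper: releases a Selenium driver when the spider closes."""

    # The spider is expected to assign its WebDriver here, e.g. in __init__.
    driver = None

    def closed(self, reason):
        # Scrapy invokes closed(reason) on the spider when the crawl ends.
        if self.driver is not None:
            # quit() shuts down the browser and ends the WebDriver session.
            self.driver.quit()
            # Drop the reference so a second call does not quit twice.
            self.driver = None
```

To use it, list the mixin before the Scrapy base class, for example class ContractSpider(SeleniumDriverMixin, scrapy.Spider), and keep assigning self.driver in __init__ as before.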