The spider logic doesn't seem to be correct.
I had a quick look at your website, and it seems there are several types of pages:

1. the index page listing research publications (your start URL)
2. the individual article pages linked from that index
3. the PDF files linked from each article page
Thus the correct logic is: fetch the #1 page first, then fetch the #2 pages, and from those we can download the #3 pages (the PDFs).
However, your spider tries to extract links to #3 pages directly from the #1 page.
I have updated your code, and here's something that actually works:
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"
    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        # Follow links from the index page (#1) to the individual article pages (#2)
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # On each article page (#2), follow the links to the PDF files (#3)
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Save the PDF body to a file named after the last segment of the URL
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
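If you save this as a standalone file (the filename pwc_tax_spider.py below is just a placeholder), you can run it without creating a full Scrapy project by using the built-in runspider command; the PDFs are written to the directory you run it from:

scrapy runspider pwc_tax_spider.py

For anything beyond a quick one-off crawl, Scrapy's FilesPipeline is the more robust way to download files, but the simple save_pdf callback keeps this example self-contained.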