Updated: 2022-09-15 12:24:17
Really, I just wanted to try scraping some images. Looking at the page first, there are two things to grab: the cover image and the download links. Simple enough.
Item definition:
import scrapy

class TiantianmeijuItem(scrapy.Item):
    name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()
    episode = scrapy.Field()
    episode_url = scrapy.Field()
name holds the show's title.
image_urls and images are used by the image-downloading pipeline: one stores the image URLs to fetch, the other stores information about the downloaded images.
image_paths serves no real purpose here other than recording the paths of successfully downloaded images.
episode and episode_url hold the episode labels and their matching download links.
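To make the field roles concrete, here is a minimal sketch of what a populated item looks like by the time it reaches the pipelines. Plain dicts stand in for scrapy.Item (which behaves dict-like), and every value below is a hypothetical example, not real scraped data:

```python
# A populated item, modeled as a plain dict. All values are hypothetical.
item = {
    'name': ['Some Show'],                            # show title; .extract() returns a list
    'image_urls': ['http://example.com/cover.jpg'],   # consumed by the images pipeline
    'images': [],                                     # filled in by the images pipeline after download
    'image_paths': [],                                # set later, in item_completed()
    'episode': ['S01E01', 'S01E02'],                  # episode labels
    'episode_url': ['ed2k://ep1', 'ed2k://ep2'],      # matching download links
}

# Pairing each episode with its link, as the file-writing pipeline later does:
lines = ['{name}: {url}'.format(name=n, url=u)
         for n, u in zip(item['episode'], item['episode_url'])]
```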
Spider:
import scrapy
from tiantianmeiju.items import TiantianmeijuItem
import sys
reload(sys)  # Python 2 deletes sys.setdefaultencoding after startup, so reload sys to get it back
sys.setdefaultencoding('utf-8')

class CacthUrlSpider(scrapy.Spider):
    name = 'meiju'
    allowed_domains = ['cn163.net']
    start_urls = ["http://cn163.net/archives/{id}/".format(id=id)
                  for id in ['16355', '13470', '18766', '18805']]

    def parse(self, response):
        item = TiantianmeijuItem()
        item['name'] = response.xpath('//*[@id="content"]/div[2]/div[2]/h2/text()').extract()
        item['image_urls'] = response.xpath('//*[@id="entry"]/div[2]/img/@src').extract()
        item['episode'] = response.xpath('//*[@id="entry"]/p[last()]/a/text()').extract()
        item['episode_url'] = response.xpath('//*[@id="entry"]/p[last()]/a/@href').extract()
        yield item
The page layout is quite simple.
Pipelines: I wrote two pipelines here, one to save the download links to a file and one to download the images.
import os
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from settings import IMAGES_STORE

class TiantianmeijuPipeline(object):
    def process_item(self, item, spider):
        return item

class WriteToFilePipeline(object):
    def process_item(self, item, spider):
        item = dict(item)
        FolderName = item['name'][0].replace('/', '')
        downloadFile = 'download_urls.txt'
        with open(os.path.join(IMAGES_STORE, FolderName, downloadFile), 'w') as file:
            for name, url in zip(item['episode'], item['episode_url']):
                file.write('{name}: {url}\n'.format(name=name, url=url))
        return item

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'item': item})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        FolderName = item['name'][0].replace('/', '')
        image_guid = request.url.split('/')[-1]
        filename = u'{}/{}'.format(FolderName, image_guid)
        return filename
get_media_requests and item_completed follow the standard ImagesPipeline pattern. But the default image storage path is
<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg,
and I wanted to replace full with a directory named after the show, so I also overrode file_path.
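The path logic can be exercised outside Scrapy. A minimal sketch of what the overridden file_path computes, with a hypothetical show name and image URL:

```python
# Reproduce file_path()'s computation as a standalone function.
# The show name and URL below are hypothetical examples.
def build_image_path(show_name, image_url):
    folder = show_name.replace('/', '')    # strip slashes so the name is a valid directory
    image_guid = image_url.split('/')[-1]  # keep the file name from the end of the URL
    return u'{}/{}'.format(folder, image_guid)

path = build_image_path(u'Show/Name', 'http://example.com/img/cover.jpg')
# Images then land under <IMAGES_STORE>/<show name>/<file name> instead of full/
```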
Enable the pipelines in settings (lower numbers run earlier, so the images pipeline runs before the file writer):
ITEM_PIPELINES = {
    'tiantianmeiju.pipelines.WriteToFilePipeline': 2,
    'tiantianmeiju.pipelines.MyImagesPipeline': 1,
}
IMAGES_STORE = os.path.join(os.getcwd(), 'image')  # image storage path
IMAGES_EXPIRES = 90
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
After the crawl finishes, the result looks like this (screenshot omitted): each show gets its own folder under image/, holding its cover image and a download_urls.txt of episode links.
This post was reposted from the Yunwei Biji blog on 51CTO; the original is at http://blog.51cto.com/lihuipeng/1713531. Please contact the original author before republishing.
lihuipeng