Getting data from a <script> tag in HTML using Scrapy

Updated: 2023-02-19 17:07:45

The problem is that the desired data lives inside the JavaScript code, not in the HTML markup. On top of that, your current approach of relying on line indexes is fragile: it breaks as soon as the script gains or loses a line.

The idea is to locate the script tag containing the desired data, use a regular expression to extract the object/dictionary that holds the prices, load that object into a Python dictionary with the json module, and then read the values you need.

Demo from the Scrapy shell:

In [1]: import re
In [2]: import json

In [3]: pattern = re.compile(r"KBB\.Vehicle\.Pages\.PricingOverview\.Buyers\.setup\(.*?data: ({.*?}),\W+adPriceRanges", re.MULTILINE | re.DOTALL)
In [4]: data = response.xpath("//script[contains(., 'KBB.Vehicle.Pages.PricingOverview.Buyers.setup')]/text()").re(pattern)[0]

In [5]: data = data.replace("//Workaround until we get cross domain working for Flash", "")

In [6]: data_obj = json.loads(data)

In [7]: data_obj['values']['fpp']
Out[7]: {u'price': 15569.0, u'priceMax': 17356.0, u'priceMin': 13781.0}

In [8]: data_obj['values']['retail']
Out[8]: {u'price': 16370.0, u'priceMax': 0.0, u'priceMin': 0.0}
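
The same steps can be tried outside the Scrapy shell. Below is a minimal sketch using only the standard library; `SCRIPT_TEXT` is a made-up stand-in for the text of the matching script tag (the real page's content is not reproduced here), and `extract_prices` is a hypothetical helper name, not part of Scrapy.

```python
import json
import re

# Hypothetical sample of what the <script> text might look like.
# Only the overall shape (setup(... data: {...}, adPriceRanges ...)) is assumed.
SCRIPT_TEXT = """
KBB.Vehicle.Pages.PricingOverview.Buyers.setup({
    vehicleId: 1,
    data: {
        "values": {
            "fpp": {"price": 15569.0, "priceMin": 13781.0, "priceMax": 17356.0},
            "retail": {"price": 16370.0, "priceMin": 0.0, "priceMax": 0.0}
        }
    },
    adPriceRanges: {}
});
"""

# Capture the object that follows "data: " and ends right before
# ", adPriceRanges". DOTALL lets "." span line breaks.
PATTERN = re.compile(
    r"KBB\.Vehicle\.Pages\.PricingOverview\.Buyers\.setup\(.*?data: ({.*?}),\W+adPriceRanges",
    re.DOTALL,
)


def extract_prices(script_text):
    """Return the 'values' dictionary from the script text, or None if absent."""
    match = PATTERN.search(script_text)
    if match is None:
        return None
    return json.loads(match.group(1))["values"]


prices = extract_prices(SCRIPT_TEXT)
print(prices["fpp"]["price"])
print(prices["retail"]["price"])
```

Returning None when the pattern does not match avoids the IndexError that `.re(pattern)[0]` would raise on a page without the expected script. If the captured object contains JavaScript comments, as in the demo above, they still have to be stripped before calling `json.loads`, since JSON does not allow comments.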