Problem web scraping a dynamic HTML site with Beautiful Soup

Updated: 2022-12-23 17:36:51

The page loads its content dynamically through Ajax. The browser's network inspector shows that all of the data comes from one very large JSON file located at https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets. To load all the job data, you can use this script:

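import json  # only needed for the optional pretty-print below

import requests

url = "https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets"

# Identify the request as an Ajax (XMLHttpRequest) call.
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get(url, headers=headers, timeout=30)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
data = r.json()  # decode the JSON body into a list of job records

# To print all the data in pretty form, uncomment this line:
# print(json.dumps(data, indent=4, sort_keys=True))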

for d in data:
    print(f'ID:\t{d["ID"]}')
    print(f'Job Title:\t{d["JobTitle"]}')
    print(f'Created:\t{d["Created"]}')
    print('*' * 80)

# Available keys in this JSON:
# ClassName
# LastEdited
# Created
# ANZSCO
# JobTitle
# Description
# WorkTasks
# WorkEnvironment
# PhysicalMentalDemands
# Comments
# EntryRequirements
# Group
# ID
# RecordClassName

This prints:

ID: 2327
Job Title:  Watch and Clock Maker and Repairer   
Created:    2017-07-11 11:33:52
********************************************************************************
ID: 2328
Job Title:  Web Administrator
Created:    2017-07-11 11:33:52
********************************************************************************
ID: 2329
Job Title:  Welder 
Created:    2017-07-11 11:33:52

...and so on

The comments at the end of the script list the available keys, which you can use to access the data for a specific job.
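As a rough illustration of using those keys, the sketch below picks out the records whose JobTitle contains a given word and keeps only selected fields. The helper names find_jobs and clean are hypothetical, and the sample list is copied by hand from the output shown above (the value types are an assumption); in practice you would pass in the data list loaded by the script.

```python
def clean(value):
    """Strip surrounding whitespace from strings; leave other values untouched."""
    return value.strip() if isinstance(value, str) else value


def find_jobs(records, title_substring, keys=("ID", "JobTitle", "Created")):
    """Return the selected keys of every record whose JobTitle contains
    title_substring (case-insensitive)."""
    needle = title_substring.lower()
    matches = []
    for record in records:
        title = str(record.get("JobTitle", ""))
        if needle in title.lower():
            # .get() yields None for a key that a record does not have
            matches.append({key: clean(record.get(key)) for key in keys})
    return matches


# Hypothetical sample shaped like the records printed above (types assumed).
sample = [
    {"ID": 2327, "JobTitle": "Watch and Clock Maker and Repairer   ",
     "Created": "2017-07-11 11:33:52"},
    {"ID": 2328, "JobTitle": "Web Administrator",
     "Created": "2017-07-11 11:33:52"},
    {"ID": 2329, "JobTitle": "Welder ",
     "Created": "2017-07-11 11:33:52"},
]

for job in find_jobs(sample, "web"):
    print(job)
```

Passing a different keys tuple, for example ("JobTitle", "Description", "WorkTasks"), would return those fields instead, assuming each record carries the keys listed in the script.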