且构网

Scraping information and submitting a form with Phantom

Updated: 2023-09-02 23:12:04

There are several issues with your script that prevent a successful scrape.

To check a checkbox, you don't set its value again (it's already set in the HTML); you set its checked attribute to true:

document.getElementById('crID%3a250').setAttribute("checked", true); // France

The button that submits the form is a hyperlink (<a>), which has no submit method. It should be clicked instead (it even has an onClick handler in the markup):

document.getElementById('ctl00_main_filters_anchorApplyBottom').click(); // submit the form

The search request is sent through AJAX and takes time to complete, so your script should wait at least a second before trying to fetch the data. I'll show how to wait in the full working code below.
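A fixed delay is the simplest way to wait, but it can be too short on a slow connection and wastefully long on a fast one. As an alternative, here is a minimal sketch of polling until a condition holds. The names sleep and waitFor, and the default timings, are assumptions for illustration and are not part of the phantom module.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll `check` (sync or async) until it returns a truthy value.
// Rejects if nothing truthy is returned before `timeoutMs` has elapsed.
async function waitFor(check, timeoutMs = 10000, intervalMs = 250) {
    const deadline = Date.now() + timeoutMs;
    while (true) {
        const value = await check();
        if (value) {
            return value;
        }
        if (Date.now() >= deadline) {
            throw new Error('Timed out after ' + timeoutMs + ' ms');
        }
        await sleep(intervalMs);
    }
}

// Hypothetical usage inside the async scraping function:
//   await waitFor(() => page.evaluate(function() {
//       return document.querySelectorAll('.DataContainer table').length > 0;
//   }));
```

The check in the usage comment is only a placeholder. If the page already shows a table before the filter is applied, the condition has to detect the refreshed results instead, for example by comparing the table's content before and after the click.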

Next, you can fetch only the table data; there is no need to sift through all the HTML:

var result = await page.evaluate(function() {
    return document.querySelectorAll('.DataContainer table')[0].outerHTML; 
});
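If the goal is the cell values rather than the markup, the function passed to page.evaluate can return them as nested arrays. This is a minimal sketch, assuming the result table matches the same '.DataContainer table' selector. extractRows is a hypothetical helper name, and it is written in ES5 style because it runs inside the PhantomJS page context, which may not support newer syntax.

```javascript
// Runs inside the page: relies only on the global `document` and ES5 features.
// Returns an array of rows, each an array of trimmed cell texts,
// or null when no matching table is present.
function extractRows() {
    var table = document.querySelector('.DataContainer table');
    if (!table) {
        return null;
    }
    return Array.prototype.map.call(table.querySelectorAll('tr'), function(row) {
        return Array.prototype.map.call(row.querySelectorAll('th, td'), function(cell) {
            return cell.textContent.trim();
        });
    });
}

// Hypothetical usage inside the async scraping function:
//   const rows = await page.evaluate(extractRows);
```

Returning plain arrays keeps the result serializable, which matters because values cross from the page context back into Node.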

Here's a slightly trimmed-down version of your script with the issues corrected:

var phantom = require('phantom');

var url = 'http://data.un.org/Data.aspx?q=population&d=PopDiv&f=variableID%3A12';

// A promise to wait for n of milliseconds
const timeout = ms => new Promise(resolve => setTimeout(resolve, ms));

(async function() {
    const instance = await phantom.create();
    const page = await instance.createPage();

    await page.on('onResourceRequested', function(requestData) {
        console.info('Requesting', requestData.url);
    });
    await page.on('onConsoleMessage', function(msg) {
        console.info(msg);
    });

    const status = await page.open(url);
    console.log('STATUS:', status);

    // submit
    await page.evaluate(function() {
        document.getElementById('crID%3a250').setAttribute("checked", true); // France
        document.getElementById('timeID%3a79').setAttribute("checked", true); // 2015
        document.getElementById('varID%3a2').setAttribute("checked", true); // Medium
        document.getElementById('ctl00_main_filters_anchorApplyBottom').click(); // click submit button
    });

    console.log('Waiting 1.5 seconds..');    
    await timeout(1500);

    // Get only the table contents
    var result = await page.evaluate(function() {
        return document.querySelectorAll('.DataContainer table')[0].outerHTML; 
    });
    console.log('RESULT:', result);

    await instance.exit();
})();


Last but not least, you could simply replay the AJAX request made by the form: the URL of the search request works quite well on its own when opened in another tab.

You don't even need a headless browser to get it; just fetch it with cURL or requests and process the response. This happens with a lot of sites, so it's useful to check the Network tab in your browser's devtools before scraping.
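To stay in Node, the same idea can be sketched with the built-in http module. This assumes the handler is reachable over plain HTTP without cookies or special headers and answers with status 200, none of which is confirmed here. fetchText is a hypothetical helper name.

```javascript
const http = require('http');

// Fetch a URL over plain HTTP and resolve with the response body as text.
// Rejects on network errors and on any status other than 200.
function fetchText(url) {
    return new Promise((resolve, reject) => {
        http.get(url, (res) => {
            if (res.statusCode !== 200) {
                res.resume(); // discard the body so the socket is released
                reject(new Error('Unexpected status ' + res.statusCode));
                return;
            }
            res.setEncoding('utf8');
            let body = '';
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => resolve(body));
        }).on('error', reject);
    });
}

// Hypothetical usage, with the request URL copied from the devtools Network tab:
//   fetchText(requestUrl).then((body) => console.log(body));
```

If the site redirects to HTTPS, the https module exposes the same get function and can be swapped in.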

Update

If there are so many results that they span several pages, the request takes one more parameter, Page:

data.un.org/Handlers/DataHandler.ashx?Service=page&Page=3&DataFilter=variableID:12&DataMartId=PopDiv&UserQuery=population&c=2,4,6,7&s=_crEngNameOrderBy:asc,_timeEngNameOrderBy:desc,_varEngNameOrderBy:asc&RequestId=461
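To walk through several pages, the URL can be rebuilt for each page number. A minimal sketch: buildPageUrl is a hypothetical helper, the http scheme is an assumption, and the parameter values (including RequestId=461) are copied from the request above and may well be session-specific.

```javascript
// Build the DataHandler URL for a given result page.
// Parameter values are taken from the captured request; only Page varies.
function buildPageUrl(pageNumber) {
    const url = new URL('http://data.un.org/Handlers/DataHandler.ashx');
    const params = {
        Service: 'page',
        Page: String(pageNumber),
        DataFilter: 'variableID:12',
        DataMartId: 'PopDiv',
        UserQuery: 'population',
        c: '2,4,6,7',
        s: '_crEngNameOrderBy:asc,_timeEngNameOrderBy:desc,_varEngNameOrderBy:asc',
        RequestId: '461'
    };
    Object.keys(params).forEach((key) => url.searchParams.set(key, params[key]));
    // Note: searchParams percent-encodes ':' and ',' in the values.
    return url.toString();
}

// URLs for the first three pages.
const pageUrls = [1, 2, 3].map(buildPageUrl);
```

The percent-encoded form is equivalent to the literal one for a server that decodes query strings in the usual way, but that behavior is not verified against this handler.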