


Updated: 2023-09-02 23:12:04


There are several issues with your script that prevent a successful scrape.

To check a checkbox, you don't set its value again (it's already set in the HTML!); you set its checked attribute to true:

document.getElementById('crID%3a250').setAttribute("checked", true); // France


The button that submits the form is a hyperlink <a>, which doesn't have a submit method, so it should be clicked instead (it even has an onClick function in the code):

document.getElementById('ctl00_main_filters_anchorApplyBottom').click(); // submit the form


The search request is sent through ajax and takes time to complete, so your script should wait for at least a second before trying to fetch the data. I'll show how to wait in the full working code below.
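A fixed delay is the simplest way to wait, but it is either too short on a slow connection or wastefully long on a fast one. As a rough alternative, the sketch below polls a condition until it holds or a time limit passes. The `waitFor` helper and its `interval` and `limit` options are hypothetical names introduced here for illustration; they are not part of the phantom package.

```javascript
// Resolve after ms milliseconds.
const timeout = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical helper: call check() every `interval` ms until it returns a
// truthy value, and reject once `limit` ms have passed without success.
async function waitFor(check, { interval = 250, limit = 10000 } = {}) {
    const deadline = Date.now() + limit;
    while (true) {
        const value = await check(); // check may be sync or async
        if (value) return value;
        if (Date.now() >= deadline) {
            throw new Error('waitFor: timed out after ' + limit + ' ms');
        }
        await timeout(interval);
    }
}
```

With phantom, `check` could be a function that calls `page.evaluate` and reports whether the results have changed since the click, for example by comparing the table's outerHTML with a copy saved beforehand. Which condition reliably signals that the ajax request has finished on this particular page is an assumption you would need to verify.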


Next, you can fetch only the table data; there is no need to sift through all the HTML:

var result = await page.evaluate(function() {
    return document.querySelectorAll('.DataContainer table')[0].outerHTML;
});


Here's a slightly trimmed-down version of your script with the issues corrected:

var phantom = require('phantom');

var url = 'http://data.un.org/Data.aspx?q=population&d=PopDiv&f=variableID%3A12';

// A promise that resolves after the given number of milliseconds
const timeout = ms => new Promise(resolve => setTimeout(resolve, ms));

(async function() {
    const instance = await phantom.create();
    const page = await instance.createPage();

    // Log every resource the page requests
    await page.on('onResourceRequested', function(requestData) {
        console.info('Requesting', requestData.url);
    });
    // Forward messages from the page's console (assumed handler body)
    await page.on('onConsoleMessage', function(msg) {
        console.info(msg);
    });

    const status = await page.open(url);
    console.log('STATUS:', status);

    // Tick the filters and submit
    await page.evaluate(function() {
        document.getElementById('crID%3a250').setAttribute("checked", true); // France
        document.getElementById('timeID%3a79').setAttribute("checked", true); // 2015
        document.getElementById('varID%3a2').setAttribute("checked", true); // Medium
        document.getElementById('ctl00_main_filters_anchorApplyBottom').click(); // click submit button
    });

    console.log('Waiting 1.5 seconds..');
    await timeout(1500);

    // Get only the table contents
    var result = await page.evaluate(function() {
        return document.querySelectorAll('.DataContainer table')[0].outerHTML;
    });
    console.log('RESULT:', result);

    await instance.exit();
})();


Last but not least, you could simply replay the ajax request made by the form. You will find that the URL of the search request works quite well on its own when opened in another tab:


You don't even need a headless browser to get it; cURL or requests is enough to fetch the response and process it. This happens with a lot of sites, so it's worth checking the network tab in your browser's devtools before scraping.



And if there are so many results that they are spread over several pages, the request takes one more parameter, Page:

data.un.org/Handlers/DataHandler.ashx?Service=page&Page=3&DataFilter=variableID:12&DataMartId=PopDiv&UserQuery=population&c=2,4,6,7&s=_crEngNameOrderBy:asc,_timeEngNameOrderBy:desc,_varEngNameOrderBy:asc&RequestId=461
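To illustrate replaying that request without a headless browser, here is a minimal Node.js sketch that builds the URL for a given page and downloads the response with the built-in http module. The function names `buildPageUrl` and `fetchPage` are made up for this example. The http:// scheme is assumed from the page URL used in the script, and the parameter values (including RequestId) are copied as-is from the request above; whether the server accepts them unchanged in a later session is not something this sketch guarantees.

```javascript
const http = require('http');

// Handler address from the request above; the scheme is an assumption.
const HANDLER = 'http://data.un.org/Handlers/DataHandler.ashx';

// Build the request URL for one page of results.
// URLSearchParams percent-encodes ':' and ',' in the values.
function buildPageUrl(pageNumber) {
    const params = new URLSearchParams({
        Service: 'page',
        Page: String(pageNumber),
        DataFilter: 'variableID:12',
        DataMartId: 'PopDiv',
        UserQuery: 'population',
        c: '2,4,6,7',
        s: '_crEngNameOrderBy:asc,_timeEngNameOrderBy:desc,_varEngNameOrderBy:asc',
        RequestId: '461' // copied from the captured request
    });
    return HANDLER + '?' + params.toString();
}

// Download one page and resolve with the raw response body as a string.
function fetchPage(pageNumber) {
    return new Promise((resolve, reject) => {
        http.get(buildPageUrl(pageNumber), res => {
            let body = '';
            res.setEncoding('utf8');
            res.on('data', chunk => { body += chunk; });
            res.on('end', () => resolve(body));
        }).on('error', reject);
    });
}
```

Calling `fetchPage(3).then(console.log)` would print whatever the handler returns for the third page. Looping over page numbers until the response comes back empty is one possible way to collect all results, though the handler's behaviour past the last page would have to be checked first.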
