且构网 - Sharing all things programming and development

Jsoup does not download the entire page

Updated: 2023-12-03 20:53:10

There is nothing wrong with the selector engine. The issue is the default maxBodySize of Jsoup.Connection: Jsoup.connect(url).get() reads the response body only up to that limit (1MB by default) and truncates the rest, so elements near the end of a large page never reach the parser. Setting a larger limit with maxBodySize(...), or 0 for unlimited size, returns the whole page.

I first suspected Jsoup's internal HTTP connection handling and recommended replacing it with HttpClient - http://hc.apache.org/ . That turned out not to be the cause, but I still keep the HttpClient code as a sample.

Output of the program:

  • load from file= 1452
  • load from http client= 1452
  • load from jsoup connect= 1350
  • load from jsoup connect using maxBodySize= 1452

package test;

import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestJsoup {

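    /**
     * Counts the elements with class "tr_normal" after loading the same page
     * four ways: from a saved file, via HttpClient, via Jsoup.connect with the
     * default body limit, and via Jsoup.connect with a larger maxBodySize.
     *
     * @param args unused
     * @throws IOException if the page or the saved file cannot be read
     */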
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF8", "");
        Elements es = doc.getElementsByClass("tr_normal");
        System.out.println("load from file= " + es.size());

        doc = Jsoup.parse(loadContentByHttpClient(), "UTF8", "");
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from http client= " + es.size());

        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        doc = Jsoup.connect(url).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect= " + es.size());

        // About 2MB; the default limit is 1MB. Pass 0 for unlimited size.
        int maxBodySize = 2048000;
        doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect using maxBodySize= " + es.size());
    }

    // Fetches the page with Apache HttpClient and returns the raw body stream.
    public static InputStream loadContentByHttpClient()
            throws ClientProtocolException, IOException {
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(url);
        HttpResponse response = client.execute(request);
        return response.getEntity().getContent();
    }

    // Loads a saved copy of the page; eisdeqty_pf.htm must be on the classpath.
    public static InputStream loadContentFromClasspath() throws IOException {
        return TestJsoup.class.getClassLoader().getResourceAsStream(
                "eisdeqty_pf.htm");
    }

}
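
To see why a body size limit lowers the element count, here is a minimal sketch that uses only the JDK, so it runs without Jsoup or network access. It is not Jsoup's actual implementation: the class BodyLimitDemo and the helpers readCapped, countOccurrences, buildPage and countRows are hypothetical names made up for illustration, and the page is a small generated table rather than the real HKEX page. It only mimics the effect described above, where bytes past the limit are dropped before parsing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BodyLimitDemo {

    // Reads at most maxBodySize bytes from the stream; 0 means no limit.
    // Bytes past the cap are silently dropped (hypothetical helper, not Jsoup code).
    public static byte[] readCapped(InputStream in, int maxBodySize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[64];
        int remaining = maxBodySize;
        while (true) {
            int want = (maxBodySize == 0) ? buf.length : Math.min(buf.length, remaining);
            if (want <= 0) {
                break; // cap reached, stop reading
            }
            int n = in.read(buf, 0, want);
            if (n == -1) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    // Counts non-overlapping occurrences of marker in text.
    public static int countOccurrences(String text, String marker) {
        int count = 0;
        int i = text.indexOf(marker);
        while (i != -1) {
            count++;
            i = text.indexOf(marker, i + marker.length());
        }
        return count;
    }

    // Builds a stand-in page with the given number of tr_normal rows.
    public static String buildPage(int rows) {
        StringBuilder sb = new StringBuilder("<table>");
        for (int i = 0; i < rows; i++) {
            sb.append("<tr class=\"tr_normal\"><td>row ").append(i).append("</td></tr>");
        }
        return sb.append("</table>").toString();
    }

    // Counts the rows that survive when the body is capped at maxBodySize bytes.
    public static int countRows(String page, int maxBodySize) {
        byte[] bytes = page.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            String body = new String(readCapped(in, maxBodySize), StandardCharsets.UTF_8);
            return countOccurrences(body, "class=\"tr_normal\"");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = buildPage(10);
        System.out.println("capped at half the page= " + countRows(page, page.length() / 2));
        System.out.println("unlimited (0)= " + countRows(page, 0));
    }
}
```

The capped run reports fewer rows than the unlimited run, which mirrors the 1350 versus 1452 result above: the missing rows were cut off before parsing, not missed by the selector.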