Updated: 2023-12-03 20:53:10
The problem is Jsoup's internal HTTP connection handling; nothing is wrong with the selector engine. I didn't dig deep, but proprietary ways of handling HTTP connections always seem to cause problems. I would recommend replacing it with HttpClient (http://hc.apache.org/). If you can't add HttpClient as a dependency, you might want to check how the Jsoup source code handles HTTP connections.
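Update: the actual issue is the default maxBodySize of Jsoup.Connection. A response body larger than that limit is truncated, so the parsed document contains fewer elements than the full page. Please refer to the updated code below, which raises the limit. I have kept the HttpClient code as a sample.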
Output of the program
load from jsoup connect using maxBodySize= 1452
package test;

import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestJsoup {

    /**
     * Parses the same page from four sources and prints how many
     * "tr_normal" elements each one yields.
     *
     * @param args unused
     * @throws IOException if any source cannot be read
     */
    public static void main(String[] args) throws IOException {
        // 1. Local copy of the page on the classpath
        Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF-8", "");
        Elements es = doc.getElementsByClass("tr_normal");
        System.out.println("load from file= " + es.size());

        // 2. Page fetched with Apache HttpClient, parsed by Jsoup
        doc = Jsoup.parse(loadContentByHttpClient(), "UTF-8", "");
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from http client= " + es.size());

        // 3. Page fetched by Jsoup with its default body size limit
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        doc = Jsoup.connect(url).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect= " + es.size());

        // 4. Page fetched by Jsoup with a raised body size limit:
        //    about 2MB (the default here is 1MB); 0 means unlimited size
        int maxBodySize = 2048000;
        doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect using maxBodySize= " + es.size());
    }

    /** Fetches the page with Apache HttpClient and returns the raw response stream. */
    public static InputStream loadContentByHttpClient()
            throws ClientProtocolException, IOException {
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(url);
        HttpResponse response = client.execute(request);
        return response.getEntity().getContent();
    }

    /** Opens the saved copy of the page from the classpath. */
    public static InputStream loadContentFromClasspath() throws IOException {
        return TestJsoup.class.getClassLoader().getResourceAsStream(
                "eisdeqty_pf.htm");
    }
}
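To make the effect of a body size limit concrete without touching the network, here is a minimal standard-library-only simulation. It is not Jsoup's implementation: the class BodyCapDemo and its methods buildPage, readCapped, countRows and rowsSeen are hypothetical names made up for this sketch, and the generated table only imitates a page with many "tr_normal" rows. It reads a page through a byte cap (0 meaning no cap, mirroring the convention used by maxBodySize above) and counts how many row markers survive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BodyCapDemo {

    // Builds a hypothetical HTML table with the given number of rows.
    static String buildPage(int rows) {
        StringBuilder sb = new StringBuilder("<table>");
        for (int i = 0; i < rows; i++) {
            sb.append("<tr class=\"tr_normal\"><td>").append(i).append("</td></tr>");
        }
        return sb.append("</table>").toString();
    }

    // Reads at most maxBytes from the stream; maxBytes == 0 means no cap.
    static String readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        boolean capped = maxBytes > 0;
        int remaining = maxBytes;
        while (true) {
            int want = capped ? Math.min(buf.length, remaining) : buf.length;
            if (want == 0) {
                break; // cap reached, the rest of the body is dropped
            }
            int n = in.read(buf, 0, want);
            if (n == -1) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            if (capped) {
                remaining -= n;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Counts occurrences of the row marker in the (possibly truncated) text.
    static int countRows(String html) {
        String marker = "class=\"tr_normal\"";
        int count = 0;
        int idx = 0;
        while ((idx = html.indexOf(marker, idx)) != -1) {
            count++;
            idx += marker.length();
        }
        return count;
    }

    // Convenience wrapper: rows visible when the page is read through the cap.
    static int rowsSeen(String page, int maxBytes) {
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        try {
            return countRows(readCapped(in, maxBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = buildPage(1000);
        System.out.println("rows with 10000-byte cap = " + rowsSeen(page, 10_000));
        System.out.println("rows with no cap = " + rowsSeen(page, 0));
    }
}
```

With the cap in place, only the rows that fit in the first 10000 bytes are counted, while the uncapped read sees all 1000. This is the same pattern as the lower count from a plain Jsoup.connect(url).get() compared with the call that raises maxBodySize.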