Updated: 2023-12-03 20:18:34
parseBodyFragment() and all the other parse() methods use an HTML parser by default, and that parser always adds the HTML shell (<html>…</html>, <head>…</head>, etc.).
Don't use the HTML parser; use the XML parser instead ;-)
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Replace that single line and your problem is solved.
final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
Document docHtml = Jsoup.parse(html);
Document docXml = Jsoup.parse(html, "", Parser.xmlParser());
System.out.println("******* HTML *******\n" + docHtml);
System.out.println();
System.out.println("******* XML *******\n" + docXml);
Output:
******* HTML *******
<html>
<head></head>
<body>
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
</body>
</html>
******* XML *******
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
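If you would rather check the difference in code than compare printed output by eye, something like the following should work. It is only a sketch: it assumes jsoup is on the classpath, and the class name ParserShellCheck and the helper hasHtmlShell are made up for this illustration, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class ParserShellCheck {

    // Hypothetical helper (not a jsoup API): true if the parsed
    // document contains an <html> element anywhere in its tree.
    static boolean hasHtmlShell(Document doc) {
        return !doc.select("html").isEmpty();
    }

    public static void main(String[] args) {
        final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";

        // Default HTML parser: wraps the fragment in <html>, <head>, <body>.
        Document docHtml = Jsoup.parse(html);

        // XML parser: keeps the input as given, no wrapper elements.
        Document docXml = Jsoup.parse(html, "", Parser.xmlParser());

        System.out.println("HTML parser adds shell: " + hasHtmlShell(docHtml)); // true
        System.out.println("XML parser adds shell:  " + hasHtmlShell(docXml));  // false
    }
}
```

The select("html") call is used here simply because the XML parser never creates that element on its own, so an empty result means no shell was added.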