且构网


Jsoup: how to extract each element

Updated: 2023-12-03 21:14:16

If you only need the text of a document plus any <b> and <i> tags (as in your example), consider using the Whitelist class (see the jsoup docs). Whitelist.simpleText() already permits b, em, i, strong and u, so the addTags call below only matters for tags beyond those:

String html = "<body><p class='default'> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <b>Hello World</b> </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> , Testing </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <i><b>Font </b></i> </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> Style </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <i>Check</i> </span> <span style='color: #000000; font-size: 10pt; font-family: MyriadPro-Bold;'> </span> </p></body>";

Whitelist wl = Whitelist.simpleText();
wl.addTags("b", "i"); // add additional tags here as necessary
String clean = Jsoup.clean(html, wl);
System.out.println(clean);  

This prints (matching your example):

11-07 19:04:45.738: I/System.out(318): <b>Hello World</b>   , Testing   
11-07 19:04:45.738: I/System.out(318): <i><b>Font </b></i>   Style   
11-07 19:04:45.738: I/System.out(318): <i>Check</i>
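
On newer jsoup releases the same cleanup can be written with Safelist, which replaced Whitelist from jsoup 1.14.1 onward. The following is a minimal sketch, assuming jsoup 1.14.1 or later is on the classpath; the class name CleanExample, the method keepSimpleFormatting and the shortened input string are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class CleanExample {
    // Hypothetical helper: strips everything except simple text formatting tags.
    static String keepSimpleFormatting(String html) {
        // simpleText() permits b, em, i, strong and u; addTags(...) can extend the list.
        Safelist allowed = Safelist.simpleText();
        return Jsoup.clean(html, allowed);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the markup used in the answer.
        String html = "<p class='default'><span style='color: #000000;'><b>Hello World</b></span>"
                + "<span style='color: #000000;'> , Testing </span></p>";
        System.out.println(keepSimpleFormatting(html));
    }
}
```

The span and p wrappers and their style attributes are dropped, while the b and i tags survive.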


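Update: to get the inner HTML of each span as a separate string, select the spans and collect them into a list: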

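// Parse the markup first; html is the string defined above
Document doc = Jsoup.parse(html);

// Collect the inner HTML of every <span> as a separate string
List<String> elements = new ArrayList<>();
for (Element span : doc.select("span")) {
    elements.add(span.html());
}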
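
Put together as a complete program, the per-span extraction might look like the sketch below. It assumes jsoup is on the classpath; the class name SpanExtractor, the method spanHtml and the shortened sample markup are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SpanExtractor {
    // Hypothetical helper: returns the inner HTML of each <span>, in document order.
    static List<String> spanHtml(String html) {
        Document doc = Jsoup.parse(html);
        List<String> parts = new ArrayList<>();
        for (Element span : doc.select("span")) {
            parts.add(span.html()); // markup inside the span, including <b>/<i> children
        }
        return parts;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the markup used in the answer.
        String html = "<p><span> <b>Hello World</b> </span>"
                + "<span> , Testing </span>"
                + "<span> <i><b>Font </b></i> </span></p>";
        for (String part : spanHtml(html)) {
            System.out.println(part);
        }
    }
}
```

If only the plain text is wanted, span.text() can be used in place of span.html(), and span.hasText() can be used to skip spans that contain nothing but whitespace.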