Jsoup: how to extract each element

Last updated: 2023-12-03 21:14:16

If you only need to extract the text from a document, plus any <b> or <i> tags (as per your example), consider using the Whitelist class (see the jsoup docs):

String html = "<body><p class='default'> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <b>Hello World</b> </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> , Testing </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <i><b>Font </b></i> </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> Style </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <i>Check</i> </span> <span style='color: #000000; font-size: 10pt; font-family: MyriadPro-Bold;'> </span> </p></body>";

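// simpleText() already permits b, em, i, strong and u
Whitelist wl = Whitelist.simpleText();
wl.addTags("b", "i"); // add additional tags here as necessary
// Strip every tag and attribute that the whitelist does not allow
String clean = Jsoup.clean(html, wl);
System.out.println(clean);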

This outputs (as per your example):

11-07 19:04:45.738: I/System.out(318): <b>Hello World</b>   , Testing   
11-07 19:04:45.738: I/System.out(318): <i><b>Font </b></i>   Style   
11-07 19:04:45.738: I/System.out(318): <i>Check</i>
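
For readers on a newer jsoup release: as far as I know, Whitelist was renamed to Safelist in jsoup 1.14.2 and the old class was removed later. The following is a minimal sketch of the same cleaning step under that assumption. The class name SafelistSketch, the helper cleanKeepingBoldItalic and the shortened sample HTML are made up for illustration and are not part of the original answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SafelistSketch {
    // Hypothetical helper: drops every tag and attribute except simple text formatting.
    static String cleanKeepingBoldItalic(String html) {
        // simpleText() permits b, em, i, strong and u; call addTags(...) to allow more
        Safelist safelist = Safelist.simpleText();
        return Jsoup.clean(html, safelist);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the HTML string used above
        String html = "<p class='default'><span style='color: #000000;'><b>Hello World</b></span>"
                + "<span style='color: #000000;'>, Testing</span></p>";
        System.out.println(cleanKeepingBoldItalic(html));
    }
}
```

The <p> and <span> wrappers and their style attributes are removed, while the <b> tag and the text survive. Exact whitespace in the result depends on jsoup's output settings.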

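Update: to get the contents of each span as a separate string, select the span elements and collect their inner HTML in a list: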

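// Assumes html is the string defined above
Document doc = Jsoup.parse(html);

List<String> elements = new ArrayList<String>();

// Select every <span> and store its inner HTML
Elements spans = doc.select("span");

for (Element span : spans) {
    elements.add(span.html());
}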
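
The loop stores html(), which is the markup inside each span. If you want something slightly different per element, jsoup's Element also offers text() (text only, tags stripped) and outerHtml() (the markup including the span tag itself). Here is a small sketch comparing the three. The class SpanViewsSketch, the method describeSpans and the sample input are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SpanViewsSketch {
    // Hypothetical helper: one summary line per <span>, showing three views of it.
    static List<String> describeSpans(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element span : doc.select("span")) {
            lines.add("text=" + span.text()          // text only, no tags
                    + " | html=" + span.html()       // inner markup, as in the loop above
                    + " | outer=" + span.outerHtml()); // includes the <span> tag itself
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<p><span class='a'><b>Hello World</b></span><span>, Testing</span></p>";
        for (String line : describeSpans(html)) {
            System.out.println(line);
        }
    }
}
```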
