How can I extract all the HTML tags in Java using the Apache Tika HTML parser?

Updated: 2023-02-18 09:53:36

Do you want a plain-text version of an HTML file? If so, all you need is something like:

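        // Imports omitted: HtmlParser, ParseContext, Metadata and BodyContentHandler come
        // from Apache Tika; ContentHandler is org.xml.sax.ContentHandler.
        String plainText;
        try (InputStream input = new FileInputStream("myfile.html")) {
            ContentHandler handler = new BodyContentHandler(); // collects body text only
            Metadata metadata = new Metadata();                // filled in by the parser
            new HtmlParser().parse(input, handler, metadata, new ParseContext());
            plainText = handler.toString();
        }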

The BodyContentHandler, when created with no constructor arguments or with a character limit, captures only the text of the HTML body and returns it to you through toString(). The no-argument constructor applies a default write limit of 100,000 characters; passing -1 as the limit disables it.
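If you really do want the tags rather than the text, one possible approach is to supply your own SAX handler, since parse() accepts any org.xml.sax.ContentHandler. The sketch below is only an illustration: TagCollector and TagCollectorDemo are hypothetical names, and the demo feeds a well-formed XHTML string through the JDK's built-in SAX parser so that it runs without Tika. The assumption is that the same handler could be passed to HtmlParser.parse() in place of the BodyContentHandler. Note that Tika normalizes HTML to XHTML, so the elements it reports may not match the raw markup one-to-one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that records the name of every element start it receives. */
class TagCollector extends DefaultHandler {
    private final List<String> tags = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware, so fall back to qName.
        tags.add(localName == null || localName.isEmpty() ? qName : localName);
    }

    public List<String> getTags() {
        return tags;
    }
}

class TagCollectorDemo {
    /** Parses well-formed XHTML with the JDK SAX parser (a stand-in for Tika here). */
    static List<String> collect(String xhtml) throws Exception {
        TagCollector collector = new TagCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getTags();
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<html><head><title>t</title></head><body><p>Hi <b>there</b></p></body></html>";
        // Element names are listed in document order, one entry per start tag.
        System.out.println(collect(xhtml));
    }
}
```

Attributes are available through the Attributes argument of startElement if you need them as well; overriding endElement in the same way would let you track nesting.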
