且构网

Sharing the things programmers run into during development...

How do I use the Apache Tika HTML parser in Java to extract all HTML tags?

Updated: 2023-02-18 09:53:36

Do you want a plain-text version of an HTML file? If so, all you need is something like:

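        // The handler collects the text found inside the HTML body
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // try-with-resources closes the file stream once parsing finishes
        try (InputStream input = new FileInputStream("myfile.html")) {
            new HtmlParser().parse(input, handler, metadata, new ParseContext());
        }
        String plainText = handler.toString();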

The BodyContentHandler, when created with no constructor arguments or with a character limit, captures only the text of the HTML body and returns it to you.
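
If what you are after is the tags rather than the text, note that the handler passed to parse() is an ordinary SAX ContentHandler, so one option is a handler that records element names instead of characters. The following is only a rough sketch of that idea: TagCollector and collect() are hypothetical names, and the JDK's own SAX parser stands in for Tika here so the example runs without extra dependencies, which means the input must be well-formed XHTML. With Tika you would presumably pass an instance of such a handler to HtmlParser.parse() in place of the BodyContentHandler; which elements the parser actually reports depends on its configuration, so treat that part as an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: a SAX handler that records the name of every element it sees.
public class TagCollector extends DefaultHandler {
    private final List<String> tags = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware, so fall back to qName
        tags.add(localName.isEmpty() ? qName : localName);
    }

    public List<String> getTags() {
        return tags;
    }

    // Stand-in driver using the JDK SAX parser (not Tika); input must be well-formed.
    public static List<String> collect(String xhtml) throws Exception {
        TagCollector collector = new TagCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getTags();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>T</title></head>"
                + "<body><p>Hi <b>there</b></p></body></html>";
        System.out.println(collect(page)); // prints [html, head, title, body, p, b]
    }
}
```

Element names are recorded in document order, once per opening tag, so repeated tags appear repeatedly; collect them into a Set instead if you only need the distinct names.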