且构网


How do I remove all images/drawings from a PDF file and keep only the text, in Java?

Updated: 2023-12-05 15:36:28

I used Apache PDFBox in a similar situation.

To be a little more specific, try something like this:

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import java.io.IOException;

// Written against the PDFBox 1.x API; these packages and methods changed in 2.0.
public class Main {
    public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
        PDDocument document = PDDocument.load("input.pdf");
        try {
            // Try the empty password if the document is encrypted.
            if (document.isEncrypted()) {
                document.decrypt("");
            }

            // Clear the image map of each page's resources.
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                PDResources resources = page.findResources();
                if (resources != null) {
                    resources.getImages().clear();
                }
            }

            document.save("strippedOfImages.pdf");
        } finally {
            // Release the underlying file handle even if saving fails.
            document.close();
        }
    }
}

It is supposed to remove all types of images (PNG, JPEG, ...). It should work like this:

Sample article: http://s3.postimage.org/28f6boykk/before.jpg
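
To make the loop in the answer a bit more concrete, here is a toy model in plain Java of what "clear the image entries from each page's resources" amounts to. It does not use PDFBox at all: `ImageStripSketch`, `Kind`, and `stripImages` are hypothetical names invented for this illustration, and a real PDF resource dictionary is more involved than a simple map. Treat it as a sketch of the idea, not as how PDFBox is implemented.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: ImageStripSketch and Kind are hypothetical, not PDFBox types.
class ImageStripSketch {
    // Kinds of entry a page's resources might hold in this simplified model.
    enum Kind { IMAGE, FONT, FORM }

    // Removes every IMAGE entry in place and returns how many were removed.
    static int stripImages(Map<String, Kind> resources) {
        int removed = 0;
        Iterator<Map.Entry<String, Kind>> it = resources.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() == Kind.IMAGE) {
                it.remove(); // safe removal while iterating
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        // One "page" with two images and one font, keyed by resource name.
        Map<String, Kind> page = new LinkedHashMap<>();
        page.put("Im0", Kind.IMAGE);
        page.put("F1", Kind.FONT);
        page.put("Im1", Kind.IMAGE);

        int removed = stripImages(page);
        System.out.println(removed + " image(s) removed, remaining: " + page.keySet());
    }
}
```

In this model only the image entries are dropped, so the font entry that the text depends on stays in place, which is the effect the answer is aiming for.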