且构网

How to check whether a PDF page is an image (scanned) using PDFBox or XPDF

Updated: 2023-11-25 12:40:34

Extract images properly

As the updated PDF makes clear, the problem is that the page has no images directly on it; instead, form XObjects that contain the images are drawn onto it. Thus, the image search has to recurse into the form XObjects.

And that is not all: all pages in the updated PDF share the same resources dictionary; each page merely picks a different one of its form XObjects to display. Thus, one really has to parse the respective page content stream to determine which XObject (with which images) is present on a given page.
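To illustrate why the content stream matters, here is a minimal toy sketch that does not use PDFBox at all. The class XObjectWalkSketch, its XObj type, and the names Fm1..Fm4 and Im1..Im4 are hypothetical stand-ins, and the whitespace tokenizer is a simplification that ignores compression, strings, and inline images. It only shows the idea of following "/Name Do" operators and recursing into forms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model, not PDFBox: lists the images a content stream really draws. */
final class XObjectWalkSketch {

    /** Hypothetical stand-in for an XObject: an image, or a form with its own stream and resources. */
    static final class XObj {
        final String content;               // form content stream; null for an image
        final Map<String, XObj> resources;  // form resources; null for an image

        XObj(String content, Map<String, XObj> resources) {
            this.content = content;
            this.resources = resources;
        }

        boolean isImage() {
            return content == null;
        }
    }

    /** Follows every "/Name Do" operator, recursing into forms, and returns the image names hit. */
    static List<String> drawnImages(String content, Map<String, XObj> resources) {
        List<String> found = new ArrayList<>();
        String[] tokens = content.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (!tokens[i].equals("Do") || !tokens[i - 1].startsWith("/")) {
                continue;
            }
            String name = tokens[i - 1].substring(1);
            XObj xobject = resources.get(name);
            if (xobject == null) {
                continue; // name is not in this resources dictionary
            }
            if (xobject.isImage()) {
                found.add(name);
            } else {
                found.addAll(drawnImages(xobject.content, xobject.resources));
            }
        }
        return found;
    }

    /** One shared dictionary with forms Fm1..Fm4, each wrapping one image Im1..Im4. */
    static Map<String, XObj> sharedResources() {
        Map<String, XObj> shared = new HashMap<>();
        for (int i = 1; i <= 4; i++) {
            Map<String, XObj> formResources = new HashMap<>();
            formResources.put("Im" + i, new XObj(null, null));
            shared.put("Fm" + i, new XObj("q /Im" + i + " Do Q", formResources));
        }
        return shared;
    }

    public static void main(String[] args) {
        // Page 2 references the shared dictionary but only draws Fm2.
        System.out.println(drawnImages("q /Fm2 Do Q", sharedResources())); // prints [Im2]
    }
}
```

Scanning the shared dictionary alone would report four images for every page, whereas walking the stream reports only the one that is actually drawn. For real files, a content stream engine such as the one PDFBox provides is the safer route.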

Actually, this is something the PDFBox tool ExtractImages does. Unfortunately, though, it does not report the page on which it found a given image; cf. the ExtractImages.java test method testExtractPageImagesTool10948New.

But we can simply borrow from the technique used by that tool:

// Uses the PDFBox 2.x API; PDFBox 3.x loads documents via Loader.loadPDF instead.
PDDocument document = PDDocument.load(resource);
int page = 1;
for (final PDPage pdPage : document.getPages())
{
    final int currentPage = page;
    PDFGraphicsStreamEngine pdfGraphicsStreamEngine = new PDFGraphicsStreamEngine(pdPage)
    {
        int index = 0;
        
        @Override
        public void drawImage(PDImage pdImage) throws IOException
        {
            if (pdImage instanceof PDImageXObject)
            {
                PDImageXObject image = (PDImageXObject)pdImage;
                File file = new File(RESULT_FOLDER, String.format("10948-new-engine-%s-%s.%s", currentPage, index, image.getSuffix()));
                // Close the output stream even if writing the image fails.
                try (FileOutputStream out = new FileOutputStream(file))
                {
                    ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), out);
                }
                index++;
            }
        }

        @Override
        public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }

        @Override
        public void clip(int windingRule) throws IOException { }

        @Override
        public void moveTo(float x, float y) throws IOException {  }

        @Override
        public void lineTo(float x, float y) throws IOException { }

        @Override
        public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {  }

        @Override
        public Point2D getCurrentPoint() throws IOException { return null; }

        @Override
        public void closePath() throws IOException { }

        @Override
        public void endPath() throws IOException { }

        @Override
        public void strokePath() throws IOException { }

        @Override
        public void fillPath(int windingRule) throws IOException { }

        @Override
        public void fillAndStrokePath(int windingRule) throws IOException { }

        @Override
        public void shadingFill(COSName shadingName) throws IOException { }
    };
    pdfGraphicsStreamEngine.processPage(pdPage);
    page++;
}
// Release the resources held by the document once all pages are processed.
document.close();

(ExtractImages.java test method testExtractPageImages10948New)

This code outputs images with file names "10948-new-engine-1-0.tiff", "10948-new-engine-2-0.tiff", "10948-new-engine-3-0.tiff", and "10948-new-engine-4-0.tiff", i.e. one per page.

PS: Please remember to include com.github.jai-imageio:jai-imageio-core in your classpath; it is required for TIFF output.
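A quick way to see whether a TIFF writer is actually registered is to ask the standard Image I/O registry before running the extraction. The following is a small sketch using only the JDK; the class and method names (ImageWriterCheck, hasWriter) are made up for illustration. Note that JDK 9 and later ship a TIFF plugin of their own, so on those versions the check may succeed even without the extra dependency.

```java
import javax.imageio.ImageIO;

/** Hypothetical helper: reports whether Image I/O can write a given format. */
final class ImageWriterCheck {

    /** True if some registered Image I/O plugin can write the given format name. */
    static boolean hasWriter(String formatName) {
        return ImageIO.getImageWritersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // The result depends on the JDK version and on the plugins in the classpath.
        System.out.println("TIFF writer available: " + hasWriter("tiff"));
    }
}
```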

Another issue the OP raised was that the images sometimes appear flipped upside-down, e.g. in his newest sample file "t1_edited.pdf". The reason is that those images are indeed stored upside-down as image resources in the PDF.

When those images are drawn onto a page, the current transformation matrix in effect at that time mirrors the drawn image vertically and thus produces the expected appearance.
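The mirroring can be made concrete with plain java.awt.geom classes. In the sketch below, the class name CtmFlipSketch and the matrix values are invented for illustration. A PDF image is drawn into the unit square, so a matrix [a b c d e f] with a negative d maps the top edge of that square below its bottom edge.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical demo: detects a vertical mirror from where the unit square ends up. */
final class CtmFlipSketch {

    /** True if the CTM maps the top edge of the unit square below its bottom edge. */
    static boolean mirrorsVertically(AffineTransform ctm) {
        Point2D bottom = ctm.transform(new Point2D.Double(0, 0), null);
        Point2D top = ctm.transform(new Point2D.Double(0, 1), null);
        return top.getY() < bottom.getY();
    }

    public static void main(String[] args) {
        // PDF matrix [a b c d e f]; AffineTransform takes the six values in the same order.
        // Both matrices cover the same 200 x 100 rectangle with its lower left corner at (50, 700).
        AffineTransform upright = new AffineTransform(200, 0, 0, 100, 50, 700);
        AffineTransform mirrored = new AffineTransform(200, 0, 0, -100, 50, 800);
        System.out.println(mirrorsVertically(upright));  // prints false
        System.out.println(mirrorsVertically(mirrored)); // prints true
    }
}
```

For matrices without rotation or skew, this test agrees with simply checking the sign of d.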

By slightly enhancing the drawImage implementation in the code above, one can include indicators of such flips in the names of the exported images:

public void drawImage(PDImage pdImage) throws IOException
{
    if (pdImage instanceof PDImageXObject)
    {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        String flips = "";
        if (ctm.getScaleX() < 0)
            flips += "h";
        if (ctm.getScaleY() < 0)
            flips += "v";
        if (flips.length() > 0)
            flips = "-" + flips;
        PDImageXObject image = (PDImageXObject)pdImage;
        File file = new File(RESULT_FOLDER, String.format("t1_edited-engine-%s-%s%s.%s", currentPage, index, flips, image.getSuffix()));
        // Close the output stream even if writing the image fails.
        try (FileOutputStream out = new FileOutputStream(file))
        {
            ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), out);
        }
        index++;
    }
}

Now vertically or horizontally flipped images are marked accordingly.
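If the exported files should look like the page rather than merely carry a marker, the stored image could be mirrored before it is written. The following is only a sketch using the JDK; FlipCorrection and mirror are hypothetical names, and the copy is converted to an RGB or ARGB image, which may not suit every image type.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: mirrors an exported image along the flagged axes. */
final class FlipCorrection {

    /** Returns a copy of source mirrored along the requested axes (as an RGB or ARGB image). */
    static BufferedImage mirror(BufferedImage source, boolean horizontal, boolean vertical) {
        int width = source.getWidth();
        int height = source.getHeight();
        // Keep an alpha channel only if the source has one.
        int type = source.getColorModel().hasAlpha()
                ? BufferedImage.TYPE_INT_ARGB
                : BufferedImage.TYPE_INT_RGB;
        BufferedImage result = new BufferedImage(width, height, type);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int targetX = horizontal ? width - 1 - x : x;
                int targetY = vertical ? height - 1 - y : y;
                result.setRGB(targetX, targetY, source.getRGB(x, y));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(1, 2, BufferedImage.TYPE_INT_ARGB);
        image.setRGB(0, 0, 0xFFFF0000); // red on top
        image.setRGB(0, 1, 0xFF0000FF); // blue at the bottom
        BufferedImage fixed = mirror(image, false, true);
        System.out.println(Integer.toHexString(fixed.getRGB(0, 0))); // prints ff0000ff
    }
}
```

Inside drawImage, one might then pass image.getImage() together with ctm.getScaleX() < 0 and ctm.getScaleY() < 0 to such a helper and write the result instead of the raw image.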