How to split a PDF based on a size limit?

Updated: 2023-01-11 16:37:42

Imagine a 3000 KB document with ten pages and the following objects:

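- four font subsets that are used on every page, of about 50 KB each
- ten images, each shown on a single page, of about 200 KB each (one image per page)
- four images that are shown on every page, of about 50 KB each
- ten pages, each with a content stream of about 25 KB
- about 350 KB for objects such as the catalog, the info dictionary, the page tree, the cross-reference table, and so on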

A single page will need at least:
- the four font subsets: 4 times 50 KB
- the single image: 1 time 200 KB
- the four images: 4 times 50 KB
- a single content stream: 1 time 50 KB
- a slightly reduced cross-reference table, a slightly reduced page tree, an almost identical catalog, an info dictionary of identical size, and so on: 200 KB

Together that's 850 KB. This means that you end up with 8500 KB (10 times 850 KB) if you split up a 10-page 3000 KB PDF document into 10 separate pages.
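
The arithmetic behind that estimate can be written out as a small sketch. The class and method names are made up for illustration, and every figure is a guess taken from the per-page breakdown above, not a measurement:

```java
// Worked arithmetic for the per-page estimate; all sizes are in KB.
class SplitOverheadEstimate {

    // Estimated size of one single-page file.
    static int singlePageKb() {
        int fontSubsets = 4 * 50;   // four font subsets, needed by every page
        int pageImage = 1 * 200;    // the one image that is unique to this page
        int sharedImages = 4 * 50;  // four images that are shown on every page
        int contentStream = 50;     // content stream, as stated in the per-page breakdown
        int structure = 200;        // catalog, info dictionary, page tree, cross-reference table
        return fontSubsets + pageImage + sharedImages + contentStream + structure;
    }

    // Estimated total size when every page becomes its own file.
    static int allSinglePagesKb(int pages) {
        return pages * singlePageKb();
    }

    public static void main(String[] args) {
        System.out.println("One page:  " + singlePageKb() + " KB");
        System.out.println("Ten pages: " + allSinglePagesKb(10) + " KB (original file: 3000 KB)");
    }
}
```

The growth comes from the shared objects: the font subsets and the four shared images are stored once in the original file, but each of the ten parts needs its own copy.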

This example is the result of guesswork (based on experience) and it assumes that the PDF is predictable. Most PDFs aren't:

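- Some pages will require high-definition images (maybe even megabytes), other pages won't have any images.
- Some pages will require many different fonts and font subsets (kilobytes), other pages will only consist of some vector drawings (tiny content streams once compressed).
- Different pages can share plenty of resources (Form XObjects, Image XObjects, ...), other pages won't share any resources.
- And so on...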

You have noticed that yourself, as you write: "I can split that document by pages. But that is also not a good solution as the page size is also not evenly distributed across the pages."

That's exactly why your question can have no other answer than: you'll have to do trial and error. No software can predict how much space is needed by a page before you look at what is needed by that page.

Update:

As David indicates in the comments, it is possible to calculate all the resources needed for a page, and to check if the current resources plus the needed resources exceed the maximum file size.

I've written a small example:

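// iText 5 (com.itextpdf.text) imports
import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfSmartCopy;

// Copies src to dest page by page and prints the number of bytes written so far.
public void manipulatePdf(String src, String dest)
    throws IOException, DocumentException {
    Document document = new Document();
    PdfCopy copy = new PdfSmartCopy(document, new FileOutputStream(dest));
    document.open();
    PdfReader reader = new PdfReader(src);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // check resources needed for reader.getPageN(i);
        copy.addPage(copy.getImportedPage(reader, i));
        // running total of bytes written to dest after this page
        System.out.println("After adding page: " + copy.getOs().getCounter());
    }
    // closing the document writes the page tree and the cross-reference stream
    document.close();
    System.out.println("After closing document: " + copy.getOs().getCounter());
    reader.close();
}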

I have executed the example on a PDF sample with 18 pages and this was the output:

After adding page: 56165
After adding page: 111398
After adding page: 162691
After adding page: 210035
After adding page: 253419
After adding page: 273429
After adding page: 330696
After adding page: 351564
After adding page: 400351
After adding page: 456545
After adding page: 495321
After adding page: 523640
After adding page: 576468
After adding page: 633525
After adding page: 751504
After adding page: 907490
After adding page: 957164
After adding page: 999140
After closing document: 1002509

You see how the file size of the copy gradually grows with each page that is added. After all pages are added, the size is 999140 bytes, and then the page tree and cross-reference stream are written, adding another 3369 bytes.

Where it says // check resources needed for reader.getPageN(i);, you could make a guesstimate of the size that will be added for the page and break out of the loop if it exceeds a maximum value.
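
To illustrate the idea without touching a PDF, here is a rough sketch that plans split points from the byte counters printed above. The class name, the method, the 200000-byte limit, and the closing overhead are assumptions for illustration; none of this is iText API. The increments were measured inside a single continuous copy, so a page's increment leaves out resources that earlier pages had already written. A real part that starts at that page has to include them again, so treat the plan as a lower bound and verify each part by trial:

```java
import java.util.ArrayList;
import java.util.List;

class SplitPlanner {

    // Running byte counts after each page, copied from the sample output above.
    static final long[] COUNTER = {
        56165, 111398, 162691, 210035, 253419, 273429, 330696, 351564, 400351,
        456545, 495321, 523640, 576468, 633525, 751504, 907490, 957164, 999140
    };

    // Greedily groups consecutive pages so that the estimated size of each part
    // (page bytes plus closingOverhead) stays within maxBytes.
    // Returns the 1-based number of the first page of every part.
    static List<Integer> planParts(long[] counter, long maxBytes, long closingOverhead) {
        List<Integer> firstPages = new ArrayList<>();
        long current = 0;   // estimated bytes in the part being filled
        long previous = 0;  // counter value after the previous page
        for (int i = 0; i < counter.length; i++) {
            long pageBytes = counter[i] - previous; // bytes added by page i + 1
            previous = counter[i];
            if (firstPages.isEmpty() || current + pageBytes + closingOverhead > maxBytes) {
                // Start a new part; a page that is too large on its own still gets one.
                firstPages.add(i + 1);
                current = 0;
            }
            current += pageBytes;
        }
        return firstPages;
    }

    public static void main(String[] args) {
        // 3369 is the number of bytes added on close in the sample run above.
        System.out.println(planParts(COUNTER, 200000, 3369));
    }
}
```

With these numbers the plan starts new parts at pages 1, 4, 9, 13, 15, 16 and 17.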

Why a guesstimate?

- You could be counting objects that are already added. If you keep track of the objects (not that difficult), your guess will be more accurate.
- I'm using PdfSmartCopy. Suppose that there are two identical objects inside your PDF. Bad PDF software often causes such problems. For instance: the same image bytes are added twice to the file. PdfSmartCopy can detect this and will reuse the first object it encounters instead of adding the redundant bytes of the extra object.
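
Keeping track of the objects could look roughly like the sketch below. It assumes you have already collected, for each page, a map from object number to object size in bytes; the class, the map, and the sample sizes are hypothetical and not part of iText:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ResourceTracker {

    // Object numbers that have already been counted for the current output file.
    private final Set<Integer> written = new HashSet<>();

    // Returns the bytes a page would add, counting only objects that were not
    // counted before, and remembers those objects for later pages.
    long addPage(Map<Integer, Long> pageObjects) {
        long added = 0;
        for (Map.Entry<Integer, Long> entry : pageObjects.entrySet()) {
            if (written.add(entry.getKey())) { // add() returns true only for new objects
                added += entry.getValue();
            }
        }
        return added;
    }

    public static void main(String[] args) {
        Map<Integer, Long> page1 = new HashMap<>();
        page1.put(1, 50000L);   // shared font subset (made-up size)
        page1.put(2, 200000L);  // image used on page 1 only
        Map<Integer, Long> page2 = new HashMap<>();
        page2.put(1, 50000L);   // same font subset, already counted
        page2.put(3, 25000L);   // content stream of page 2

        ResourceTracker tracker = new ResourceTracker();
        System.out.println(tracker.addPage(page1));
        System.out.println(tracker.addPage(page2));
    }
}
```

This only catches objects that are shared by reference. Two byte-identical objects with different object numbers would still be counted twice, which is exactly the case PdfSmartCopy handles, so the estimate can end up higher than the real output.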

We currently don't have a reader.getTotalPageBytes() in PdfReader because PdfReader tries to use as little memory as possible. It won't load any objects into memory as long as these objects aren't needed. Hence it doesn't know the size of each object before the page is imported.

However, I'll make sure that such a method is added in the next release.

Update:

In the next version, you'll find a tool named SmartPdfSplitter that depends on a new class named PdfResourceCounter. You can use it like this:

PdfReader reader = new PdfReader(src);
SmartPdfSplitter splitter = new SmartPdfSplitter(reader);
int part = 1;
while (splitter.hasMorePages()) {
    splitter.split(new FileOutputStream("results/merge/part_" + part + ".pdf"), 200000);
    part++;
}
reader.close();

Note that this can result in a single-page PDF that exceeds the limit (which was set to 200000 bytes in the code sample) if that single page cannot be reduced to fewer bytes. In that case, splitter.isOverSized() will return true and you'll have to find another way to reduce the PDF.
