Updated: 2023-01-11 16:37:42
Imagine a 3000 KB document with ten pages and the following objects:
A single page will need at least:
- the four font subsets: 4 times 50 KB
- the single image: 1 time 200 KB
- the four images: 4 times 50 KB
- a single content stream: 1 time 50 KB
- a slightly reduced cross-reference table, a slightly reduced page tree, an almost identical catalog, an info dictionary of identical size, ...: 200 KB
Together that's 850 KB. This means that you end up with 8500 KB (10 times 850 KB) if you split up a 10-page 3000 KB PDF document into 10 separate pages.
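The arithmetic behind those two totals can be written out as a small sketch. The class and method names (SplitEstimate, perPageKb) are made up for illustration; the numbers are the guessed figures from the example, not measurements.

```java
// Back-of-the-envelope estimate using the guessed figures from the example.
// SplitEstimate and perPageKb are illustrative names, not part of any library.
final class SplitEstimate {

    /** Estimated size in KB of one page extracted into its own file. */
    static int perPageKb() {
        int fontSubsets = 4 * 50;    // four font subsets, 50 KB each
        int singleImage = 1 * 200;   // the single 200 KB image
        int fourImages = 4 * 50;     // four images, 50 KB each
        int contentStream = 1 * 50;  // one content stream
        int overhead = 200;          // cross-reference table, page tree, catalog, info dictionary, ...
        return fontSubsets + singleImage + fourImages + contentStream + overhead;
    }

    public static void main(String[] args) {
        int pages = 10;
        int originalKb = 3000;
        int allPartsKb = pages * perPageKb();
        System.out.println("One page on its own: " + perPageKb() + " KB");
        System.out.println("All " + pages + " parts: " + allPartsKb
                + " KB, original: " + originalKb + " KB");
    }
}
```

The resources that every page needs are repeated in every part, which is why the parts add up to far more than the original file.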
This example is the result of guesswork (based on experience) and it assumes that the PDF is predictable. Most PDFs aren't:
You have noticed that yourself, as you write: "I can split that document by pages. But that is also not a good solution as the pagesize is also not evenly distributed across the pages."
That's exactly why your question can have no other answer than: you'll have to do trial and error. No software can predict how much space is needed by a page before you look at what is needed by that page.
Update:
As David indicates in the comments, it is possible to calculate all the resources needed for a page, and to check if the current resources plus the needed resources exceed the maximum file size.
I wrote a small example:
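    import com.itextpdf.text.Document;
    import com.itextpdf.text.DocumentException;
    import com.itextpdf.text.pdf.PdfCopy;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.PdfSmartCopy;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public void manipulatePdf(String src, String dest)
            throws IOException, DocumentException {
        Document document = new Document();
        // PdfSmartCopy reuses identical objects instead of writing them twice
        PdfCopy copy = new PdfSmartCopy(document, new FileOutputStream(dest));
        document.open();
        PdfReader reader = new PdfReader(src);
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            // check resources needed for reader.getPageN(i);
            copy.addPage(copy.getImportedPage(reader, i));
            // number of bytes written to the output so far
            System.out.println("After adding page: " + copy.getOs().getCounter());
        }
        // closing writes the page tree and the cross-reference stream
        document.close();
        System.out.println("After closing document: " + copy.getOs().getCounter());
        reader.close();
    }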
I have executed the example on a PDF sample with 18 pages and this was the output:
After adding page: 56165
After adding page: 111398
After adding page: 162691
After adding page: 210035
After adding page: 253419
After adding page: 273429
After adding page: 330696
After adding page: 351564
After adding page: 400351
After adding page: 456545
After adding page: 495321
After adding page: 523640
After adding page: 576468
After adding page: 633525
After adding page: 751504
After adding page: 907490
After adding page: 957164
After adding page: 999140
After closing document: 1002509
You see how the file size of the copy gradually grows with each page that is added. After all pages are added, the size is 999140 bytes, and then the page tree and cross-reference stream are written, adding another 3369 bytes.
Where it says "// check resources needed for reader.getPageN(i);", you could make a guesstimate of the size that will be added for the page and break out of the loop if it exceeds a maximum value.
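To make the "break out of the loop" idea concrete, here is a minimal sketch that does not depend on iText. It takes the cumulative byte counts printed for the 18-page sample, turns them into per-page increments, and greedily groups pages under a limit. The names SplitPlanner, increments and plan are hypothetical, and the 200000-byte limit is only an example value. The increments are themselves an approximation: a page that starts a new file also pays for shared resources that an earlier page had already written in the measured run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText: plans a split from estimated per-page sizes.
final class SplitPlanner {

    // Cumulative sizes logged after each addPage() call for the 18-page sample.
    static final long[] LOGGED = {
        56165, 111398, 162691, 210035, 253419, 273429, 330696, 351564, 400351,
        456545, 495321, 523640, 576468, 633525, 751504, 907490, 957164, 999140
    };

    /** Converts cumulative byte counts into the bytes added by each page. */
    static long[] increments(long[] cumulative) {
        long[] perPage = new long[cumulative.length];
        long previous = 0;
        for (int i = 0; i < cumulative.length; i++) {
            perPage[i] = cumulative[i] - previous;
            previous = cumulative[i];
        }
        return perPage;
    }

    /** Greedily groups 1-based page numbers so that each group's estimate stays within maxBytes. */
    static List<List<Integer>> plan(long[] pageBytes, long maxBytes) {
        List<List<Integer>> parts = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long used = 0;
        for (int i = 0; i < pageBytes.length; i++) {
            // Start a new part when this page would push the current part over the limit.
            if (!current.isEmpty() && used + pageBytes[i] > maxBytes) {
                parts.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(i + 1); // a single oversized page still gets a part of its own
            used += pageBytes[i];
        }
        if (!current.isEmpty()) {
            parts.add(current);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = plan(increments(LOGGED), 200_000);
        for (int p = 0; p < parts.size(); p++) {
            System.out.println("Part " + (p + 1) + ": pages " + parts.get(p));
        }
    }
}
```

Because the estimate can be off in either direction, a real splitter would still have to check the actual output size of each part after writing it.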
Why a guesstimate? Because the example uses PdfSmartCopy. Suppose that there are two identical objects inside your PDF. Bad PDF software often causes such problems. For instance: the same image bytes are added twice to the file. PdfSmartCopy can detect this and will reuse the first object it encounters instead of adding the redundant bytes of the extra object.
We currently don't have a reader.getTotalPageBytes() in PdfReader because PdfReader tries to use as little memory as possible. It won't load any objects into memory as long as these objects aren't needed. Hence it doesn't know the size of each object before the page is imported.
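The effect of reusing identical objects on the byte count can be illustrated with a small stand-alone sketch. This is not how PdfSmartCopy is implemented; DedupSketch and its methods are hypothetical names, and hashing the raw bytes with SHA-256 is an assumption made for the illustration. It only shows why the bytes a page adds depend on what has already been written.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Illustration only: a toy object store that writes identical byte sequences once.
final class DedupSketch {
    private final Map<String, Integer> idByDigest = new HashMap<>();
    private long bytesWritten = 0;
    private int nextId = 1;

    /** Returns the id of the stored object; the bytes are counted only if they are new. */
    int add(byte[] objectBytes) {
        String key = digest(objectBytes);
        Integer existing = idByDigest.get(key);
        if (existing != null) {
            return existing; // reuse the first object that had these bytes
        }
        int id = nextId++;
        idByDigest.put(key, id);
        bytesWritten += objectBytes.length;
        return id;
    }

    long bytesWritten() {
        return bytesWritten;
    }

    private static String digest(byte[] data) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
            return Base64.getEncoder().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DedupSketch store = new DedupSketch();
        byte[] image = new byte[50_000];     // stand-in for image bytes
        byte[] sameImage = new byte[50_000]; // identical content added a second time
        int first = store.add(image);
        int second = store.add(sameImage);
        System.out.println("Same object reused: " + (first == second));
        System.out.println("Bytes written: " + store.bytesWritten());
    }
}
```

In this toy model, the second copy of the image adds no bytes, so an estimate that simply sums the sizes of a page's resources would overshoot.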
However, I'll make sure that such a method is added in the next release.
Update:
In the next version, you'll find a tool named SmartPdfSplitter that depends on a new class named PdfResourceCounter. You can use it like this:
    PdfReader reader = new PdfReader(src);
    SmartPdfSplitter splitter = new SmartPdfSplitter(reader);
    int part = 1;
    while (splitter.hasMorePages()) {
        splitter.split(new FileOutputStream("results/merge/part_" + part + ".pdf"), 200000);
        part++;
    }
    reader.close();
Note that this can result in a single-page PDF that exceeds the limit (which was set to 200000 bytes in the code sample) if that single page cannot be reduced to fewer bytes. In that case, splitter.isOverSized() will return true and you'll have to find another way to reduce the PDF.