Retrieving page numbers from a document with pyPDF

Updated: 2023-02-14 09:02:44

For full documentation, see Adobe's 978-page PDF Reference. :-)

More specifically, the PDF file contains metadata that indicates how the PDF's physical pages are mapped to logical page numbers and how page numbers should be formatted. This is where you go for canonical results. Example 2 of this page shows how this looks in the PDF markup. You'll have to fish that out, parse it, and perform a mapping yourself.

In PyPDF, to get at this information, try, as a starting point:

pdf.trailer["/Root"]["/PageLabels"]["/Nums"]

By the way, when you see an IndirectObject instance, you can call its getObject() method to retrieve the actual object being pointed to.
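
The mapping step described above can be sketched roughly as follows. This is only an illustration, not PyPDF's own API: the helper names (`resolve`, `to_roman`, `to_letters`, `format_label`, `page_labels`) are made up for this example. It assumes the `/Nums` array sits directly under `/PageLabels`, as in the expression above, rather than being split across `/Kids` nodes. Falling back to the physical page number for pages before the first labelled range is also an assumption. The style codes (`/D`, `/R`, `/r`, `/A`, `/a`), the prefix `/P`, and the start value `/St` come from the page-label dictionary in the PDF Reference.

```python
def resolve(obj):
    """Follow an IndirectObject (anything with getObject()) to its target."""
    return obj.getObject() if hasattr(obj, "getObject") else obj


def to_roman(n):
    """Convert a positive integer to an uppercase Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def to_letters(n):
    """A..Z for 1..26, then AA..ZZ for 27..52, and so on."""
    index = n - 1
    return chr(ord("A") + index % 26) * (index // 26 + 1)


def format_label(style, number, prefix=""):
    """Build one label from a numbering style, a value, and a prefix."""
    if style == "/D":
        body = str(number)
    elif style == "/R":
        body = to_roman(number)
    elif style == "/r":
        body = to_roman(number).lower()
    elif style == "/A":
        body = to_letters(number)
    elif style == "/a":
        body = to_letters(number).lower()
    else:
        body = ""  # no /S entry: the label is the prefix alone
    return prefix + body


def page_labels(nums, page_count):
    """Map each physical page index (0-based) to its logical label.

    nums is the flat /Nums array: [start_index, label_dict, start_index, ...].
    """
    nums = resolve(nums)
    ranges = []
    for i in range(0, len(nums) - 1, 2):
        ranges.append((int(resolve(nums[i])), resolve(nums[i + 1])))
    ranges.sort(key=lambda pair: pair[0])

    labels = []
    for page in range(page_count):
        current = None
        for start, label_dict in ranges:
            if start <= page:
                current = (start, label_dict)
        if current is None:
            # Assumption: fall back to the physical page number.
            labels.append(str(page + 1))
            continue
        start, label_dict = current
        style = resolve(label_dict.get("/S"))
        prefix = resolve(label_dict.get("/P", ""))
        first = int(resolve(label_dict.get("/St", 1)))
        labels.append(format_label(style, first + page - start, str(prefix)))
    return labels


# Plain-Python stand-in for a /Nums array, so the sketch runs without a PDF.
example_nums = [0, {"/S": "/r"}, 4, {"/S": "/D"},
                7, {"/S": "/D", "/P": "A-", "/St": 8}]
print(page_labels(example_nums, 9))
```

With a real file, something like page_labels(pdf.trailer["/Root"]["/PageLabels"]["/Nums"], pdf.getNumPages()) should be the equivalent call, though documents without a /PageLabels entry will raise a KeyError and need separate handling.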

Your alternative is, as you say, to check the text objects and try to figure out which one is the page number. You could use the page object's extractText() for this, but you'll get a single string back and have to fish the page number out of it. (And of course the page number might be Roman or alphabetic instead of numeric, and some pages may not be numbered.) Instead, have a look at how extractText() actually does its job (PyPDF is written in Python, after all) and use it as the basis for a routine that checks each text object on the page individually to see whether it looks like a page number. Be wary of TOC/index pages that have lots of page numbers on them!
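
As a rough sketch of the "does this text object look like a page number" check: the code below assumes you have already collected the page's text fragments as separate strings, for example by adapting the loop inside extractText(). That collection step is not shown. The function names and the `max_candidates` threshold are hypothetical. Alphabetic labels are deliberately left out because single letters are too ambiguous for a heuristic this simple.

```python
import re

# Up to four digits, e.g. "7" or "1024".
DECIMAL = re.compile(r"^\d{1,4}$")

# A well-formed Roman numeral; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"^(?=[ivxlcdm]+$)m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)


def page_number_candidates(fragments):
    """Return the fragments that look like a page number on their own."""
    candidates = []
    for fragment in fragments:
        # Tolerate decorations such as "- 12 -".
        text = fragment.strip().strip("-").strip()
        if DECIMAL.match(text) or ROMAN.match(text):
            candidates.append(text)
    return candidates


def guess_page_number(fragments, max_candidates=1):
    """Return the page number if exactly one unambiguous candidate exists.

    Pages with several candidates (tables of contents, indexes) or with
    none (unnumbered pages) yield None instead of a guess.
    """
    candidates = page_number_candidates(fragments)
    if not candidates or len(candidates) > max_candidates:
        return None
    return candidates[0]


print(guess_page_number(["Chapter 1", "Some body text.", "- 12 -"]))
print(guess_page_number(["Contents", "Intro", "1", "Methods", "7"]))
```

Returning None for pages with several candidates is a conservative choice. A more elaborate version could use the position of each text object on the page, which this sketch ignores.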