Parsing PDF files with regular expressions in Python

Updated: 2023-09-02 17:50:28

If you rely on regular expressions alone, it is easy to construct a PDF file that your program will not be able to handle. PDF dictionaries and lists can contain other objects, and regular expressions cannot handle recursive structures, at least not the ones in Python's re module.
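To make the nesting problem concrete, here is a minimal sketch. The pattern NAIVE_DICT and the helper naive_dict_body are made up for illustration; they stand in for the kind of "match from << to >>" pattern one might write first.

```python
import re

# Naive pattern: match from "<<" to the nearest ">>".
NAIVE_DICT = re.compile(rb"<<(.*?)>>", re.DOTALL)

flat = b"<< /Author (MizardX) >>"
nested = b"<< /Info << /Author (MizardX) >> /Count 3 >>"


def naive_dict_body(data):
    """Return the bytes the naive pattern believes are the dictionary body."""
    match = NAIVE_DICT.search(data)
    return match.group(1) if match else None


print(naive_dict_body(flat))    # the whole body, as intended
print(naive_dict_body(nested))  # stops at the inner ">>", so /Count is lost
```

Making the pattern greedy does not help either: it would then run to the last ">>" in the file and swallow unrelated objects.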

A PDF file is a tree of objects and streams:

  • Dictionaries: << (name value)* >>
  • Lists: [ (value)* ]
  • Names: / (regular char)*
  • Strings: ( (char)* )
  • Hex strings: < (hexchar)* >
  • Numbers: (-)? ((digit)+ | (digit)+ . (digit)* | . (digit)+)
  • Booleans: true | false
  • References: (digit)+ (whitespace)+ (digit)+ (whitespace)+ R
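One way to combine the two approaches is to let a regex do the tokenizing only and leave the nesting to a small recursive function. The sketch below follows the grammar above under some simplifying assumptions: the input is already decoded to str, strings contain no nested or escaped parentheses, and names such as TOKEN, tokenize, parse_value and parse are invented for this example.

```python
import re

TOKEN = re.compile(r"""
    (?P<ws>\s+|%[^\r\n]*)                 # whitespace and comments
  | (?P<dict_open><<) | (?P<dict_close>>>)
  | (?P<list_open>\[) | (?P<list_close>\])
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<string>\([^()]*\))                # assumption: no nested or escaped parens
  | (?P<hex><[0-9A-Fa-f\s]*>)
  | (?P<ref>\d+\s+\d+\s+R\b)
  | (?P<number>[-+]?(?:\d+\.\d*|\.\d+|\d+))
  | (?P<bool>true\b|false\b)
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, text) pairs, dropping whitespace and comments."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "ws":
            yield m.lastgroup, m.group()
        pos = m.end()


def parse_value(tokens, first=None):
    """Parse one value; recursion handles nested dictionaries and lists."""
    kind, text = first or next(tokens)
    if kind == "dict_open":
        result = {}
        for key_kind, key in tokens:
            if key_kind == "dict_close":
                return result
            if key_kind != "name":
                raise ValueError(f"expected a name, got {key!r}")
            result[key] = parse_value(tokens)
        raise ValueError("unterminated dictionary")
    if kind == "list_open":
        items = []
        for token in tokens:
            if token[0] == "list_close":
                return items
            items.append(parse_value(tokens, token))
        raise ValueError("unterminated list")
    if kind == "number":
        return float(text) if "." in text else int(text)
    if kind == "bool":
        return text == "true"
    if kind == "ref":
        num, gen, _ = text.split()
        return ("ref", int(num), int(gen))
    if kind in ("name", "string", "hex"):
        return text  # kept as raw text in this sketch
    raise ValueError(f"unexpected token {text!r}")


def parse(text):
    return parse_value(tokenize(text))


print(parse("<< /Kids [3 0 R 4 0 R] /Info << /Author (MizardX) >> >>"))
```

The reference alternative is listed before the number alternative on purpose, so that "3 0 R" is seen as one token rather than two numbers followed by a stray R.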

Whitespace and comments are ignored in most places. Comments start with % and run until the end of the line.
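The "in most places" matters: a % inside a parenthesised string is data, not a comment. A rough sketch of a comment stripper that respects this is shown below. The function name strip_comments is hypothetical, and the sketch assumes it is only applied to text outside of stream data.

```python
def strip_comments(text):
    """Remove %-comments, leaving '%' inside ( ... ) strings untouched."""
    out = []
    depth = 0  # nesting depth of literal-string parentheses
    i = 0
    while i < len(text):
        ch = text[i]
        if depth and ch == "\\":
            out.append(text[i:i + 2])  # keep the escaped character verbatim
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")" and depth:
            depth -= 1
        elif ch == "%" and depth == 0:
            # Skip to the end of the line, but keep the line break itself.
            while i < len(text) and text[i] not in "\r\n":
                i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


print(strip_comments("/Type /Catalog % more keys\n/Title (100% done)"))
```

Note that the %PDF header and the %%EOF marker are technically comments too, so a stripper like this removes them as well.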

An indirect object is specified as:

1 0 obj
(any object)
endobj
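For flat framing like this, a regex is a reasonable first pass. The sketch below collects the raw bytes between obj and endobj; the names INDIRECT_OBJ and find_indirect_objects are made up here, and the approach assumes that the text "endobj" never appears inside stream data, which a real file does not guarantee.

```python
import re

INDIRECT_OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)


def find_indirect_objects(data):
    """Map (object number, generation) to the raw bytes between obj and endobj."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in INDIRECT_OBJ.finditer(data)
    }


sample = (
    b"1 0 obj\n<< /Author (MizardX) >>\nendobj\n"
    b"2 0 obj\n<< /Type /Catalog >>\nendobj\n"
)
objects = find_indirect_objects(sample)
print(objects[(1, 0)])
```

A more robust reader would locate objects through the cross-reference table instead of scanning for keywords.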

This object can then be referenced as 1 0 R. Indirect dictionaries can also have a stream attached:

1 0 obj
<<
/Length 22
>>
stream
(22 bytes of raw data)
endstream
endobj
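Because the stream data is raw bytes, it should be sliced out by its /Length rather than matched with a pattern. A minimal sketch, assuming /Length is a direct number (it may also be an indirect reference such as 5 0 R, which this hypothetical read_stream helper refuses to guess at):

```python
import re


def read_stream(obj_body):
    """Return the raw stream bytes of an indirect object body, or None."""
    marker = re.search(rb">>\s*stream\r?\n", obj_body)
    if marker is None:
        return None  # the object has no stream attached
    header = obj_body[:marker.start()]
    # \b stops the digits from being split; the lookahead rejects "N G R".
    length = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R\b)", header)
    if length is None:
        raise ValueError("no direct /Length value found")
    start = marker.end()
    return obj_body[start:start + int(length.group(1))]


body = b"<<\n/Length 22\n>>\nstream\n(22 bytes of raw data)\nendstream"
print(read_stream(body))
```

The bytes returned are still whatever the stream's filters produced; decoding them is a separate step.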

A PDF file looks something like this:

%PDF-1.4
%ÿÿÿÿ
1 0 obj
<< /Author (MizardX) >>
endobj
2 0 obj
<<
/Type /Catalog
% more required keys
>>
endobj
%lots of more indirect objects, one after another
xref
0 3
0000000000 65535 f
0000000015 00000 n
0000000054 00000 n
trailer
<<
/Info 1 0 R
/Root 2 0 R
% ... more required keys
>>
startxref
225
%%EOF
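The number after startxref is the byte offset of the xref keyword, and each table entry gives the byte offset of one object. The sketch below reads both. It assumes a classic cross-reference table with a single subsection (not a cross-reference stream), and the names find_startxref and parse_xref are invented for the example.

```python
import re

XREF_HEADER = re.compile(rb"xref\s+(\d+)\s+(\d+)\s+")
XREF_ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])\s*")


def find_startxref(data):
    """Return the byte offset recorded after the last 'startxref' keyword."""
    tail = data[-1024:]  # the keyword sits near the end of the file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("startxref not found")
    return int(matches[-1])


def parse_xref(data, offset):
    """Map object number to (byte offset, generation, in_use)."""
    header = XREF_HEADER.match(data, offset)
    if header is None:
        raise ValueError("no xref table at the given offset")
    first, count = int(header.group(1)), int(header.group(2))
    entries = {}
    pos = header.end()
    for number in range(first, first + count):
        entry = XREF_ENTRY.match(data, pos)
        if entry is None:
            raise ValueError(f"malformed xref entry for object {number}")
        entries[number] = (
            int(entry.group(1)),
            int(entry.group(2)),
            entry.group(3) == b"n",
        )
        pos = entry.end()
    return entries
```

Typical use would be parse_xref(data, find_startxref(data)), followed by parsing each in-use object at its recorded offset.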

The root of the object tree is the trailer dictionary. Every object is referenced directly or indirectly from this dictionary.
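The "directly or indirectly" part amounts to a graph walk starting at the trailer. A rough sketch, assuming a hypothetical objects mapping from (number, generation) to raw object bytes has already been built, and accepting that a plain regex may also pick up text that merely looks like a reference inside strings or stream data:

```python
import re

REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def reachable_objects(trailer, objects):
    """Return the keys of `objects` reachable from the trailer dictionary."""
    seen = set()
    pending = [trailer]
    while pending:
        body = pending.pop()
        for m in REF.finditer(body):
            key = (int(m.group(1)), int(m.group(2)))
            if key in objects and key not in seen:
                seen.add(key)
                pending.append(objects[key])
    return seen


objects = {
    (1, 0): b"<< /Author (MizardX) >>",
    (2, 0): b"<< /Type /Catalog /Pages 3 0 R >>",
    (3, 0): b"<< /Type /Pages /Kids [] /Count 0 >>",
    (4, 0): b"<< /Unused true >>",
}
trailer = b"<< /Info 1 0 R /Root 2 0 R >>"
print(sorted(reachable_objects(trailer, objects)))
```

Object 4 0 is never referenced, so it does not show up in the result.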

There is a lot more complexity hidden inside the streams, but that does not affect the file structure.

The full specification can be found on the Adobe website.