且构网

Detecting whether a PDF file is valid (PDF header)

Updated: 2023-11-29 15:34:34

a. Unfortunately, there is no easy way to determine whether a PDF file is corrupt. Problem files usually have a correct header, so the real cause of the corruption lies elsewhere. A PDF file is effectively a dump of PDF objects. It contains a cross-reference table giving the exact byte offset of each object from the start of the file. So a corrupted file most likely has broken offsets, or some objects are missing.
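To make the offset idea concrete, here is a minimal Python sketch (standard library only) that follows the last startxref keyword to a classic cross-reference table and checks that each in-use entry really points at the start of the object it names. The function names find_xref_offset and broken_xref_entries are made up for this illustration. It assumes a single classic xref table with no incremental updates and no cross-reference streams, so it is a rough sanity check rather than a real validator, and the same logic could be ported to .NET.

```python
import re

def find_xref_offset(data: bytes):
    """Return the byte offset given by the last 'startxref' keyword, or None."""
    # The keyword normally sits near the end of the file, just before %%EOF.
    matches = re.findall(rb"startxref\s+(\d+)", data[-2048:])
    return int(matches[-1]) if matches else None

def broken_xref_entries(data: bytes):
    """List object numbers whose classic xref entry points at the wrong bytes.

    Returns None when no classic 'xref' table is found at the startxref
    offset (damaged offset, xref stream, or malformed table).
    """
    offset = find_xref_offset(data)
    if offset is None:
        return None
    start = re.compile(rb"xref\s+").match(data, offset)
    if start is None:
        return None
    pos = start.end()
    subsection = re.compile(rb"(\d+) (\d+)\s+")           # "first count"
    entry = re.compile(rb"(\d{10}) (\d{5}) ([nf])\s+")    # "offset gen n|f"
    broken = []
    while (head := subsection.match(data, pos)) is not None:
        first, count = int(head[1]), int(head[2])
        pos = head.end()
        for number in range(first, first + count):
            row = entry.match(data, pos)
            if row is None:
                return None  # the table itself is malformed
            pos = row.end()
            if row[3] == b"n":  # in-use entry: must point at "N G obj"
                expected = f"{number} {int(row[2])} obj".encode()
                if not data.startswith(expected, int(row[1])):
                    broken.append(number)
    return broken
```

An empty list means every in-use entry pointed at a matching object header. A non-empty list or None suggests broken offsets, but a file that passes can still be damaged in other ways.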

The best way to detect a corrupted file is to use a specialized PDF library. There are many free and commercial PDF libraries for .NET. You can simply try to load the PDF file with one of them; iTextSharp would be a good choice.

b. According to the PDF Reference, the header of a PDF file usually looks like %PDF-1.X (where X is a digit, currently from 0 to 7), and 99% of PDF files have such a header. However, Acrobat Viewer accepts some other kinds of header, and even a missing header is not a real problem for PDF viewers. So you should not treat a file as corrupted just because it does not contain a header. For example, the header may appear anywhere within the first 1024 bytes of the file, or take the form %!PS-Adobe-N.n PDF-M.m.
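As an illustration of the lenient header rules described above, the following Python sketch looks for either header form anywhere in the first 1024 bytes. The name find_pdf_header and its window parameter are invented for this example, and the patterns assume single-digit version numbers as in the text.

```python
import re

# Usual header form: %PDF-M.m
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")
# Alternative form mentioned in the text: %!PS-Adobe-N.n PDF-M.m
PS_HEADER = re.compile(rb"%!PS-Adobe-\d\.\d PDF-(\d)\.(\d)")

def find_pdf_header(data: bytes, window: int = 1024):
    """Return (offset, 'M.m') for the first header that starts within
    the first `window` bytes, or None if no header is found there."""
    # Read a little past the window so a header starting near its end
    # is not cut in half.
    head = data[:window + 32]
    found = [m for m in (PDF_HEADER.search(head), PS_HEADER.search(head))
             if m is not None and m.start() < window]
    if not found:
        return None
    first = min(found, key=lambda m: m.start())
    return first.start(), f"{first[1].decode()}.{first[2].decode()}"
```

For example, find_pdf_header(b"junk\n%PDF-1.4\n") reports a header at offset 5 with version "1.4". In line with the point above, a None result only means no header was found; it should not be taken as proof that the file is corrupted.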

Just for your information, I am a developer of the Docotic PDF library.