且构网


Identifying the type of files without an extension from their binary data

Updated: 2022-11-28 18:24:29


I have some files without extensions, and I would like to add the correct extension to each of them. To do that, I have written a Python program that reads the data in each file. My question is: how can I identify a file's type without its extension, and without using third-party tools?

I only need to identify PDF, DOC, and text files. No other file types will occur.

My server runs CentOS.

You could read the first few bytes of the file and look for a "magic number". The Wikipedia page on magic numbers suggests that PDF files begin with ASCII %PDF and doc files begin with hex D0 CF 11 E0.
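As a rough illustration of reading those first few bytes, here is a minimal Python sketch. The names `read_header`, `has_magic`, `PDF_MAGIC`, and `DOC_MAGIC` are made up for this example, and the two byte patterns are simply the ones quoted above.

```python
# Magic numbers as quoted in the answer above.
PDF_MAGIC = b"%PDF"                       # ASCII "%PDF"
DOC_MAGIC = bytes.fromhex("D0CF11E0")     # hex D0 CF 11 E0


def read_header(path, size=8):
    """Return the first `size` bytes of the file (fewer if the file is shorter)."""
    # Binary mode matters: text mode would try to decode the bytes.
    with open(path, "rb") as f:
        return f.read(size)


def has_magic(header, magic):
    """Return True if the header bytes start with the given magic number."""
    return header.startswith(magic)
```

Calling `read_header(path).hex()` on a few known files is a quick way to see what their leading bytes actually look like before relying on a signature.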

Identifying text files is going to be pretty tough in the general case, because a lot of standard magic numbers are actually ASCII text at the beginning of a binary file. In your case, if you can guarantee that you won't be getting anything but PDF, DOC, or TXT, you could probably get away with checking for the PDF and DOC magic numbers and then assuming the file is text if it matches neither.
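The approach could be sketched roughly like this in Python, using only the standard library. The function names `guess_extension` and `add_extension` are hypothetical, and the sketch leans entirely on the assumption that only PDF, DOC, and text files will ever show up: anything that matches neither signature is treated as text without further checks.

```python
import os

# (magic number, extension) pairs, using the signatures quoted in the answer.
SIGNATURES = (
    (b"%PDF", ".pdf"),
    (bytes.fromhex("D0CF11E0"), ".doc"),
)


def guess_extension(path):
    """Guess '.pdf', '.doc', or '.txt' from the first bytes of the file."""
    with open(path, "rb") as f:
        header = f.read(8)
    for magic, ext in SIGNATURES:
        if header.startswith(magic):
            return ext
    # Assumption: nothing but PDF, DOC, or TXT is possible, so fall back to text.
    return ".txt"


def add_extension(path):
    """Rename the file so that it carries the guessed extension."""
    new_path = path + guess_extension(path)
    os.rename(path, new_path)
    return new_path
```

If other file types can ever reach the server, the fallback to `.txt` will mislabel them, so the guarantee about the input matters more than the code itself.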