且构网

How to detect whether a file is not a text file in C#

Updated: 2023-02-20 11:27:46

I need to read through many files and search for specific text in them. I want to open only text files, i.e., no image, movie, etc. files. I am looking for a way to identify non-text files. Since I will be using a FileStream and doing a byte search, it seems to me I can stop reading and close a file if a byte whose decimal value is greater than 127 is encountered. Does this seem like a good approach?

There's no foolproof answer for this. If you know that any text files will only ever be ASCII characters (and encoded in ASCII, UTF-8 or something similar) then yes, that will work... although it may not catch all non-text files.

However:

  • It will fail for any text files using non-ASCII text
  • It could still give the wrong answer for a file which is a valid binary file for some format, but happens not to contain any byte values above 127.
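
Both failure modes are easy to reproduce. The following is a minimal sketch of the check described in the question, applied to an in-memory buffer rather than a FileStream; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class AsciiOnlyCheck
{
    // Hypothetical helper: returns true when every byte is in the ASCII range (0-127).
    public static bool AllBytesAreAscii(byte[] data)
    {
        foreach (byte b in data)
        {
            if (b > 127)
            {
                return false; // first non-ASCII byte: the question's approach stops here
            }
        }
        return true;
    }
}
```

With this helper, `AllBytesAreAscii(Encoding.UTF8.GetBytes("caf\u00e9"))` returns false even though the input is ordinary text, because "é" is encoded as the two bytes 0xC3 0xA9. Conversely, `AllBytesAreAscii(new byte[] { 0x64, 0x00, 0x00, 0x00 })` returns true even though those bytes are the little-endian 32-bit integer 100 rather than text.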

Does the sequence of bytes { 34, 87, 23, 10 } represent text or binary data? There's simply no way of knowing for sure. Anything you do will be heuristic.
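
As an illustration of what such a heuristic might look like, here is one possible sketch, not a definitive test. It assumes text files are UTF-8 (which includes plain ASCII), samples only the start of the file, and treats invalid UTF-8 or control characters other than tab, LF and CR as signs of binary data. The names `TextFileHeuristic`, `LooksLikeText`, `LooksLikeTextFile` and the 4096-byte sample size are arbitrary choices for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class TextFileHeuristic
{
    // Hypothetical helper: guesses whether the first `count` bytes of `sample` are UTF-8 text.
    public static bool LooksLikeText(byte[] sample, int count)
    {
        // Strict UTF-8: invalid byte sequences throw instead of being replaced.
        Decoder decoder = new UTF8Encoding(false, true).GetDecoder();
        char[] chars = new char[count]; // UTF-8 never decodes to more chars than bytes
        int charCount;
        try
        {
            // flush: false tolerates a multi-byte sequence cut off at the end of the sample.
            charCount = decoder.GetChars(sample, 0, count, chars, 0, false);
        }
        catch (DecoderFallbackException)
        {
            return false; // not valid UTF-8
        }

        for (int i = 0; i < charCount; i++)
        {
            char c = chars[i];
            // Assumption: control characters other than tab, LF and CR do not occur in text.
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
            {
                return false;
            }
        }
        return true; // an empty sample is also treated as text
    }

    // Hypothetical helper: applies the guess to the first `sampleSize` bytes of a file.
    public static bool LooksLikeTextFile(string path, int sampleSize = 4096)
    {
        byte[] buffer = new byte[sampleSize];
        using (FileStream stream = File.OpenRead(path))
        {
            // A single Read may return fewer bytes than requested; that is fine for a sample.
            int read = stream.Read(buffer, 0, buffer.Length);
            return LooksLikeText(buffer, read);
        }
    }
}
```

Under these assumptions the sequence { 34, 87, 23, 10 } is classified as non-text, because 23 is a control character, but that is still only a guess. The sketch also rejects UTF-16 text files (their NUL bytes look like control characters) and accepts any binary file whose sampled bytes happen to be printable.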