


Updated: 2023-11-30 17:45:52



I have this piece of code:

foreach (string pdfFile in Directory.EnumerateFiles(selectedFolderMulti_txt.Text, "*.pdf", SearchOption.AllDirectories))
{
    //filePath = pdfFile.FullName;
    //string abc = Path.GetFileName(pdfFile);
    try
    {
        //pdfReader = new iTextSharp.text.pdf.PdfReader(filePath);
        pdfReader = new iTextSharp.text.pdf.PdfReader(pdfFile);
        rownum = pdfListMulti_gridview.Rows.Add();
        pdfListMulti_gridview.Rows[rownum].Cells[0].Value = counter++;
        //pdfListMulti_gridview.Rows[rownum].Cells[1].Value = pdfFile.Name;
        pdfListMulti_gridview.Rows[rownum].Cells[1].Value = System.IO.Path.GetFileName(pdfFile);
        pdfListMulti_gridview.Rows[rownum].Cells[2].Value = pdfReader.NumberOfPages;
        //pdfListMulti_gridview.Rows[rownum].Cells[3].Value = filePath;
        pdfListMulti_gridview.Rows[rownum].Cells[3].Value = pdfFile;
        //totalpages += pdfReader.NumberOfPages;
    }
    catch
    {
        // Report files that could not be opened and continue with the next one.
        //MessageBox.Show("There was an error while opening '" + pdfFile.Name + "'", "Error!", MessageBoxButtons.OK, MessageBoxIcon.Error);
        MessageBox.Show("There was an error while opening '" + System.IO.Path.GetFileName(pdfFile) + "'", "Error!", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }
}

The problem is that when I pointed it today at a folder containing about 4,000 PDF files, it took about 20 minutes to read all of them and show me the results. That made me wonder what this code will do when I give it a folder containing more than 20,000 files.

If I comment out this line:

pdfListMulti_gridview.Rows[rownum].Cells[2].Value = pdfReader.NumberOfPages;

then it seems as if the entire processing burden is removed from the code.

So what I'd like from you is a suggestion for making my approach more efficient, so that processing all the files takes less time. Or is there an alternative?

Definitely do what @ChrisBint said; that will get you past Windows' slowness with folders that contain many files.

But for even more speed, use the overload of PdfReader that takes a RandomAccessFileOrArray object instead. In all of my testing this object has been much faster than regular streams. Its constructor has a couple of overloads, but the one you should mainly concern yourself with is RandomAccessFileOrArray(string filename, bool forceRead). The second parameter controls whether the entire file is loaded into memory (if I'm understanding the documentation correctly). For very large files this might be a performance hit, but on modern machines it shouldn't matter much, so I recommend passing true. If you pass false, the disk has to be hit several times as the parsing "cursor" walks through the file.

So with all of that you can do this in a very tight loop. For me, 4,000 files containing a total of over 42,000 pages takes about 2 seconds to run.

        var files = Directory.EnumerateFiles(workingFolder, "*.pdf");
        int totalPageCount = 0;
        foreach (string f in files)
            totalPageCount += new PdfReader(new RandomAccessFileOrArray(f, true), null).NumberOfPages;
        MessageBox.Show(String.Format("Total Page Count : {0:N0}", totalPageCount));
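
To get back the per-file listing from the question, the same fast constructor could be wrapped in a loop that records one result per file and collects the failures, instead of showing a message box for each bad file. The following is only a sketch under assumptions: PageCounter, CountPages, and the countPages delegate are hypothetical names, not part of iTextSharp. The delegate is there so the loop itself can be tried out without any real PDF files.

```csharp
using System;
using System.Collections.Generic;

public static class PageCounter
{
    // Hypothetical helper: applies countPages to every path and records the
    // result per file. A file whose count throws is added to 'failed' and the
    // loop moves on to the next file.
    public static Dictionary<string, int> CountPages(
        IEnumerable<string> files,
        Func<string, int> countPages,
        List<string> failed)
    {
        var pagesByFile = new Dictionary<string, int>();
        foreach (string file in files)
        {
            try
            {
                pagesByFile[file] = countPages(file);
            }
            catch (Exception)
            {
                // Assumption: any exception means the file could not be read.
                failed.Add(file);
            }
        }
        return pagesByFile;
    }
}
```

With iTextSharp, the delegate would presumably be the same expression used in the loop above:

        var failed = new List<string>();
        var pages = PageCounter.CountPages(
            Directory.EnumerateFiles(workingFolder, "*.pdf"),
            f => new PdfReader(new RandomAccessFileOrArray(f, true), null).NumberOfPages,
            failed);

The grid rows can then be filled from pages after the loop, and the names in failed reported once at the end.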