Updated: 2023-11-30 17:45:52
I have this piece of code:
foreach(string pdfFile in Directory.EnumerateFiles(selectedFolderMulti_txt.Text,"*.pdf",SearchOption.AllDirectories))
{
//filePath = pdfFile.FullName;
//string abc = Path.GetFileName(pdfFile);
try
{
//pdfReader = new iTextSharp.text.pdf.PdfReader(filePath);
pdfReader = new iTextSharp.text.pdf.PdfReader(pdfFile);
rownum = pdfListMulti_gridview.Rows.Add();
pdfListMulti_gridview.Rows[rownum].Cells[0].Value = counter++;
//pdfListMulti_gridview.Rows[rownum].Cells[1].Value = pdfFile.Name;
pdfListMulti_gridview.Rows[rownum].Cells[1].Value = System.IO.Path.GetFileName(pdfFile);
pdfListMulti_gridview.Rows[rownum].Cells[2].Value = pdfReader.NumberOfPages;
//pdfListMulti_gridview.Rows[rownum].Cells[3].Value = filePath;
pdfListMulti_gridview.Rows[rownum].Cells[3].Value = pdfFile;
//totalpages += pdfReader.NumberOfPages;
}
catch
{
//MessageBox.Show("There was an error while opening '" + pdfFile.Name + "'", "Error!", MessageBoxButtons.OK, MessageBoxIcon.Error);
MessageBox.Show("There was an error while opening '" + System.IO.Path.GetFileName(pdfFile) + "'", "Error!", MessageBoxButtons.OK, MessageBoxIcon.Error);
}
}
The problem is that today, when I pointed it at a folder containing about 4,000 PDF files, it took about 20 minutes to read all the files and show the results. That made me wonder what this code will do when I feed it a folder with more than 20,000 files.
If I comment out this line:
pdfListMulti_gridview.Rows[rownum].Cells[2].Value = pdfReader.NumberOfPages;
then it seems as if all of the processing burden is removed from the code.
So what I want from you guys is a suggestion for making my approach more efficient, so that processing all the files takes less time. Or is there an alternative?
Definitely do what @ChrisBint said; that will get past Windows' slowness with folders containing many files.
But to get even more speed, make sure to use the overload of PdfReader that takes a RandomAccessFileOrArray object instead. This object was way faster than regular streams in all of my testing. The constructor has a couple of overloads, but you should mainly concern yourself with RandomAccessFileOrArray(string filename, bool forceRead). The second parameter controls whether or not the entire file is loaded into memory (if I'm understanding the documentation correctly). For very large files this might be a performance hit, but on modern machines it shouldn't matter much, so I recommend that you pass true. If you pass false, the disk will need to be hit several times as the parsing "cursor" walks through the file.
So with all of that, you can do this in a very tight loop. For me, 4,000 files containing a total of over 42,000 pages take about 2 seconds to run.
var files = Directory.EnumerateFiles(workingFolder, "*.pdf");
int totalPageCount = 0;
foreach (string f in files)
{
    // forceRead: true loads the whole file into memory up front;
    // the second PdfReader argument is the owner password (null = none)
    totalPageCount += new PdfReader(new RandomAccessFileOrArray(f, true), null).NumberOfPages;
}
MessageBox.Show(String.Format("Total Page Count : {0:N0}", totalPageCount));
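As a side note, each PdfReader holds its file data until it is closed, so when looping over thousands of files it may be worth releasing each reader explicitly so memory does not accumulate across the whole run. A minimal sketch of the same loop with explicit cleanup, assuming iTextSharp 5.x where PdfReader exposes a Close() method:

```csharp
using System.IO;
using System.Windows.Forms;
using iTextSharp.text.pdf;

var files = Directory.EnumerateFiles(workingFolder, "*.pdf");
int totalPageCount = 0;
foreach (string f in files)
{
    // forceRead: true loads the whole file into memory up front
    PdfReader reader = new PdfReader(new RandomAccessFileOrArray(f, true), null);
    try
    {
        totalPageCount += reader.NumberOfPages;
    }
    finally
    {
        // release the in-memory copy of the file before moving on
        reader.Close();
    }
}
MessageBox.Show(String.Format("Total Page Count : {0:N0}", totalPageCount));
```

The try/finally also means a single corrupt PDF that throws mid-read won't leak the reader, though you may still want a catch block (as in your original code) to report which file failed.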