Updated: 2023-12-05 15:45:28
I have many PDF documents on my system, and I sometimes notice that some of them are image-based and cannot be edited. In that case, I run OCR for better search in Foxit PhantomPDF, which can OCR multiple files at once. I would like to find all of my PDF documents that are image-based.
I do not understand how a PDF reader can recognize that a document is image-based rather than textual. There must be some fields that these readers access, and those should be accessible from the terminal too. An answer in the thread "Check if a PDF file is a scanned one" gives some open-ended suggestions:
Your best bet might be to check to see if it has text and also see if it contains a large pagesized image or lots of tiled images which cover the page. If you also check the metadata this should cover most options.
I would like to understand better how to do this effectively. If such a metadata field existed, it would be easy, but I have not found one. I think the most promising approach is to check whether each page contains a page-sized image with an OCR text layer for search, because that is effective and already used by some PDF readers. However, I do not know how to do it.
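The quoted suggestion (check for text, then check for images covering the pages) could be approximated from the terminal roughly as follows. This is only a sketch under assumptions: it assumes the poppler-utils tools pdftotext, pdfimages and pdfinfo are installed, it only checks that every page carries at least one image rather than measuring whether the image is page-sized, and the function names are made up for illustration.

```shell
# Heuristic sketch: a PDF "looks scanned" when no text can be extracted
# and every page carries at least one embedded image.
# Assumes poppler-utils; all function names here are hypothetical.

# Succeeds when pdftotext extracts at least one non-whitespace character.
has_text() {
    pdftotext "$1" - 2> /dev/null | grep -q '[^[:space:]]'
}

# Prints the number of distinct pages that contain at least one image.
# `pdfimages -list` prints two header lines, then one row per image,
# with the page number in the first column.
pages_with_images() {
    pdfimages -list "$1" 2> /dev/null |
        awk 'NR > 2 && NF { seen[$1] = 1 }
             END { n = 0; for (p in seen) n++; print n }'
}

# Prints the page count reported by pdfinfo.
page_count() {
    pdfinfo "$1" 2> /dev/null | awk '/^Pages:/ { print $2 }'
}

# Succeeds when the file has no text and an image on every page.
looks_scanned() {
    local total
    has_text "$1" && return 1
    total=$(page_count "$1")
    [ -n "$total" ] && [ "$total" -gt 0 ] &&
        [ "$(pages_with_images "$1")" -ge "$total" ]
}
```

For example, `looks_scanned some.pdf && echo "some.pdf"` would print the name only for files that pass both checks. A scan that already has an OCR text layer yields text and is therefore not reported, which matches the goal of finding files that still need OCR.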
In the Hough transform, there are specifically chosen parameters in a hypercube of the parameter space. Its complexity is $O(A^{m-2})$, where $A$ is the size of the image space and $m$ is the number of parameters, so the problem becomes difficult with more than three parameters. Foxit Reader most probably uses three parameters in its implementation. Edges are easy to detect reliably, which helps efficiency, and edge detection must be done before the Hough transform. Corrupted pages are simply ignored. The other two parameters are still unknown, but I think they must be nodes and some intersections. How these intersections are computed is unknown, and so is the exact formulation of the problem.
The command works in Debian 8.5, but I could not get it to work initially in Ubuntu 16.04:
masi@masi:~$ find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'export file="{}"; if [ $(pdffonts "$file" 2> /dev/null | wc -l) -lt 3 ]; then echo "$file"; fi'
./Downloads/596P.pdf
./Downloads/20160406115732.pdf
^C
OS: Debian 8.5 64 bit
Linux kernel: 4.6 from backports
Hardware: Asus Zenbook UX303UA
Late to the party, but here is a simple solution. It assumes that a PDF file that already contains fonts is not purely image-based:
# The file name is passed to bash as $1 instead of being spliced into the
# script text, so names containing quotes or $ are handled safely.
find ./ -name "*.pdf" -print0 | xargs -0 -I {} \
    bash -c 'if [ $(pdffonts "$1" 2> /dev/null | \
    wc -l) -lt 3 ]; then echo "$1"; fi' _ {}
As a one-liner:
find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'if [ $(pdffonts "$1" 2> /dev/null | wc -l) -lt 3 ]; then echo "$1"; fi' _ {}
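Explanation:
pdffonts file.pdf
prints a two-line header (column names and a dashed separator) followed by one line per font, so it shows more than 2 lines if the PDF contains text.
The command therefore outputs the filenames of all PDF files that list no fonts, i.e. that don't contain text.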
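The same check can be wrapped in small bash functions, which may be easier to reuse in a script than the one-liner. This is a sketch, not part of the original answer: it assumes pdffonts from poppler-utils, and count_fonts, is_image_only and list_image_only are illustrative names.

```shell
# Sketch of the pdffonts check as reusable bash functions.
# Assumes poppler-utils; the function names are hypothetical.

# Prints the number of fonts pdffonts lists for a file.
# The first two output lines are the column header and the dashed
# separator, so only non-empty lines after them are counted.
count_fonts() {
    pdffonts "$1" 2> /dev/null |
        awk 'NR > 2 && NF { n++ } END { print n + 0 }'
}

# Succeeds when the file lists no fonts at all.
is_image_only() {
    [ "$(count_fonts "$1")" -eq 0 ]
}

# Prints every font-less PDF below the given directory (default: .).
# Uses NUL-separated names so unusual file names survive (bash only).
list_image_only() {
    find "${1:-.}" -type f -name '*.pdf' -print0 |
        while IFS= read -r -d '' f; do
            if is_image_only "$f"; then
                printf '%s\n' "$f"
            fi
        done
}
```

Calling `list_image_only ~/Documents` would then print one candidate per line. As with the one-liner, a file that pdffonts cannot read at all also produces zero font rows, so it is reported too.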
My OCR project, which has the same feature, is on GitHub: deajan/pmOCR.