且构网


Remove all text from a PDF

Updated: 2023-12-05 15:49:16

Adapting the answer to "How to find and replace text in an existing PDF file with PDFTK (or other command line application)", I was able to delete the rendered text using pdftk and sed. This is surely not fully general, but it was a quick hack that met my needs.

I ended up with:

pdftk my_input.pdf output - uncompress | sed -e 's/\[.*\]TJ/()Tj/' -e 's/(.*)Tj/()Tj/' | pdftk - output my_output.pdf compress

This converts the streams to text form, where I look for uses of the (blah)Tj and [blah]TJ text-showing operators and blank out their operands, then converts the result back to compressed binary. The edited stream on its own is no longer a valid PDF, even though the unedited input was, so pdftk does some magic in the final step to fix up the output and make it valid again. This will not work with extended characters without some new patterns.
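To see what the two sed expressions do without touching a real file, here is a minimal sketch that runs them over a few lines of stream text. The function name strip_text_ops and the sample fragment are made up for illustration; real pdftk output will look different in detail. This variant writes ()Tj for both replacements, since TJ expects an array operand rather than a string.

```shell
#!/bin/sh
# Blank out the operands of text-showing operators in uncompressed
# PDF content-stream text. Reads stdin, writes the edited text to stdout.
# strip_text_ops is a hypothetical helper name, not part of pdftk or sed.
strip_text_ops() {
    # 1st expression: [ ... ]TJ (array form)  -> ()Tj (empty string)
    # 2nd expression: ( ... )Tj (string form) -> ()Tj (empty string)
    # In sed's basic regex syntax, unescaped parentheses are literal characters.
    sed -e 's/\[.*\]TJ/()Tj/' -e 's/(.*)Tj/()Tj/'
}

# Hypothetical sample of a text object as it might appear after
# "pdftk ... output - uncompress".
sample='BT
/F1 12 Tf
72 720 Td
(Hello world)Tj
[(Hel)-20(lo)]TJ
ET'

printf '%s\n' "$sample" | strip_text_ops
```

On this sample the two text-showing lines both come out as ()Tj, and the BT, Tf, Td and ET lines pass through unchanged. Note that .* is greedy: if several operators share one line, everything between the first opening and the last closing delimiter is dropped, which is one of the ways this hack is not fully general.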