且构网

Problem when trying to remove inline images from a PDF with iTextSharp

Updated: 2023-12-05 15:44:52

As you've found out and as mkl and I pointed out in the comments, it's not a good idea to manipulate a content stream without taking a look at every operator in the stream. You really need to parse the syntax and interpret every single operator and every single operand.

Please take a look at the OCG removal functionality in the com.itextpdf.text.pdf.ocg package, which ships in the extra jar that is provided with iText.

In the OCGParser class, we define all possible operators:

protected void populateOperators() {
    if (operators != null)
        return;
    operators = new HashMap<String, PdfOperator>();
    operators.put(DEFAULTOPERATOR, new CopyContentOperator());
    PathConstructionOrPaintingOperator opConstructionPainting = new PathConstructionOrPaintingOperator();
    operators.put("m", opConstructionPainting);
    operators.put("l", opConstructionPainting);
    operators.put("c", opConstructionPainting);
    operators.put("v", opConstructionPainting);
    operators.put("y", opConstructionPainting);
    operators.put("h", opConstructionPainting);
    operators.put("re", opConstructionPainting);
    operators.put("S", opConstructionPainting);
    operators.put("s", opConstructionPainting);
    operators.put("f", opConstructionPainting);
    operators.put("F", opConstructionPainting);
    operators.put("f*", opConstructionPainting);
    operators.put("B", opConstructionPainting);
    operators.put("B*", opConstructionPainting);
    operators.put("b", opConstructionPainting);
    operators.put("b*", opConstructionPainting);
    operators.put("n", opConstructionPainting);
    operators.put("W", opConstructionPainting);
    operators.put("W*", opConstructionPainting);
    GraphicsOperator graphics = new GraphicsOperator();
    operators.put("q", graphics);
    operators.put("Q", graphics);
    operators.put("w", graphics);
    operators.put("J", graphics);
    operators.put("j", graphics);
    operators.put("M", graphics);
    operators.put("d", graphics);
    operators.put("ri", graphics);
    operators.put("i", graphics);
    operators.put("gs", graphics);
    operators.put("cm", graphics);
    operators.put("g", graphics);
    operators.put("G", graphics);
    operators.put("rg", graphics);
    operators.put("RG", graphics);
    operators.put("k", graphics);
    operators.put("K", graphics);
    operators.put("cs", graphics);
    operators.put("CS", graphics);
    operators.put("sc", graphics);
    operators.put("SC", graphics);
    operators.put("scn", graphics);
    operators.put("SCN", graphics);
    operators.put("sh", graphics);
    XObjectOperator xObject = new XObjectOperator();
    operators.put("Do", xObject);
    InlineImageOperator inlineImage = new InlineImageOperator();
    operators.put("BI", inlineImage);
    operators.put("EI", inlineImage);
    TextOperator text = new TextOperator();
    operators.put("BT", text);
    operators.put("ID", text);
    operators.put("ET", text);
    operators.put("Tc", text);
    operators.put("Tw", text);
    operators.put("Tz", text);
    operators.put("TL", text);
    operators.put("Tf", text);
    operators.put("Tr", text);
    operators.put("Ts", text);
    operators.put("Td", text);
    operators.put("TD", text);
    operators.put("Tm", text);
    operators.put("T*", text);
    operators.put("Tj", text);
    operators.put("'", text);
    operators.put("\"", text);
    operators.put("TJ", text);
    MarkedContentOperator markedContent = new MarkedContentOperator();
    operators.put("BMC", markedContent);
    operators.put("BDC", markedContent);
    operators.put("EMC", markedContent);
}

The parse() method will look at all the content streams, including the content streams of Form XObjects (which you are overlooking if I understand your code correctly).

In the process() method, we make copies of every operator and all its operands, unless some condition tells us that part of the syntax needs to be removed.
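To make the "copy everything unless a condition says otherwise" idea concrete, here is a minimal sketch that does not use iText at all. It is not the OCGParser source: the `CopySketch` class, the `Handler` interface, and the whitespace tokenizer are hypothetical stand-ins, and the operator test is deliberately crude. A real content stream needs a proper tokenizer, because strings, arrays, and binary data can contain whitespace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CopySketch {
    /** Hypothetical stand-in for an operator handler; not iText's PdfOperator. */
    interface Handler {
        void process(String operator, List<String> operands, StringBuilder out);
    }

    /** Key under which the fallback handler is registered (name is an assumption). */
    static final String DEFAULT = "DefaultOperator";

    /** Copies the operands and the operator to the output unchanged. */
    static final Handler COPY = (operator, operands, out) -> {
        for (String operand : operands) {
            out.append(operand).append(' ');
        }
        out.append(operator).append('\n');
    };

    /** Removes the operator and its operands by writing nothing. */
    static final Handler DROP = (operator, operands, out) -> { };

    /** Toy rule: a token is an operator if it is neither a number nor a /Name. */
    static boolean isOperator(String token) {
        return !token.startsWith("/") && !token.matches("[-+]?\\d*\\.?\\d+");
    }

    /** Collects operands until an operator arrives, then dispatches to its handler. */
    static String process(String content, Map<String, Handler> handlers) {
        StringBuilder out = new StringBuilder();
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (isOperator(token)) {
                // Operators without a dedicated handler fall back to the default one.
                Handler handler = handlers.getOrDefault(token, handlers.get(DEFAULT));
                handler.process(token, operands, out);
                operands.clear();
            } else {
                operands.add(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Handler> handlers = new HashMap<>();
        handlers.put(DEFAULT, COPY);
        handlers.put("Do", DROP); // example condition: drop XObject invocations
        System.out.print(process("q 1 0 0 1 10 20 cm /Im1 Do Q", handlers));
    }
}
```

The point of the design is that removal is the exception: every operator is rewritten together with its operands, so the remaining syntax stays valid.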

You should adapt this code so that all operators are copied, except those that involve inline images. Your approach was a brute force approach that was bound to damage more PDFs than it would ever fix.
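As a rough illustration of that adaptation, the sketch below copies every token of a content stream except the span from BI up to and including the matching EI. It is a toy under stated assumptions, not iText code: `InlineImageFilterSketch` and `removeInlineImages` are hypothetical names, and the content is treated as whitespace-separated text. Real inline image data is binary and may itself contain whitespace or the bytes "EI", so production code has to parse the image dictionary and data properly instead of scanning for tokens.

```java
import java.util.ArrayList;
import java.util.List;

class InlineImageFilterSketch {
    /** Copies all tokens except those between BI and EI (both inclusive). */
    static String removeInlineImages(String content) {
        List<String> kept = new ArrayList<>();
        boolean insideInlineImage = false;
        for (String token : content.trim().split("\\s+")) {
            if (insideInlineImage) {
                // Discard the image dictionary, ID, the data, and finally EI itself.
                if (token.equals("EI")) {
                    insideInlineImage = false;
                }
                continue;
            }
            if (token.equals("BI")) {
                insideInlineImage = true;
                continue;
            }
            kept.add(token);
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String content = "q 10 0 0 10 5 5 cm BI /W 1 /H 1 /BPC 8 /CS /G ID ff EI Q "
                + "BT /F1 12 Tf (Hi) Tj ET";
        System.out.println(removeInlineImages(content));
    }
}
```

The surrounding `q ... cm ... Q` is left in place on purpose: it is harmless once the image is gone, and keeping it preserves the balance of the graphics state stack.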