Getting marked content using the MCID

Updated: 2023-12-05 21:41:04

Based on the tags you added to the question, I see that you are using iText 7. iText 7 has a class named TaggedPdfReaderTool. This class can be used to convert Tagged PDF files to XML:

// "document" is a PdfDocument that was opened on the tagged PDF.
try (FileOutputStream outXml = new FileOutputStream("pdf_content.xml")) {
    TaggedPdfReaderTool tool = new TaggedPdfReaderTool(document);
    tool.setRootTag("root");
    tool.convertToXml(outXml);
}

The XML will have the same structure as the "tag structure" you were already able to extract. The content inside the XML tags will correspond to the content that is marked as "part of a tag" in the PDF content stream.

Important message to other readers: the screenshot in the question clearly shows that the PDF is tagged. If you try this code snippet on a PDF that isn't tagged, you won't be able to convert the content to XML.

Update: a lower-level approach

You can also examine all the parts of the structure tree like this: process(document.getStructTreeRoot());

The process() method looks like this:

// Walks the structure tree depth-first, printing each element's role and class.
public static void process(IPdfStructElem elem) {
    if (elem == null) return;
    System.out.println(elem.getRole());
    System.out.println(elem.getClass().getName());
    if (elem instanceof PdfStructElem) {
        processStructElem((PdfStructElem) elem);
    }
    if (elem.getKids() == null) return;
    for (IPdfStructElem structElem : elem.getKids()) {
        process(structElem);
    }
}

// Prints the content stream of the page (/Pg) that a structure element refers to.
public static void processStructElem(PdfStructElem elem) {
    PdfDictionary page = elem.getPdfObject().getAsDictionary(PdfName.Pg);
    if (page == null) return;
    // Case 1: /Contents is a single stream; dump its decoded bytes.
    PdfStream contents = page.getAsStream(PdfName.Contents);
    if (contents != null) {
        System.out.println(new String(contents.getBytes()));
    }
    // Case 2: /Contents is an array of streams; only the array object is printed (null otherwise).
    PdfArray array = page.getAsArray(PdfName.Contents);
    System.out.println(array);
}

Note that the /Contents of a page can refer to a single stream, or to an array of streams. In this short snippet, I ignored all /Contents stored in an array of streams.
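For the array case that the snippet skips, the PDF specification treats the streams in a /Contents array as if they were concatenated in order. The following is a minimal sketch of that idea in plain Java, assuming the individual streams have already been decoded to byte arrays. The class name ContentStreamJoiner and the method join are hypothetical, not part of iText; iText 7's PdfPage.getContentBytes() appears to serve a similar purpose, but that is not verified here.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Hypothetical helper, not an iText class.
final class ContentStreamJoiner {

    /** Joins already-decoded content streams in array order into one byte array. */
    static byte[] join(List<byte[]> decodedStreams) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : decodedStreams) {
            out.write(part, 0, part.length);
            // Add whitespace so the last token of one stream
            // cannot fuse with the first token of the next one.
            out.write('\n');
        }
        return out.toByteArray();
    }
}
```

The joined bytes can then be handled the same way as the bytes of a single /Contents stream.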

This is an example of the content that was revealed when executing process() on a tagged PDF we use for tests:

EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
36 432.34 184.23 27.98 re
f
Q
EMC
/Span <</MCID 13>> BDC
q
BT
/F2 12 Tf
42 442.65 Td
1 1 1 rg
(The Library)Tj
ET
Q
EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
36 399.11 184.23 27.98 re
f
Q
EMC
/Span <</MCID 14>> BDC
q
BT
/F2 12 Tf
42 409.42 Td
1 1 1 rg
(The Company)Tj
ET
Q
EMC
/Span <</MCID 15>> BDC
q
BT
/F1 20 Tf
227.73 472.71 Td
(The Library)Tj
ET
Q
EMC
/Span <</MCID 16>> BDC
q
BT
/F2 12 Tf
229.23 440.45 Td
(iText is a software developer toolkit that allows users to integrate PDF)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 17>> BDC
q
BT
/F2 12 Tf
229.23 424.46 Td
(functionalities within their applications, processes or products.)Tj
ET
Q
EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
605.03 262.75 191.73 235.31 re
f
Q
EMC
/Span <</MCID 18>> BDC
q
BT
/F1 16 Tf
676.45 482.5 Td
0.97647 0.76078 0.15294 rg
(What?)Tj
ET
Q
EMC
/Span <</MCID 19>> BDC
q
BT
/F2 12 Tf
607.94 453.08 Td
1 1 1 rg
(iText is a software developer toolkit)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 20>> BDC
q
BT
/F2 12 Tf
611.61 437.09 Td
1 1 1 rg
(that allows users to integrate PDF)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 21>> BDC
q
BT
/F2 12 Tf
634.95 421.11 Td
1 1 1 rg
(functionalities within their)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 22>> BDC
q
BT
/F2 12 Tf
669.96 405.12 Td
1 1 1 rg
(applications)Tj
ET
Q
EMC
/Span <</MCID 23>> BDC
q
BT
/F1 16 Tf
679.12 381.5 Td
0.97647 0.76078 0.15294 rg
(How?)Tj
ET
Q
EMC
/Span <</MCID 24>> BDC
q
BT
/F2 12 Tf
613.94 352.08 Td
1 1 1 rg
(By providing you with the tools to)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 25>> BDC
q
BT
/F2 12 Tf
607.59 336.09 Td
1 1 1 rg
(create and manipulate a pdf in your)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 26>> BDC
q
BT
/F2 12 Tf
668.96 320.11 Td
1 1 1 rg
(source code)Tj
ET
Q
EMC
/Span <</MCID 27>> BDC
q
BT
/F1 16 Tf
672.44 296.49 Td
0.97647 0.76078 0.15294 rg
(Really?)Tj
ET
Q
EMC
/Span <</MCID 28>> BDC
q
BT
/F2 12 Tf
673.64 267.06 Td
1 1 1 rg
(Yes really!)Tj
ET
Q
EMC

Everything that is not between BMC/EMC or BDC/EMC operators is not tagged. You are looking for the content that is marked with an MCID.
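To illustrate what "content marked with an MCID" means in a dump like the one above, here is a rough sketch in plain Java. It assumes a line-oriented dump with one operator per line and only simple literal strings shown with Tj; it ignores TJ arrays, hex strings, escape sequences and font encodings, so it is not a general PDF parser. The class name McidTextSketch and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an iText class.
final class McidTextSketch {
    private static final int NO_MCID = -1;
    // Matches e.g. "/Span <</MCID 13>> BDC" and captures the MCID number.
    private static final Pattern BDC_WITH_MCID =
            Pattern.compile("^/\\S+\\s*<<.*?/MCID\\s+(\\d+).*?>>\\s*BDC$");
    // Matches a simple literal string shown with Tj, e.g. "(The Library)Tj".
    private static final Pattern SHOW_TEXT = Pattern.compile("^\\((.*)\\)\\s*Tj$");

    /** Collects the text shown inside each MCID-marked sequence, keyed by MCID. */
    static Map<Integer, String> textByMcid(String contentStream) {
        Map<Integer, StringBuilder> collected = new LinkedHashMap<>();
        // One entry per open marked-content sequence; NO_MCID for e.g. "/Artifact BMC".
        Deque<Integer> open = new ArrayDeque<>();
        for (String raw : contentStream.split("\\R")) {
            String line = raw.trim();
            Matcher bdc = BDC_WITH_MCID.matcher(line);
            if (bdc.matches()) {
                open.push(Integer.parseInt(bdc.group(1)));
            } else if (line.endsWith(" BMC") || line.endsWith(" BDC")) {
                open.push(NO_MCID);
            } else if (line.equals("EMC")) {
                if (!open.isEmpty()) open.pop(); // tolerate a dump that starts mid-sequence
            } else {
                Matcher tj = SHOW_TEXT.matcher(line);
                int mcid = nearestMcid(open);
                if (tj.matches() && mcid != NO_MCID) {
                    collected.computeIfAbsent(mcid, k -> new StringBuilder()).append(tj.group(1));
                }
            }
        }
        Map<Integer, String> result = new LinkedHashMap<>();
        collected.forEach((mcid, sb) -> result.put(mcid, sb.toString()));
        return result;
    }

    /** Returns the MCID of the innermost enclosing sequence that has one, or NO_MCID. */
    private static int nearestMcid(Deque<Integer> open) {
        for (int mcid : open) { // ArrayDeque iterates from the most recently pushed entry
            if (mcid != NO_MCID) return mcid;
        }
        return NO_MCID;
    }

    public static void main(String[] args) {
        String dump = String.join("\n",
                "/Artifact BMC", "f", "EMC",
                "/Span <</MCID 13>> BDC", "BT", "(The Library)Tj", "ET", "EMC");
        System.out.println(textByMcid(dump));
    }
}
```

Text drawn inside an /Artifact sequence, or outside any marked-content sequence, is skipped because no enclosing sequence carries an MCID.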

In a comment, I explain that it's better to use a different approach. It is better to parse the content streams of every page (only once) and map all objects you encounter to the elements in the structure tree.

With your approach, you have to parse the content stream of a page over and over again for every structure element. That requires much more processing.
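The parse-once idea can be sketched as a small per-page cache in plain Java. This is only an illustration under stated assumptions: PageContentIndex is a hypothetical class, and the pageParser function stands in for whatever code actually parses a page's content stream into a map from MCID to content. Since an MCID is only unique within its own content stream, lookups are keyed by page number and MCID together.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical helper, not an iText class.
final class PageContentIndex {
    // Assumed to parse one page and return its content keyed by MCID.
    private final IntFunction<Map<Integer, String>> pageParser;
    // Cache of parsed pages: page number -> (MCID -> content).
    private final Map<Integer, Map<Integer, String>> byPage = new HashMap<>();

    PageContentIndex(IntFunction<Map<Integer, String>> pageParser) {
        this.pageParser = pageParser;
    }

    /** Returns the content for (page, MCID); the page is parsed on first use only. */
    String lookup(int pageNumber, int mcid) {
        Map<Integer, String> page =
                byPage.computeIfAbsent(pageNumber, p -> pageParser.apply(p));
        return page.get(mcid);
    }
}
```

While walking the structure tree, each element that refers to a page and an MCID would call lookup, so a page with many structure elements is still parsed a single time.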