PDF: Find out whether text is underlined or in a table cell

Updated: 2023-12-04 23:21:16

I have been playing around with PDFBox and its PDFTextStripperByArea class.

I was able to determine whether text is bold or italic, but I'm unable to get the underline information.

As far as I understand it, underlining in PDF is done by drawing lines. So in theory I should be able to get some sort of information about lines drawn near the text. Given this information, I could then find out whether the text is underlined or sits in a table.

Here is my code so far:

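List<TextPosition> textPos = charactersByArticle.get(index);

for (TextPosition t : textPos)
{
    // Look the descriptor up once per character; it can be null for some fonts.
    // Needs: import org.apache.pdfbox.pdmodel.font.PDFontDescriptor;
    PDFontDescriptor descriptor = t.getFont().getFontDescriptor();
    if (descriptor != null)
    {
        // BOLD_WEIGHT is a threshold constant that is not shown here.
        if (descriptor.getFontWeight() > BOLD_WEIGHT || descriptor.isForceBold())
        {
            isBold = true;
        }

        if (descriptor.isItalic())
        {
            isItalic = true;
        }
    }
}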

I have tried to play around with the PDGraphicsState object, which is processed in the processEncodedText method of the PDFStreamEngine class, but I found no line information there.

Any suggestions on where this information could be retrieved from?

Here is what I have found out so far:

PDFBox uses a resource file to bind PDF operators/instructions to certain classes, which then process the information.

If we take a look at the PDFTextStripper.properties resource file under:

pdfbox\src\main\resources\org\apache\pdfbox\resources\

we can see that, for instance, the BT operator is bound to the org.apache.pdfbox.util.operator.BeginText class, and so on.
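To make the binding idea concrete, here is a minimal sketch of such a lookup table using only java.util.Properties. It is not PDFBox's actual loading code. The class OperatorBindings, the SAMPLE text, and the ET entry are assumptions made for illustration; only the BT entry mirrors the binding described above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch of an operator-to-handler table loaded from properties text. */
final class OperatorBindings {

    // Illustrative content only. The BT line mirrors the binding mentioned above;
    // the ET line is assumed by analogy.
    static final String SAMPLE =
            "BT=org.apache.pdfbox.util.operator.BeginText\n"
          + "ET=org.apache.pdfbox.util.operator.EndText\n";

    /** Parses "operator=handler class name" lines into a Properties table. */
    static Properties load(String text) {
        Properties bindings = new Properties();
        try {
            bindings.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bindings;
    }

    /** Returns the handler class name bound to the operator, or null if none is bound. */
    static String handlerFor(Properties bindings, String operator) {
        return bindings.getProperty(operator);
    }

    public static void main(String[] args) {
        Properties bindings = load(SAMPLE);
        System.out.println("BT -> " + handlerFor(bindings, "BT"));
        // "re" (append rectangle) has no entry in this sample table,
        // so a processor driven by it would never see that operator.
        System.out.println("re -> " + handlerFor(bindings, "re"));
    }
}
```

An operator without an entry simply has no handler, which is the reason a text-only table never reports path drawing.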

The PDFTextStripper under

pdfbox\src\main\java\org\apache\pdfbox\util\

takes this into account and processes the PDF with these classes.

BUT all graphical objects are ignored, so no information about underlines or table structure is available!

Now if we take a look at the PageDrawer.properties resource file, we can see that it binds almost all available operators. It is used by the PageDrawer class under

pdfbox\src\main\java\org\apache\pdfbox\pdfviewer\

The "trick" is now to find out which graphical operators represent underlines and tables, and to use them in combination with PDFTextStripper.
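For orientation: in PDF content streams, straight lines are typically built with the path operators m (move to) and l (line to) and painted with S (stroke), or drawn as thin rectangles with re and painted with f (fill). Once such shapes have been collected as rectangles, matching them against text positions is plain geometry. The following is only a sketch of that matching step. The class UnderlineHeuristic, its method names, and both tolerance values are assumptions, and it expects text boxes and drawn shapes to be in the same coordinate system with y growing downward.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

/** Hypothetical heuristic: is a text box underlined by one of the drawn shapes? */
final class UnderlineHeuristic {

    // Assumed tolerances in page units; real documents may need tuning.
    static final double MAX_RULE_THICKNESS = 2.0; // thicker shapes count as boxes
    static final double MAX_GAP = 3.0;            // distance between text bottom and rule
    static final double MIN_COVERAGE = 0.8;       // share of the text width the rule must span

    /** A thin, wide rectangle is treated as a horizontal rule. */
    static boolean isHorizontalRule(Rectangle2D shape) {
        return shape.getHeight() <= MAX_RULE_THICKNESS
                && shape.getWidth() > shape.getHeight();
    }

    /** True if some horizontal rule lies close to the bottom edge of the text box. */
    static boolean isUnderlined(Rectangle2D text, List<Rectangle2D> drawnShapes) {
        for (Rectangle2D shape : drawnShapes) {
            if (!isHorizontalRule(shape)) {
                continue; // vertical lines and filled boxes are not underlines
            }
            // y grows downward, so getMaxY() is the bottom edge of the text.
            double gap = shape.getMinY() - text.getMaxY();
            double overlap = Math.min(text.getMaxX(), shape.getMaxX())
                    - Math.max(text.getMinX(), shape.getMinX());
            if (Math.abs(gap) <= MAX_GAP && overlap >= MIN_COVERAGE * text.getWidth()) {
                return true;
            }
        }
        return false;
    }
}
```

A rule that is much wider than the text, or that pairs with vertical lines to its left and right, hints at a table border rather than an underline; this sketch does not try to tell the two apart.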

Now this would mean reading the PDF file specification, which is currently way too much work.

If someone knows which operators are responsible for drawing underlines and table lines, please let me know.