Is it possible to extract table information using Apache Tika?

Updated: 2023-01-29 20:08:54

I am looking for a parser for PDF and MS Office document formats to extract tabular information from files. I was thinking of writing separate implementations when I came across Apache Tika. I am able to extract the full text from any of these file formats, but my requirement is to extract tabular data, where I expect two columns in a key-value format. I checked most of what is available on the net for a solution but could not find one. Any pointers?

Well, I went ahead and implemented it separately using Apache POI for the MS formats. I came back to Tika for PDF. What Tika does with the documents is output them as "SAX based XHTML events".

So basically we can write a custom SAX implementation to parse the file.

The structured text output will be of the following form (metadata omitted):

<body><div class="page"><p/>
<p>Key1 Value1 </p>
<p>Key2 Value2 </p>
<p>Key3 Value3</p>
<p/>
</div>
</body>

In our SAX implementation we can treat the first part of each paragraph as the key. (For my problem I already know the keys and I am looking for the values, so the value is simply the substring that follows the key.)

Put this logic in an override of public void characters(char[] ch, int start, int length).
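As a rough illustration of the approach, here is a minimal sketch of such a handler. The names KeyValueHandler, knownKeys and extract are hypothetical and not from the original post. It assumes each key and its value arrive inside a single <p> element, as in the sample output above. Because a SAX parser may deliver the text of one element in several characters() calls, this sketch only buffers text in characters() and does the key matching in endElement(), which is a small variation on doing everything in characters(). To stay self-contained it feeds the sample XHTML through the JDK's own SAX parser instead of Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects "Key Value" paragraphs for a fixed set of known keys.
class KeyValueHandler extends DefaultHandler {
    private final List<String> knownKeys;
    private final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder paragraph = new StringBuilder();
    private boolean inParagraph = false;

    KeyValueHandler(List<String> knownKeys) {
        this.knownKeys = knownKeys;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            inParagraph = true;
            paragraph.setLength(0); // start a fresh buffer for this <p>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so only accumulate here.
        if (inParagraph) {
            paragraph.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!isParagraph(localName, qName)) {
            return;
        }
        inParagraph = false;
        String line = paragraph.toString().trim();
        for (String key : knownKeys) {
            if (line.startsWith(key)) {
                // The value is whatever follows the known key.
                values.put(key, line.substring(key.length()).trim());
                break;
            }
        }
    }

    // Works with both namespace-aware parsers (localName) and plain ones (qName).
    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    Map<String, String> getValues() {
        return values;
    }

    // Convenience method for trying the handler on an XHTML string without Tika.
    static Map<String, String> extract(String xhtml, List<String> keys) throws Exception {
        KeyValueHandler handler = new KeyValueHandler(keys);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getValues();
    }
}
```

With Tika itself, the same handler instance would be passed as the ContentHandler argument of the parser's parse(InputStream, ContentHandler, Metadata, ParseContext) method, and getValues() read afterwards. Note that the prefix match is naive: if one known key is a prefix of another (say Key1 and Key10), the longer keys should be checked first.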

Please note that in my case the structure of the content is fixed and I know the keys that are coming in, so it was easy to do it this way. This is not a generic solution.
