Can Apache Tika extract foreign-language text such as Chinese or Japanese?

Updated: 2021-11-13 09:02:06

Apache Tika can extract Unicode text from any of its supported file formats. As long as the file format can store Unicode text (e.g. Chinese or Japanese characters), Apache Tika can extract it.

Tika also includes a number of unit tests that verify this works. One such test uses a sample Chinese email, testMSG_chinese.msg. If we run the command-line Tika app on it and grab the first few lines, we can see it working:

$ java -jar tika-app-1.4.jar --text testMSG_chinese.msg | head
Alfresco MSG format testing ( MSG 格式測試 )
    From
    Tests Chang@FT (張毓倫)
    To
    Tests Chang@FT (張毓倫)
    Recipients
    tests.chang@fengttt.com

Or with a Japanese document, testRTFJapanese.rtf:

$ java -jar tika-app-1.4.jar --text testRTFJapanese.rtf | head -2
ゾルゲの処刑記録、
ゾルゲと尾崎、淡々と最期 

You just need to ensure that any text output you generate is stored in a suitable encoding (e.g. UTF-8), and that the font you use to display it supports those glyphs!
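To illustrate why the storage encoding matters, here is a minimal sketch using only the Java standard library. It does not call Tika at all: the string in `main` merely stands in for text that an extractor might return, and the class name `Utf8RoundTrip` and the method `roundTrip` are hypothetical names chosen for this example. The sketch writes the text to a temporary file in a given charset and reads it back, so you can compare a Unicode-capable encoding against one that cannot represent CJK characters.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTrip {

    // Writes the text to a temporary file using the given charset,
    // then reads the file back with the same charset.
    public static String roundTrip(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("extracted", ".txt");
            try {
                // Characters the charset cannot represent are replaced
                // (with '?' for ISO-8859-1) when encoding to bytes.
                Files.write(file, text.getBytes(charset));
                return new String(Files.readAllBytes(file), charset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for extracted text (assumption: not produced by Tika here).
        String extracted = "MSG 格式測試 / ゾルゲの処刑記録";

        String viaUtf8 = roundTrip(extracted, StandardCharsets.UTF_8);
        String viaLatin1 = roundTrip(extracted, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 preserved:      " + extracted.equals(viaUtf8));
        System.out.println("ISO-8859-1 preserved: " + extracted.equals(viaLatin1));
    }
}
```

With UTF-8 the text survives the round trip unchanged, whereas with ISO-8859-1 the Chinese and Japanese characters are lost at write time and cannot be recovered later, regardless of which font is used to display the file.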