且构网

Is there an API for text analysis/mining in Java?

Updated: 2023-01-25 17:16:44

For example, you might use some classes from the standard library package java.text, or use StreamTokenizer (which you can customize to your requirements). But as you know, text data from internet sources usually contains many spelling mistakes, so for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities are too limited in that context.
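To make the StreamTokenizer option concrete, here is a minimal sketch of customizing its syntax table. The class name `StreamTokenizerDemo`, the `tokenize` helper, and the choice of which characters count as word characters are my own assumptions for illustration, not something the standard library prescribes.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class StreamTokenizerDemo {

    // Hypothetical helper: returns lower-cased word tokens, dropping punctuation.
    static List<String> tokenize(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.resetSyntax();              // clear the default syntax table
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');        // treat digits as word characters, not numbers
        st.wordChars('\'', '\'');      // keep apostrophes inside words ("don't")
        st.whitespaceChars(0, ' ');    // control characters and space separate tokens
        st.lowerCaseMode(true);        // word tokens are lower-cased automatically

        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(st.sval);
                }
                // Everything else (punctuation) arrives as an ordinary
                // single-character token and is skipped here.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, don't, panic, 42, times]
        System.out.println(tokenize("Hello, World! Don't panic: 42 times."));
    }
}
```

Note that this only splits text; it does nothing about misspelled words, which is exactly the limitation mentioned above.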

So I'd advise you to use regular expressions (java.util.regex) and write your own tokenizer tailored to your needs.
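A hand-rolled regex tokenizer along these lines might look like the sketch below. The token classes (URLs, numbers, words), the `RegexTokenizer` name, and the rule that squeezes runs of three or more identical letters down to two are all assumptions chosen for illustration; the last one is just one cheap way to be a bit more tolerant of noisy internet text, not a real fuzzy matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {

    // Alternatives are tried left to right: URLs first, then numbers, then words.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<url>https?://\\S+)"                  // URLs are kept whole
        + "|(?<num>\\d+(?:[.,]\\d+)*)"           // 42, 3.14, 1,000
        + "|(?<word>\\p{L}+(?:['-]\\p{L}+)*)");  // words with inner ' or -

    // Three or more repeats of the same letter, e.g. the "ooooo" in "sooooo".
    private static final Pattern LONG_RUN = Pattern.compile("(\\p{L})\\1{2,}");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String word = m.group("word");
            if (word != null) {
                // Normalize words only: lower-case and squeeze long letter runs.
                String lower = word.toLowerCase(Locale.ROOT);
                tokens.add(LONG_RUN.matcher(lower).replaceAll("$1$1"));
            } else {
                // URLs and numbers are passed through unchanged.
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [soo, good, see, http://example.com, for, 1,000, more]
        System.out.println(tokenize("Sooooo goood, see http://example.com for 1,000 more"));
    }
}
```

Because `\S+` is greedy, a URL directly followed by punctuation keeps that punctuation; you would tighten the URL alternative if that matters for your data.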

P.S. Depending on your needs, you can write a state-machine parser to recognize templated parts in raw text. The picture below shows a simple state-machine recognizer (you can construct a more advanced parser that recognizes much more complex templates in text).
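As a rough code counterpart to the state-machine idea, here is a small sketch. It is not a transcription of the picture: the template it recognizes (dates shaped like YYYY-MM-DD), the state names, and the `DateRecognizer` class are hypothetical choices of mine to show the mechanics.

```java
import java.util.ArrayList;
import java.util.List;

public class DateRecognizer {

    // One state per part of the (assumed) template YYYY-MM-DD.
    private enum State { START, YEAR, MONTH, DAY }

    static List<String> findDates(String text) {
        List<String> found = new ArrayList<>();
        State state = State.START;
        int begin = 0;   // index where the current candidate started
        int digits = 0;  // digits consumed so far in the current state

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            boolean failed = false;

            switch (state) {
                case START:
                    if (digit) { state = State.YEAR; begin = i; digits = 1; }
                    break;
                case YEAR:
                    if (digit && digits < 4) digits++;
                    else if (c == '-' && digits == 4) { state = State.MONTH; digits = 0; }
                    else failed = true;
                    break;
                case MONTH:
                    if (digit && digits < 2) digits++;
                    else if (c == '-' && digits == 2) { state = State.DAY; digits = 0; }
                    else failed = true;
                    break;
                case DAY:
                    if (digit) {
                        digits++;
                        if (digits == 2) {           // template complete: accept
                            found.add(text.substring(begin, i + 1));
                            state = State.START;
                        }
                    } else {
                        failed = true;
                    }
                    break;
            }

            if (failed) {
                // Restart one character after the failed candidate's start,
                // so a match overlapping the failed prefix is not missed.
                i = begin;
                state = State.START;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [2023-01-25, 1999-12-31]
        System.out.println(findDates("Updated 2023-01-25, next 2024-2-3, then 1999-12-31."));
    }
}
```

This sketch checks only the shape of the template: it does not validate ranges (a month of 13 is accepted) and it does not require a word boundary around the match. A more advanced parser would add states or checks for that.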