Detecting language with Stanford NLP

Updated: 2023-02-26 13:44:55

Almost certainly, Stanford CoreNLP has no language identification at the moment. 'Almost', because nonexistence is much harder to prove than existence.

Nevertheless, here is some circumstantial evidence:

  1. Language identification is not mentioned on the main page, on the CoreNLP page, in the FAQ (although it does include the question 'How do I run CoreNLP on other languages?'), or in the 2014 paper by CoreNLP's authors.
  2. Tools that combine several NLP libraries, including Stanford CoreNLP, use a different library for language identification, for example DKPro Core ASL. Other users discussing language identification and CoreNLP do not mention this capability either.
  3. The CoreNLP source contains Language classes, but nothing related to language identification. You can check all 84 occurrences of the word 'language' manually.

Try Tika, TextCat, or the Language Detection Library for Java (its authors report "99% over precision for 53 languages").

In general, quality depends on the size of the input text: if it is long enough (say, at least several words, and not specially chosen), precision can be pretty good, about 95%.
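To make the dependence on input length concrete, here is a minimal sketch of the character n-gram idea that detectors of this kind are generally built on. It is not the code of Tika, TextCat, or the Language Detection Library for Java, and it uses nothing from CoreNLP. The class name `TrigramLanguageGuesser`, its methods, and the three tiny training sentences are all hypothetical and chosen only for illustration. Real libraries train on large corpora and use more robust scoring.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy character-trigram language guesser (illustrative only; hypothetical class, not a library API). */
class TrigramLanguageGuesser {

    // Hypothetical, tiny training samples; a real detector would train on large corpora.
    private static final Map<String, Set<String>> PROFILES = new LinkedHashMap<>();
    static {
        PROFILES.put("en", trigrams(
            "the quick brown fox jumps over the lazy dog and the cat sat on the mat"));
        PROFILES.put("de", trigrams(
            "der schnelle braune fuchs springt hinter dem faulen hund und die katze sitzt auf der matte"));
        PROFILES.put("fr", trigrams(
            "le renard brun rapide saute par dessus le chien paresseux et le chat est assis sur le tapis"));
    }

    /** Lowercases the text, collapses runs of non-letters to one space, and returns the distinct trigrams. */
    static Set<String> trigrams(String text) {
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            result.add(normalized.substring(i, i + 3));
        }
        return result;
    }

    /** Fraction of the input's distinct trigrams that also occur in the given profile. */
    static double score(Set<String> input, Set<String> profile) {
        if (input.isEmpty()) {
            return 0.0;
        }
        int shared = 0;
        for (String trigram : input) {
            if (profile.contains(trigram)) {
                shared++;
            }
        }
        return (double) shared / input.size();
    }

    /** Returns the code of the best-scoring profile, or "unknown" when no trigram overlaps at all. */
    static String guess(String text) {
        Set<String> input = trigrams(text);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, Set<String>> entry : PROFILES.entrySet()) {
            double s = score(input, entry.getValue());
            if (s > bestScore) {
                bestScore = s;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("the dog and the cat"));
        System.out.println(guess("der hund und die katze"));
        System.out.println(guess("le chien et le chat"));
        System.out.println(guess("ok")); // too short to contain a single trigram
    }
}
```

With these toy profiles, an input of several words overlaps heavily with exactly one profile, while a very short input such as "ok" produces no trigrams and falls through to "unknown". That is a small-scale version of why precision improves as the input gets longer.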