且构网

How to detect the language of a document

Updated: 2023-02-26 13:39:58

Documents do not divide cleanly into English and non-English. What if a document is written in three languages and one of them is English? How would you classify such a document? You should have explained what you want.

Now, many English writers quote Latin expressions in their texts. (Expressions from other languages, too, but especially Latin.) Most such Latin phrases can be written in pure ASCII. Moreover, properly typeset English uses many characters outside ASCII, such as typographically correct quotation marks and dashes (“ ”, —), and a lot more. Even at the level of individual characters it's not so easy. Believe it or not, the "ligature ash" Æ (http://en.wikipedia.org/wiki/%C3%86) is English! See http://en.wikipedia.org/wiki/English_alphabet.

How would you classify such a document? And you won't be able to analyze such quotations based on the classification of code points, as I've demonstrated above.
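
To make the point about code points concrete, here is a minimal Python sketch. The sample sentences and the function names are made up for illustration. It shows that an "ASCII means English" test accepts a run of Latin phrases and rejects ordinary English text that uses typographic quotes, a dash, or the letter æ.

```python
def is_ascii_only(text):
    """Return True if every code point in text lies in the ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def non_ascii_chars(text):
    """List the distinct non-ASCII characters in text, in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) >= 128 and ch not in seen:
            seen.append(ch)
    return seen


# Hypothetical sample texts, written with escapes so the characters are explicit.
samples = {
    # Latin phrases: not English, yet pure ASCII.
    "latin_phrase": "exempli gratia, ad hoc, et cetera",
    # English with curly quotes (U+201C, U+201D) and an em dash (U+2014).
    "english_typographic": "\u201cQuite so,\u201d she said \u2014 and left.",
    # English with the ligature ash (U+00E6).
    "english_ligature": "The encyclop\u00e6dia was on the shelf.",
}

for name, text in samples.items():
    print(name, is_ascii_only(text), non_ascii_chars(text))
```

A code point test gets every one of these samples wrong if it is read as a language test, which is all the sketch is meant to show.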

There are other cases. For example, many Polish words use the same subset of code points as English. The exceptions are words with letters like "Ł" or "ę". So, one can find words which mean one thing in Polish and another, maybe completely different, thing in English. The very same word can be Polish or English, depending on context.
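
As a small illustration of that overlap, the Python sketch below intersects two tiny hand-picked word lists. The lists are assumptions for the example only, not real dictionaries. Every shared word is pure ASCII and valid in both languages (for instance, "list" is a letter in Polish and "but" is a shoe), so neither code points nor a dictionary lookup of the single word can settle which language it belongs to.

```python
# Tiny hand-picked word lists, for illustration only; real dictionaries
# would hold many thousands of entries.
ENGLISH_WORDS = {"to", "ten", "pan", "but", "list", "brat", "the", "and", "house"}
POLISH_WORDS = {"to", "ten", "pan", "but", "list", "brat", "łódź", "dziękuję", "się"}


def ambiguous_words(words_a, words_b):
    """Return the words present in both lists, sorted alphabetically."""
    return sorted(words_a & words_b)


shared = ambiguous_words(ENGLISH_WORDS, POLISH_WORDS)
print(shared)

# Every shared word uses ASCII letters only, so a code point test
# cannot tell the Polish reading from the English one.
print(all(word.isascii() for word in shared))

# Only some Polish words give themselves away by their letters.
print(sorted(word for word in POLISH_WORDS if not word.isascii()))
```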

Honestly, such a problem cannot be solved by purely technical means. Unlike code point ranges, the language is not marked anywhere. Language is something completely different from a writing system or a script. In HTML there is a "lang" attribute, but nobody is obliged to use it. Potentially, such a problem can only be solved by creating a powerful expert system which uses comprehensive dictionaries of many languages and sets of grammar rules. The results of the analysis can only be expressed in terms of fuzzy set theory or fuzzy logic (http://en.wikipedia.org/wiki/Fuzzy_set, http://en.wikipedia.org/wiki/Fuzzy_logic): "this text is English with 96.4% certainty". Something like that.
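
To show the shape of such a graded answer, here is a deliberately crude Python sketch. It is nowhere near the expert system described above: the word lists, the `WORDLISTS` name and the scoring rule are all assumptions made for the example, and there are no grammar rules at all. For each language it reports the fraction of words found in that language's list, which gives a rough degree of membership instead of a yes/no verdict.

```python
import re

# Hypothetical miniature "dictionaries"; a real system would need
# comprehensive word lists plus grammar rules.
WORDLISTS = {
    "english": {"the", "and", "of", "is", "in", "to", "with", "this", "text"},
    "polish": {"i", "w", "nie", "to", "jest", "się", "na", "z", "ten", "tekst"},
    "latin": {"et", "in", "est", "non", "ad", "cum", "qui", "sed"},
}


def tokenize(text):
    """Split text into lowercase words made of Unicode letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def language_scores(text):
    """Return, per language, the fraction of words found in its word list.

    The result is a graded score in [0, 1] for every language at once,
    not a single label, and not a calibrated probability.
    """
    words = tokenize(text)
    if not words:
        return {lang: 0.0 for lang in WORDLISTS}
    return {
        lang: sum(word in vocab for word in words) / len(words)
        for lang, vocab in WORDLISTS.items()
    }


scores = language_scores("This text is in English, with a Latin phrase: et cetera.")
for lang, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: {score:.2f}")
```

Note that the scores do not have to add up to 1: a word such as "in" counts for both English and Latin, which is exactly the kind of overlap that makes a crisp classification impossible.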

Interesting? Care to delve into that? Good luck then.

—SA