且构网

Does data mining support languages other than English?

Updated: 2023-12-04 22:50:28

I am new to data mining. I would like to do some data mining, but the data is not in English; it is Japanese or Chinese text.

Does data mining support these languages? If so, how can it be done? Are there any tools or blogs you would recommend?

I would appreciate any help.

The answer is, as usual: yes and no.

While there are in fact no theoretical problems, there are some practical problems with Asian languages. A typical data mining pipeline for text consists of:

  • stemming (running -> run)
  • removal of stop words (a, the,...) and other words which do not help
  • enrichment steps, e.g., phrase detection
  • tokenization
  • transformation into a bag of words (Hello World, Hello Japan -> (Hello:2, World:1, Japan:1)), which counts the frequency of each word
  • application of your favourite text mining technique, such as LDA or SVMs
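As a rough illustration of the steps listed above, here is a minimal Python sketch for English text. The stop-word list, the `naive_stem` helper, and the function names are assumptions made up for this example; the stemmer is a deliberately crude stand-in for a real one (it gets many words wrong), not a recommended implementation. The sketch tokenizes first simply because the other steps here operate on individual tokens, and it lowercases everything, so the counts come out keyed by `hello` rather than `Hello`.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to"}


def tokenize(text):
    """Lowercase the text and pull out runs of ASCII letters.

    This only works for languages that separate words with spaces
    or punctuation.
    """
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Crude suffix stripping: removes a trailing 'ing' and undoes a
    doubled final consonant, e.g. 'running' -> 'runn' -> 'run'.
    A real stemmer handles far more cases than this."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    return word


def bag_of_words(documents):
    """Count how often each stemmed, non-stop-word token occurs."""
    counts = Counter()
    for doc in documents:
        for token in tokenize(doc):
            if token in STOP_WORDS:
                continue
            counts[naive_stem(token)] += 1
    return counts


if __name__ == "__main__":
    print(bag_of_words(["Hello World", "Hello Japan"]))
    print(naive_stem("running"))
```

The resulting counts are what you would then feed into a technique such as LDA or an SVM.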

The first and fourth steps do in fact pose a problem in some Asian languages. In European languages, especially English, tokenization is straightforward: a word starts at a space and ends at a space. In some Asian languages you cannot tokenize a sequence of characters into words without understanding the meaning of the sentence. In fact, in some languages it is extremely hard. (Cf. Wikipedia on tokenization: "Tokenization is particularly difficult for languages written in scriptio continua which exhibit no word boundaries such as Ancient Greek, Chinese,[1] or Thai.")
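To make the tokenization problem concrete, the following Python sketch compares splitting on whitespace with a simple dictionary-based segmenter (forward maximum matching). The sample sentence, the tiny dictionary, and the function names are assumptions chosen for illustration. Maximum matching is only a heuristic: it depends entirely on the dictionary and does not understand the sentence, so it can pick the wrong split when a character sequence is ambiguous.

```python
def whitespace_tokenize(text):
    """Split on whitespace, which is enough for English-like text."""
    return text.split()


def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary
    entry that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary; real segmenters use very large lexicons
# or statistical models.
DICTIONARY = {"我", "喜欢", "数据", "挖掘", "数据挖掘"}
SENTENCE = "我喜欢数据挖掘"  # "I like data mining"

if __name__ == "__main__":
    print(whitespace_tokenize("I like data mining"))
    print(whitespace_tokenize(SENTENCE))
    print(forward_max_match(SENTENCE, DICTIONARY))
```

Splitting on whitespace returns the whole Chinese sentence as a single token, while the dictionary-based segmenter recovers three words from it, given that every word happens to be in the dictionary.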

Stemming might also pose a problem. In English it is extremely well understood; in other languages, it depends on the language.

If you can solve these two problems, you can apply the typical text mining techniques to Asian languages as well.