Updated: 2023-12-04 22:50:28
I am new to data mining. I would like to do some data mining, but my data is not in English; it is Japanese or Chinese text.
Does data mining support these languages? If so, how can it be done? Pointers to any tools or blogs would be welcome.
I would appreciate any help.
The answer is as usual: Yes and no.
While there are in fact no theoretical problems, there are some practical problems with Asian languages. A typical data mining pipeline for text consists of a sequence of steps.
The first and fourth steps, tokenisation and stemming, do in fact pose a problem in some Asian languages. In European languages, especially English, tokenisation is comparatively easy: a word in English starts at a space and ends at a space. In some Asian languages you cannot tokenise a sequence of characters into words without understanding the meaning of the sentence. In fact, in some languages it is extremely hard (cf. Wikipedia on tokenisation: "Tokenization is particularly difficult for languages written in scriptio continua which exhibit no word boundaries such as Ancient Greek, Chinese, or Thai.").
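To make the tokenisation problem concrete, here is a minimal sketch in Python. It compares plain whitespace splitting with a greedy dictionary-based segmenter (forward maximum matching). The function names and the tiny vocabulary are made up for illustration; this is not how any particular library does it, and real segmenters use much larger dictionaries or statistical models.

```python
def whitespace_tokenize(text):
    """Split on whitespace; adequate as a first pass for English."""
    return text.split()


def forward_max_match(text, vocabulary, max_len=4):
    """Greedy longest-match segmentation against a known vocabulary.

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"我", "喜欢", "数据", "挖掘", "数据挖掘"}

english = "I like data mining"
chinese = "我喜欢数据挖掘"

print(whitespace_tokenize(english))           # four separate words
print(whitespace_tokenize(chinese))           # one unsplit chunk
print(forward_max_match(chinese, TOY_VOCAB))  # dictionary-based split
```

Whitespace splitting leaves the Chinese sentence as a single token, while the dictionary lookup recovers word-like units. Greedy matching still picks the wrong split for ambiguous character sequences, which is where understanding the sentence comes in.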
Stemming might also pose a problem. In English it is extremely well understood. In other languages it depends.
If you can solve these two problems, you can apply the typical text mining techniques to Asian languages as well.
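As a rough illustration of what stemming does in English, here is a toy suffix-stripping sketch. The suffix list and the `naive_stem` helper are assumptions made for this example; it is deliberately much simpler than an established algorithm such as the Porter stemmer.

```python
# Toy suffix rules for English, checked in order; not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[:-len(suffix)]
    return lowered


for word in ["walking", "walked", "walks", "walk"]:
    print(word, "->", naive_stem(word))
```

All four English forms collapse to the same stem. The rules are tied to English spelling, so a Chinese token such as 数据 passes through unchanged; other languages need their own normalisation rules, or none at all.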
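Once the text has been turned into token lists, the later steps no longer care which language the tokens came from. A minimal sketch, assuming the documents are already segmented (the `term_frequencies` helper and the sample documents are made up for illustration):

```python
from collections import Counter


def term_frequencies(tokenized_docs):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    return counts


# Hypothetical, already-segmented documents in two languages.
docs = [
    ["我", "喜欢", "数据挖掘"],
    ["数据挖掘", "很", "有趣"],
    ["i", "like", "data", "mining"],
]

for token, count in term_frequencies(docs).most_common(3):
    print(token, count)
```

The same counting code serves the Chinese and the English documents; only the step that produces the token lists differs.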