How to use the stemCompletion function (tm package) to complete a stemmed corpus from a dictionary

Updated: 2022-06-16 21:17:11

You need to change two things.

  1. When you use a custom function with tm_map, you need to wrap it in content_transformer.

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus <- tm_map(ds.corpus, content_transformer(removeURL))
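A minimal sketch of this step on a toy corpus, assuming the tm package is installed. The sample strings are made up for illustration and do not come from the original data. Note that the pattern only consumes alphanumeric characters after "http", so it removes a whole URL only once punctuation has already been stripped.

```r
library(tm)

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Called directly on character strings (hypothetical sample text)
raw_url <- removeURL("visit http://example.com today")
flat_url <- removeURL("visit httpexamplecom today")
print(raw_url)   # only "http" is removed; "://example.com" remains
print(flat_url)  # the whole punctuation-free URL is removed

# Applied to a corpus: content_transformer lets the plain character
# function operate on each document's content and return a document
corpus <- VCorpus(VectorSource(c("great phone httpexamplecom love it",
                                 "no link here")))
cleaned <- tm_map(corpus, content_transformer(removeURL))
print(content(cleaned[[1]]))
```

This is why, in a pipeline like the one in this answer, removeURL is best applied after removePunctuation.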

The purpose of the stemCompletion function is to try to complete a stemmed word (https://en.wikipedia.org/wiki/Stemming) based on a dictionary. The stemmed words need to be a character vector, and the dictionary can be a corpus.

x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)

Output:

   compan     entit     suppl 
"company"        ""  "supply" 
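The call above depends on the corpus used as the dictionary. As a self-contained sketch, the dictionary can also be a plain character vector. The word list below is hypothetical and chosen only for illustration. The type argument selects among several candidate completions, and "shortest" is used here.

```r
library(tm)

stems <- c("compan", "entit", "suppl")

# Hypothetical dictionary of full words; a corpus would also be accepted
dictionary <- c("company", "companies", "supply", "supplier")

# For each stem, pick the shortest dictionary word that starts with it
completed <- stemCompletion(stems, dictionary, type = "shortest")
print(completed)
```

No dictionary word starts with "entit", so that stem has no completion. How an unresolved stem is reported (an empty string or NA) may depend on the tm version and the type chosen, so it is worth checking for both.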

Code used to create the term-document matrix

# Data import
df.imp <- read.csv("data/Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

##### Data Pre-Processing

# install.packages("tm")
library(tm)

ds.corpus <- Corpus(VectorSource(df.imp$Content))

# Character-level functions are wrapped in content_transformer
ds.corpus <- tm_map(ds.corpus, content_transformer(tolower))
ds.corpus <- tm_map(ds.corpus, content_transformer(removePunctuation))
ds.corpus <- tm_map(ds.corpus, content_transformer(removeNumbers))
# Punctuation and digits are already gone here, so a URL is a single run of letters starting with "http"
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus <- tm_map(ds.corpus, content_transformer(removeURL))

stopwords.default <- stopwords("english")
# Negations and intensifiers that should stay in the text
stopWordsNotDeleted <- c("isn't", "aren't", "wasn't", "weren't", "hasn't",
                         "haven't", "hadn't", "doesn't", "don't", "didn't",
                         "won't", "wouldn't", "shan't", "shouldn't", "can't",
                         "cannot", "couldn't", "mustn't", "but", "no", "nor", "not", "too", "very")

stopWord.new <- stopwords.default[!stopwords.default %in% stopWordsNotDeleted] ## new stopwords list
ds.corpus <- tm_map(ds.corpus, removeWords, stopWord.new)

tdm <- TermDocumentMatrix(ds.corpus)

Example of completing stemmed words

copy <- ds.corpus ## creating a copy to be used as a dictionary
x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)
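Since stemCompletion works on a character vector, completing a whole stemmed corpus means applying it to the tokens of each document. The following is only a sketch of one way to combine the two points above, not the tm package's own method. The helper completeText, the sample documents, and the dictionary are all hypothetical.

```r
library(tm)

# Hypothetical dictionary of full words (a corpus would also work)
dictionary <- c("company", "entity", "supply", "phone")

# Hypothetical helper: complete every whitespace-separated token in a text
completeText <- function(x, dictionary) {
  tokens <- unlist(strsplit(x, "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(x)
  completed <- stemCompletion(tokens, dictionary, type = "first")
  # Fall back to the stem itself when no completion was found
  unresolved <- is.na(completed) | !nzchar(completed)
  completed[unresolved] <- tokens[unresolved]
  paste(unname(completed), collapse = " ")
}

# Hypothetical stemmed documents
stemmed <- VCorpus(VectorSource(c("compan suppl phone", "unknownstem entit")))

# The custom function is wrapped in content_transformer;
# the extra argument is passed through tm_map to the helper
completedCorpus <- tm_map(stemmed, content_transformer(completeText),
                          dictionary = dictionary)
print(content(completedCorpus[[1]]))
print(content(completedCorpus[[2]]))
```

Keeping the original stem when no completion exists avoids silently dropping words from the documents.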