Updated: 2022-06-16 21:17:11
You need to change two things.
When you pass a custom function to tm_map, you need to wrap it in content_transformer:
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus <- tm_map(ds.corpus, content_transformer(removeURL))
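To see what the removeURL pattern actually matches, here is a small base R sketch (no tm required). The sample strings are made up for illustration. It suggests that the pattern only strips a whole URL once punctuation has already been removed, which is the order used in the full pre-processing code in this answer.

```r
# Same pattern as in the answer: "http" followed by any run of letters/digits.
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Hypothetical sample text, before and after lowercasing + punctuation removal.
raw     <- "great phone see http://example.com/review"
cleaned <- "great phone see httpexamplecomreview"

# On the raw text the match stops at ":" (not alphanumeric), so only "http" goes.
raw_result <- removeURL(raw)

# On the cleaned text the whole run is alphanumeric, so the entire URL goes.
cleaned_result <- removeURL(cleaned)

print(raw_result)
print(cleaned_result)
```

If URLs must be removed before punctuation is stripped, a broader pattern would be needed; that is a design choice outside what the answer shows.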
The purpose of the stemCompletion function is to try to complete a stemmed word (https://en.wikipedia.org/wiki/Stemming) based on a dictionary. The stemmed words need to be a character vector, and the dictionary can be a corpus.
x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)
Output:
   compan     entit     suppl
"company"        ""  "supply"
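The idea of completing a stem against a dictionary can be sketched in base R. This is not tm's implementation: complete_stem is a hypothetical helper and the dictionary words are made up. It simply returns the first dictionary word that starts with each stem, or an empty string when nothing matches, whereas stemCompletion itself offers several strategies for choosing among candidates.

```r
# Hypothetical helper (an assumption, not part of tm): for each stem, return the
# first dictionary word beginning with that stem, or "" if there is none.
complete_stem <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) > 0) hits[1] else ""
  }, character(1))
}

# Made-up dictionary; in the answer the dictionary is the corpus itself.
dictionary <- c("company", "supply", "phone")

# Result is a character vector named by the input stems.
res <- complete_stem(c("compan", "entit", "suppl"), dictionary)
print(res)
```

A stem with no matching dictionary word comes back empty here, so the quality of the completion depends entirely on what the dictionary contains.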
Code used to create the term-document matrix:
# Data import
df.imp<- read.csv("data/Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)
##### Data Pre-Processing
#install.packages("tm")
library(tm)
ds.corpus<- Corpus(VectorSource(df.imp$Content))
ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))
stopwords.default<- stopwords("english")
stopWordsNotDeleted <- c("isn't", "aren't", "wasn't", "weren't", "hasn't",
                         "haven't", "hadn't", "doesn't", "don't", "didn't",
                         "won't", "wouldn't", "shan't", "shouldn't", "can't",
                         "cannot", "couldn't", "mustn't", "but", "no", "nor", "not", "too", "very")
stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )
tdm<- TermDocumentMatrix(ds.corpus)
copy<- ds.corpus ## creating a copy to be used as a dictionary
x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)