且构网

Stemming with R text analysis

Updated: 2022-06-08 22:23:32

We could set up a list of synonyms and replace those values. For example:

synonyms <- list(
    list(word="account", syns=c("acount", "accounnt"))
)

This says we want to replace "acount" and "accounnt" with "account" (I'm assuming we do this after stemming). Now let's create some test data:

raw <- c("accounts", "account", "accounting", "acounting",
         "acount", "acounts", "accounnt")

Now let's define a transformation function that replaces the words in our list with the primary synonym:

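library(tm)

# Wrap a plain function so tm_map can apply it to each document's content.
replaceSynonyms <- content_transformer(function(x, syn = NULL) {
    # For each entry, replace whole-word matches of any synonym with the primary word.
    Reduce(function(a, b) {
        pattern <- paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b")
        gsub(pattern, b$word, a)
    }, syn, x)
})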

Here we use the content_transformer function to define a custom transformation. Inside it, we simply call gsub once per synonym entry to replace each of the misspelled words. We can then use this on a corpus:

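# Named "corpus" rather than "tm" to avoid confusion with the package name.
corpus <- Corpus(VectorSource(raw))
corpus <- tm_map(corpus, stemDocument)               # stem first
corpus <- tm_map(corpus, replaceSynonyms, synonyms)  # then map synonyms
inspect(corpus)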
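
To see what the replacement step does on its own, here is a minimal base-R sketch that applies the same Reduce and gsub logic to a plain character vector, with no corpus involved. The helper name replace_synonyms_plain is hypothetical, and the stemmed vector is an assumed stand-in for the stems the Porter stemmer would be expected to produce from the test data.

```r
synonyms <- list(
    list(word = "account", syns = c("acount", "accounnt"))
)

# Hypothetical helper (not part of tm): apply every synonym entry in turn.
replace_synonyms_plain <- function(x, syn) {
    Reduce(function(a, b) {
        # Builds a whole-word alternation, e.g. "\\b(acount|accounnt)\\b".
        pattern <- paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b")
        gsub(pattern, b$word, a)
    }, syn, x)
}

# Assumed stems, written out by hand for illustration.
stemmed <- c("account", "acount", "accounnt")
result <- replace_synonyms_plain(stemmed, synonyms)
print(result)

# The \\b anchors mean an unstemmed word is left alone.
unstemmed <- replace_synonyms_plain("acounting", synonyms)
print(unstemmed)
```

Because the pattern is anchored on word boundaries, only whole words are replaced, which is why the replacement is run after stemming rather than before.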

We can see that all of these values are transformed into "account", as desired. To add other synonyms, just add further lists to the main synonyms list. Each sub-list should have the names "word" and "syns".
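
As a rough illustration of extending the list, the sketch below adds a second entry and checks that every sub-list carries both names. The "address" entry is hypothetical and not from the original answer. Since replacement runs after stemming, the syns values should be the stemmed forms of the misspellings.

```r
synonyms <- list(
    list(word = "account", syns = c("acount", "accounnt")),
    # Hypothetical second entry, added purely for illustration.
    list(word = "address", syns = c("adress"))
)

# Check that every sub-list has both of the expected names.
valid <- vapply(
    synonyms,
    function(s) all(c("word", "syns") %in% names(s)),
    logical(1)
)
print(valid)
```

A check like this can catch a misspelled name such as "syn", which would otherwise make b$syns NULL inside the transformation.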