Passing a term-document matrix to a Gensim LDA model

Updated: 2023-02-27 08:56:50

I believe Gensim uses much the same structure to represent a bag-of-words corpus, but I don't think a default dictionary or a NumPy array would be compatible as-is. Gensim's API lists a few "corpus readers" that can accommodate various formats, but those seem to be built for importing data from other toolkits. So in your case the easiest solution may be to reconstruct the documents from your matrix and dictionary as lists of separate token strings, then convert those lists to Gensim's bag-of-words corpus, and finally pass that corpus to LDA, as shown in the tutorials.
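A minimal sketch of those steps, assuming the matrix has one row per document and one column per term (transpose it first if yours is laid out the other way), and that `vocabulary[i]` is the word for column `i`. The names `rebuild_documents` and `fit_lda` are made up for illustration, and the `num_topics` and `passes` values are placeholders rather than recommendations.

```python
def rebuild_documents(matrix, vocabulary):
    """Turn a document-term count matrix into one token list per document.

    Assumption: matrix[d][t] is the count of vocabulary[t] in document d.
    """
    docs = []
    for row in matrix:
        tokens = []
        for term_id, count in enumerate(row):
            # Repeat each word as many times as it was counted.
            tokens.extend([vocabulary[term_id]] * int(count))
        docs.append(tokens)
    return docs


def fit_lda(docs, num_topics=2):
    """Build a Gensim dictionary and bag-of-words corpus, then train LDA."""
    # Imported here so the rebuild step can be used without Gensim installed.
    from gensim import corpora, models

    dictionary = corpora.Dictionary(docs)
    # Each document becomes a list of (token_id, count) pairs.
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = models.LdaModel(
        corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=10,
        random_state=0,
    )
    return dictionary, corpus, lda
```

With a matrix and vocabulary in hand, something like `dictionary, corpus, lda = fit_lda(rebuild_documents(matrix, vocabulary))` would run the whole chain. Word order is lost in the rebuilt documents, which should not matter for a bag-of-words model such as LDA.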

This approach has the added benefit that you can apply Gensim's preprocessing functions and filter out words with very low or very high frequencies.