Memory-efficient LDA training using the gensim library

Last updated: 2023-02-27 08:57:08

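Consider wrapping your corpus in an iterable object and passing that instead of a list. A plain generator will not work: it is exhausted after a single pass, and gensim may need to iterate over the corpus more than once.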

From the tutorial:

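import gensim

class MyCorpus(object):
    def __iter__(self):
        # fname and dictionary (a gensim.corpora.Dictionary) are assumed to exist already.
        # The file is reopened on every pass, so the corpus can be iterated repeatedly.
        with open(fname) as f:
            for line in f:
                # assume there's one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())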

corpus = MyCorpus()
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, 
                                      id2word=dictionary,
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
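
To illustrate why an iterable works where a generator does not, here is a minimal, gensim-free sketch. The names StreamedLines, line_generator and count_docs_per_pass are hypothetical helpers made up for this illustration, and the two-line demo file is invented. It only assumes that a trainer walks over the corpus more than once.

```python
import os
import tempfile


class StreamedLines:
    """Re-iterable: every call to __iter__ reopens the file and starts over."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


def line_generator(path):
    """One-shot generator: exhausted after the first full pass."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.lower().split()


def count_docs_per_pass(corpus, passes=2):
    """Iterate over the corpus `passes` times, as a multi-pass trainer would."""
    return [sum(1 for _ in corpus) for _ in range(passes)]


# Invented demo data: one document per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("human machine interface\ngraph minors survey\n")
    demo_path = tmp.name

iterable_counts = count_docs_per_pass(StreamedLines(demo_path))
generator_counts = count_docs_per_pass(line_generator(demo_path))
os.remove(demo_path)

print(iterable_counts)   # [2, 2]
print(generator_counts)  # [2, 0]
```

The class sees both documents on every pass, while the generator yields nothing the second time around. In both cases only one line is held in memory at a time.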

Additionally, Gensim has several different corpus formats readily available, which can be found in the API reference. You might consider using TextCorpus, which should fit your format nicely already:

corpus = gensim.corpora.TextCorpus(fname)
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, 
                                      id2word=corpus.dictionary, # TextCorpus can build the dictionary for you
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
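
As a rough picture of what a corpus that "builds the dictionary for you" has to do, the sketch below scans the file once to assign token ids and then streams bag-of-words vectors on every later pass. SelfIndexingCorpus is a hypothetical stand-in written in plain Python; it is not gensim's TextCorpus implementation, and the sample text is invented.

```python
import os
import tempfile
from collections import Counter


class SelfIndexingCorpus:
    """Hypothetical stand-in, not gensim code: builds token ids, then streams."""

    def __init__(self, path):
        self.path = path
        self.token2id = {}
        # First pass: give each previously unseen token the next integer id.
        for tokens in self._tokenized_lines():
            for token in tokens:
                self.token2id.setdefault(token, len(self.token2id))

    def _tokenized_lines(self):
        # Assumes one document per line, tokens separated by whitespace.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()

    def __iter__(self):
        # Later passes: yield sorted (token_id, count) pairs, one document at a time.
        for tokens in self._tokenized_lines():
            counts = Counter(self.token2id[token] for token in tokens)
            yield sorted(counts.items())


# Invented demo data.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("apple banana apple\nbanana cherry\n")
    sample_path = tmp.name

sample_corpus = SelfIndexingCorpus(sample_path)
first_pass = list(sample_corpus)
second_pass = list(sample_corpus)
os.remove(sample_path)

print(sample_corpus.token2id)  # {'apple': 0, 'banana': 1, 'cherry': 2}
print(first_pass)              # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

The cost of this design is one extra read of the file up front; in exchange, neither the vocabulary pass nor the training passes ever load the whole corpus into memory.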