Updated: 2023-02-27 08:57:08
Consider wrapping your corpus up as an iterable and passing that instead of a list (a generator will not work).
From the tutorial:
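    import gensim

    # `fname` (path to the corpus file) and `dictionary` (a gensim.corpora.Dictionary)
    # are assumed to be defined earlier.
    class MyCorpus(object):
        def __iter__(self):
            # Reopen the file on every pass so the corpus can be iterated more than once.
            with open(fname) as f:
                for line in f:
                    # assume there's one document per line, tokens separated by whitespace
                    yield dictionary.doc2bow(line.lower().split())

    corpus = MyCorpus()
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                          id2word=dictionary,
                                          num_topics=100,
                                          update_every=1,
                                          chunksize=10000,
                                          passes=1)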
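To see why an iterable class works where a generator does not, here is a minimal sketch in plain Python, without gensim. The names `LineCorpus` and `line_generator` and the two-line sample file are hypothetical and exist only for this illustration. The assumption is that the consumer needs to walk the corpus more than once (for example, once to count documents and once to train), which a generator cannot support because it is exhausted after its first pass.

```python
import os
import tempfile


class LineCorpus:
    """Hypothetical re-iterable corpus: every __iter__ call reopens the file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                # one document per line, tokens separated by whitespace
                yield line.lower().split()


def line_generator(path):
    """Hypothetical generator version: can be consumed only once."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.lower().split()


# Write a tiny two-document sample file (illustrative data only).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Human machine interface\nGraph of trees\n")
    demo_path = tmp.name

iterable_corpus = LineCorpus(demo_path)
generator_corpus = line_generator(demo_path)

# Count the documents seen on two consecutive passes over each corpus.
iterable_passes = [len(list(iterable_corpus)) for _ in range(2)]
generator_passes = [len(list(generator_corpus)) for _ in range(2)]

os.remove(demo_path)

print(iterable_passes)   # [2, 2]
print(generator_passes)  # [2, 0]
```

The class version yields both documents on every pass, while the generator yields nothing on the second pass.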
Additionally, Gensim has several different corpus formats readily available, which can be found in the API reference. You might consider using TextCorpus, which should fit your format nicely already:
    corpus = gensim.corpora.TextCorpus(fname)
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                          id2word=corpus.dictionary,  # TextCorpus can build the dictionary for you
                                          num_topics=100,
                                          update_every=1,
                                          chunksize=10000,
                                          passes=1)