Python IndexError when using gensim for LDA topic modeling

Updated: 2023-02-27 09:28:37

This is caused by using a corpus and a dictionary that don't share the same id-to-word mapping. It can happen if you prune your dictionary and call dictionary.compactify() after the corpus has already been converted to vectors.

A simple example will make it clear. Let's build a dictionary:

from gensim.corpora.dictionary import Dictionary
documents = [
    ['here', 'is', 'one', 'document'],
    ['here', 'is', 'another', 'document'],
]
dictionary = Dictionary()
dictionary.add_documents(documents)

This dictionary now has an entry for each of these words and maps each one to an integer id. It's useful to turn documents into vectors of (id, count) tuples, which we need to do before passing them to a model:

vectorized_corpus = [dictionary.doc2bow(doc) for doc in documents]

Sometimes you'll want to alter your dictionary. For example, you might want to remove very rare or very common words:

dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)
dictionary.compactify()

Removing words creates gaps in the dictionary's ids, and calling dictionary.compactify() re-assigns ids to fill those gaps. That means the vectorized_corpus built above no longer uses the same ids as the dictionary, and if we pass both to a model, we'll get an IndexError.
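One way to spot this mismatch before training is to compare the ids used in the vectors with the ids the dictionary currently holds. The helper below is a minimal sketch in plain Python: find_stale_ids is a hypothetical name, not part of gensim, and the sample vectors and ids are made up for illustration. With a real gensim Dictionary, the valid ids could be taken from dictionary.token2id.values().

```python
def find_stale_ids(vectorized_corpus, valid_ids):
    """Return the sorted ids used in the corpus that are missing from valid_ids.

    Hypothetical helper for illustration; it is not part of gensim.
    """
    valid = set(valid_ids)
    used = {token_id for doc in vectorized_corpus for token_id, _count in doc}
    return sorted(used - valid)


# Made-up vectors, built while the dictionary still had ids 0..4.
old_corpus = [[(0, 1), (1, 1), (3, 1)], [(0, 1), (2, 1), (4, 1)]]

# Made-up ids left after pruning and compacting down to three words.
ids_after_pruning = [0, 1, 2]

stale = find_stale_ids(old_corpus, ids_after_pruning)
print(stale)  # [3, 4]
```

An empty result does not prove the corpus is in sync: after compacting, an old id can still be in range while pointing at a different word. The check only catches ids that would fall outside the dictionary.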

Solution: build your vector representation from the dictionary after you have made your changes and called dictionary.compactify().