Updated: 2021-06-30 23:18:13
It should be a corpus represented as a "bag of words". Or, yes, lists of term counts.
The correct format is that of the corpus defined in the first tutorial on the Gensim webpage (these are really useful).
Namely, if you have a dictionary as defined in Radim's tutorial, and the following documents,
doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]
then your corpus (for use with LDA) should be an iterable (such as a list) of lists of tuples of the form (dictKey, count), where dictKey is the dictionary key (the integer id) of a term and count is the number of times it occurs in the document. This is done for you with
corpus = [dictionary.doc2bow(doc) for doc in docs]
The name doc2bow stands for "document to bag of words".
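
To make the target format concrete, here is a rough pure-Python sketch of what a bag-of-words corpus looks like for the two documents above. It does not use Gensim: token2id and to_bow are hypothetical stand-ins for the dictionary and for doc2bow, and the integer ids they assign will not necessarily match the ones a real Gensim dictionary would give you.

```python
from collections import Counter

doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]

# Hypothetical stand-in for a Gensim dictionary: maps each term to an integer id.
# Ids are assigned in order of first appearance (an assumption of this sketch).
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc, token2id):
    """Return a sorted list of (term_id, count) tuples for one tokenized document.

    Terms that are not in token2id are skipped.
    """
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# One list of (term_id, count) tuples per document.
corpus = [to_bow(doc, token2id) for doc in docs]

for bow in corpus:
    print(bow)
```

Each inner list is one document, and each tuple pairs a term id with how often that term occurs in it. A list like this has the shape that the LDA model expects as its corpus argument.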