Understanding the LDA-transformed corpus in Gensim

Updated: 2023-02-27 09:19:54

Your understanding of gensim's LDA output is correct. Keep in mind, though, that LDA[corpus] only returns the topics whose probability exceeds a certain threshold. That threshold is set when you initialise the model (the minimum_probability parameter of LdaModel).

Whether a document belongs to ONE topic is a decision you have to make yourself. LDA gives you a distribution over the topics for each document you feed into it*. You then need to decide whether a document having (for instance) 50% of a topic is enough for that document to belong to said topic.
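One way to make that decision explicit is a small rule applied to the (topic_id, probability) pairs that LDA returns for a document. The sketch below is only an illustration: the function name assign_topic and the 0.5 cut-off are assumptions made for this example, not part of gensim, and the input pairs are made up.

```python
def assign_topic(doc_topics, min_share=0.5):
    """Return the dominant topic id if its share reaches min_share, else None.

    doc_topics is a list of (topic_id, probability) pairs, the same shape
    that LDA[corpus] yields for a single document.
    min_share is a hypothetical cut-off chosen for this illustration.
    """
    if not doc_topics:
        return None
    # Pick the pair with the largest probability.
    topic_id, share = max(doc_topics, key=lambda pair: pair[1])
    return topic_id if share >= min_share else None


# Made-up distributions for two documents.
print(assign_topic([(0, 0.12), (3, 0.61), (7, 0.27)]))  # dominant topic is large enough
print(assign_topic([(0, 0.40), (3, 0.35), (7, 0.25)]))  # no topic reaches the cut-off
```

A document for which the function returns None is simply left unassigned. Whether 0.5 is a sensible cut-off depends on the number of topics and on how mixed your documents are.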

(*) Again, keep in mind that LDA[corpus] only shows the topics that exceed the threshold, not the whole distribution. You can get the whole distribution with:

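# inference() returns (gamma, sstats); gamma holds unnormalised per-document topic weights
theta, _ = lda.inference(corpus)
# divide each row by its sum so that every document's topic weights add up to 1
theta /= theta.sum(axis=1)[:, None]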