且构网

What is the best way to find the optimal number of topics for an LDA model using Gensim?

Updated: 2022-11-11 19:19:29

Although I cannot comment on Gensim in particular, I can offer some general advice on optimising your topics.

As you stated, using log likelihood is one method. Another option is to hold out a set of documents from model training, infer topics over them once the model is complete, and check whether the results make sense.
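As a rough illustration of the held-out idea, here is a minimal Python sketch. It assumes Gensim's `Dictionary`, `LdaModel` and `LdaModel.log_perplexity` (which returns a per-word likelihood bound for the documents passed to it). The helper names `split_holdout` and `heldout_bounds`, the candidate topic counts, and the `passes` value are all made up for the example and would need tuning on a real corpus.

```python
import random


def split_holdout(docs, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of docs and split it into (train, heldout) lists."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_heldout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_heldout:], shuffled[:n_heldout]


def heldout_bounds(train_docs, heldout_docs, candidate_ks, passes=10, seed=0):
    """Return {k: per-word likelihood bound on the held-out documents}.

    Each document is a list of string tokens. Values closer to zero
    indicate a better fit to the held-out documents.
    """
    # Imported here so that split_holdout works without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(train_docs)
    train_bow = [dictionary.doc2bow(doc) for doc in train_docs]
    # doc2bow drops words that were never seen in training; skip documents
    # that end up empty so the per-word bound stays well defined.
    heldout_bow = [
        bow for bow in (dictionary.doc2bow(doc) for doc in heldout_docs) if bow
    ]

    bounds = {}
    for k in candidate_ks:
        lda = LdaModel(
            train_bow,
            num_topics=k,
            id2word=dictionary,
            passes=passes,
            random_state=seed,
        )
        bounds[k] = lda.log_perplexity(heldout_bow)
    return bounds
```

With a tokenized corpus `docs`, something like `train, heldout = split_holdout(docs)` followed by `heldout_bounds(train, heldout, [5, 10, 20, 40])` gives one number per candidate topic count. The numbers only narrow the choice; reading the top words of each topic is still needed to check that they make sense.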

A completely different method you could try is a hierarchical Dirichlet process, which can find the number of topics in the corpus dynamically, without it being specified in advance.
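For what it is worth, Gensim does ship an `HdpModel` class, so a hedged sketch of this approach might look like the following. Note that the model is fitted with a fixed upper bound on the number of topics, so the sketch counts how many topics actually receive noticeable weight in at least one document. That counting rule, the `min_weight` threshold, and the helper names `count_active_topics` and `estimate_topic_count` are assumptions made for this example, not something Gensim or the HDP paper prescribes.

```python
def count_active_topics(doc_topics, min_weight=0.1):
    """Count distinct topic ids whose weight reaches min_weight in any document.

    doc_topics is a list with one entry per document, each entry being a
    list of (topic_id, weight) pairs.
    """
    active = {
        topic_id
        for topics in doc_topics
        for topic_id, weight in topics
        if weight >= min_weight
    }
    return len(active)


def estimate_topic_count(tokenized_docs, min_weight=0.1, seed=0):
    """Fit an HDP model and return a heuristic count of the topics in use."""
    # Imported here so that count_active_topics works without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    hdp = HdpModel(corpus, id2word=dictionary, random_state=seed)
    # Indexing the model with a bag-of-words document returns its
    # (topic_id, weight) pairs.
    doc_topics = [hdp[bow] for bow in corpus]
    return count_active_topics(doc_topics, min_weight)
```

The result is sensitive to `min_weight` and to the size of the corpus, so it is better treated as a starting point for the number of LDA topics than as a final answer.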

There are many papers on how best to specify parameters and evaluate your topic model. Depending on your experience level, these may or may not be useful to you:

Rethinking LDA: Why Priors Matter, Wallach, H.M., Mimno, D. and McCallum, A.

Evaluation Methods for Topic Models, Wallach, H.M., Murray, I., Salakhutdinov, R. and Mimno, D.

Also, here is the paper about the hierarchical Dirichlet process:

Hierarchical Dirichlet Processes, Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.