Understanding the parameters in Gensim's LDA model

I wonder if you have seen this page?

Either way, let me explain a few things for you. The number of documents you are using is small for this method (it works much better when trained on a data source the size of Wikipedia). The results will therefore be rather crude, and you have to be aware of that. This is why you should not aim for a large number of topics (you chose 10, which could perhaps sensibly go up to 20 in your case).
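
For concreteness, here is a minimal sketch of the kind of setup being discussed. The toy documents and variable names are illustrative assumptions, not your actual data:

    from gensim import corpora
    from gensim.models import LdaModel

    # Toy tokenized documents; a real corpus should be far larger for
    # the method to give good results.
    texts = [
        ["human", "machine", "interface", "computer"],
        ["survey", "user", "computer", "system", "response", "time"],
        ["graph", "trees", "minors", "survey"],
    ]

    dictionary = corpora.Dictionary(texts)           # word <-> id mapping
    corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

    # With a small corpus, keep the topic count modest (10, maybe up to 20).
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)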

Other parameters:

  • random_state - this serves as a seed (in case you want to repeat the training process exactly)

  • chunksize - number of documents to consider at once (affects the memory consumption)

  • update_every - update the model every update_every chunks, i.e. every update_every * chunksize documents (essentially, this is for memory consumption optimization)

  • passes - how many times the algorithm is supposed to pass over the whole corpus

  • alpha - quoting the docs:

    can be set to an explicit array = prior of your choice. It also supports special values of 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.

  • per_word_topics - setting this to True allows for extraction of the most likely topics given a word. The training process is set up in such a way that every word will be assigned to a topic; otherwise, words that are not indicative are going to be omitted. phi_value is another parameter that steers this process - it is the threshold for whether a word is treated as indicative or not (see the sketch after this list).
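
Putting the parameters above together, here is a hedged sketch rather than a definitive recipe; the parameter values and the toy corpus are placeholders (note that alpha='auto' works with the plain LdaModel):

    from gensim import corpora
    from gensim.models import LdaModel

    texts = [
        ["human", "machine", "interface", "computer"],
        ["survey", "user", "computer", "system", "response", "time"],
        ["graph", "trees", "minors", "survey"],
    ]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=10,
        random_state=42,       # seed, so the training run is repeatable
        chunksize=2000,        # documents considered at once
        update_every=1,        # update the model after every chunk
        passes=10,             # full passes over the whole corpus
        alpha="auto",          # learn an asymmetric prior from the data
        per_word_topics=True,  # keep per-word topic assignments
    )

    # With per_word_topics=True, get_document_topics can also return the
    # most likely topics per word and the corresponding phi values.
    doc_topics, word_topics, word_phis = lda.get_document_topics(
        corpus[0], per_word_topics=True
    )
    print(doc_topics)   # [(topic_id, probability), ...]
    print(word_topics)  # [(word_id, [topic_id, ...]), ...]
    print(word_phis)    # [(word_id, [(topic_id, phi_value), ...]), ...]

The minimum_phi_value argument of get_document_topics corresponds to the threshold mentioned above for deciding whether a word counts as indicative.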

    The training process parameters are described in particular detail in M. Hoffman et al., Online Learning for Latent Dirichlet Allocation.

    For memory optimization of the training process or the model, see this blog post.