Gensim doc2vec: poor training performance with file streaming (corpus_file)

Updated: 2023-02-20 08:11:10

Most users should not be calling train() more than once in their own loop, where they try to manage the alpha & iterations themselves. It is too easy to do it wrong.
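
As a point of reference, here is a minimal sketch of the usually recommended pattern (the tiny corpus and parameter values are placeholders, not the asker's data): build the vocabulary once, then make a single train() call and let gensim manage the alpha decay and epoch looping internally.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus; substitute your own iterable of TaggedDocument objects.
documents = [
    TaggedDocument(words=["human", "interface", "computer"], tags=[0]),
    TaggedDocument(words=["graph", "trees", "survey"], tags=[1]),
]

model = Doc2Vec(vector_size=100, min_count=1, epochs=20)
model.build_vocab(documents)

# A single train() call: gensim decays alpha from `alpha` down to `min_alpha`
# across all epochs, so no manual learning-rate management is needed.
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
```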

Specifically, your code where you call train() in a loop is doing it wrong. Whatever online source or tutorial you modeled this code on, you should stop consulting it, as it's misleading or outdated. (The notebooks bundled with gensim are better examples on which to base any code.)

Even more specifically: your looping code is actually doing 100 passes over the data: your 20 outer loops times the default d2v.iter of 5 passes per train() call. And your first train() call is smoothly decaying the effective alpha from 0.025 to 0.00025, a 100x reduction. But then your next train() call uses a fixed alpha of 0.0248 for 5 passes, then 0.0246, and so on, until your last loop does 5 passes at alpha=0.0212, not even down to 80% of the starting value. That is, the lowest alpha will have been reached early in your training, at the end of the very first train() call.
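
The asker's exact code isn't reproduced in this answer, but the numbers above correspond to an anti-pattern along these lines (a hypothetical reconstruction, assuming an older, pre-4.0 gensim where the model's default iter is 5; `train_corpus` is a placeholder for the asker's iterable):

```python
from gensim.models.doc2vec import Doc2Vec

# Hypothetical reconstruction of the problematic loop; not the asker's exact code.
d2v = Doc2Vec(vector_size=100, min_count=2, alpha=0.025, min_alpha=0.00025)
d2v.build_vocab(train_corpus)

for epoch in range(20):                # 20 outer loops x 5 internal passes = 100 passes
    d2v.train(train_corpus,
              total_examples=d2v.corpus_count,
              epochs=d2v.iter)         # older gensim: default iter of 5 per call
    # The first call decays alpha smoothly from 0.025 down to 0.00025; after that,
    # these two lines pin each later call to one fixed alpha: 0.0248, 0.0246, ..., 0.0212.
    d2v.alpha -= 0.0002
    d2v.min_alpha = d2v.alpha
```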

Call the two options exactly the same way, except for how the corpus is supplied: via corpus_file instead of an iterable corpus.
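
For instance, the two setups might look like this, differing only in how the corpus is passed (the file path, the tagged_docs iterable, and the hyperparameters are placeholders):

```python
from gensim.models.doc2vec import Doc2Vec

common = dict(vector_size=100, window=5, min_count=2, epochs=20, workers=4)

# Option 1: stream from a file in LineSentence-style format
# (one pre-tokenized document per line; tags are taken from line numbers).
model_a = Doc2Vec(corpus_file="corpus.txt", **common)

# Option 2: supply an in-memory iterable of TaggedDocument objects.
model_b = Doc2Vec(documents=tagged_docs, **common)  # tagged_docs: placeholder iterable
```

Passing the corpus to the constructor like this builds the vocabulary and runs all training epochs in one step, with alpha decay handled internally in both cases.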

You should get similar results from both corpus forms. (If you had a reproducible test case where the same corpus gets very different-quality results, and there wasn't some other error, that could be worth reporting to gensim as a bug.)

If the results for both aren't as good as when you were managing train() and alpha wrongly, it would likely be because you aren't doing a comparable amount of total training.
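
For example, if the manual loop effectively made about 100 passes over the data, a fair comparison would raise the epoch count of the single, properly managed run to match (a sketch with placeholder values; the exact number of epochs worth using is a judgment call):

```python
from gensim.models.doc2vec import Doc2Vec

# Roughly match the ~100 total passes the manual loop made, with alpha decay
# handled by gensim; the constructor builds the vocab and trains in one step.
model = Doc2Vec(corpus_file="corpus.txt", vector_size=100, min_count=2,
                epochs=100, workers=4)
```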