


Updated: 2021-08-08 02:34:44



If you look at the interface for sklearn stages, the methods are of the form:

fit(X, y, other_stuff)



That is, they work on the entire dataset, and can't do incremental learning on streams (or chunked streams) of data.
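As a rough illustration of what "work on the entire dataset" means, the sketch below fits a small pipeline and then repeats the same steps by hand. The data, the stage names, and the choice of StandardScaler and LinearRegression are assumptions made for the example, not taken from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for this example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# The library pipeline: a single fit call over the whole dataset.
pipe = Pipeline([("SCALE", StandardScaler()), ("REG", LinearRegression())])
pipe.fit(X, y)

# Roughly the same thing by hand: each stage finishes on the full
# array before the next stage starts.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X, y)
reg = LinearRegression().fit(X_scaled, y)

# Both routes should give the same predictions.
same = np.allclose(pipe.predict(X), reg.predict(scaler.transform(X)))
print(same)
```

Nothing in this flow hands partial results from one stage to the next, which is why the stages cannot overlap in time.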


Moreover, fundamentally, some of the algorithms are not amenable to this. Consider for example your stage

("SCALE", Normalizer()),


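Presumably, this normalizes using the mean and/or variance. (Strictly speaking, sklearn's Normalizer rescales each sample to unit norm on its own and learns nothing from the data; StandardScaler is the kind of stage that depends on the dataset's mean and variance.) Without seeing the entire dataset, how can such a stage know these things? It must therefore wait for the entire input before operating, and hence can't run in parallel with the stages after it. Most (if not nearly all) stages are like that.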
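To make the point about dataset-wide statistics concrete, here is a minimal sketch. It uses StandardScaler as a stand-in for a stage that really does depend on the mean and variance (my choice for illustration), and the data is synthetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for this example.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

full = StandardScaler().fit(X)           # statistics from the whole dataset
partial = StandardScaler().fit(X[:100])  # statistics from the first chunk only

# The learned means differ, so the two scalers transform the same rows differently.
mean_gap = np.abs(full.mean_ - partial.mean_).max()
rows_agree = np.allclose(full.transform(X[:5]), partial.transform(X[:5]))
print(mean_gap, rows_agree)
```

A scaler that has only seen the first chunk hands different values to the next stage than one that has seen everything, so the downstream stage cannot safely start early.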


However, in some cases, you can still use multiple cores with sklearn.

  1. Some stages have an n_jobs parameter. Stages like this run sequentially relative to the other stages, but can parallelize the work within.

  2. In some cases you can roll your own (approximate) parallel versions of other stages. E.g., given any regressor stage, you can wrap it in a stage that randomly chunks your data into n parts, learns the parts in parallel, and outputs a regressor that is the average of all the regressors. YMMV.
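The roll-your-own idea could be sketched roughly as follows. ChunkAveragingRegressor and its parameters are hypothetical names invented here, "average of all the regressors" is read as averaging their predictions, and joblib is assumed to be available (it ships as a dependency of sklearn). Treat it as an approximation under those assumptions, not a drop-in replacement for fitting on the full data.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import Ridge


def _fit_one(estimator, X, y):
    # Fit an independent copy so the chunks do not share state.
    return clone(estimator).fit(X, y)


class ChunkAveragingRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper: one copy of `estimator` per random chunk,
    predictions averaged across the copies."""

    def __init__(self, estimator, n_chunks=4, n_jobs=None, random_state=None):
        self.estimator = estimator
        self.n_chunks = n_chunks
        self.n_jobs = n_jobs
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.random_state)
        # Shuffle the row indices, then split them into n_chunks parts.
        chunks = np.array_split(rng.permutation(len(X)), self.n_chunks)
        # Fit the per-chunk copies in parallel.
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_one)(self.estimator, X[idx], y[idx]) for idx in chunks
        )
        return self

    def predict(self, X):
        # Average the predictions of all fitted copies.
        return np.mean([est.predict(X) for est in self.estimators_], axis=0)


# Usage on synthetic data (made up for this example).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=400)

model = ChunkAveragingRegressor(Ridge(alpha=1.0), n_chunks=4, n_jobs=2, random_state=0)
preds = model.fit(X, y).predict(X)
```

Because each copy only sees a fraction of the rows, the averaged model is generally not identical to one trained on everything, which is the "YMMV" part. Since the wrapper exposes fit and predict, it should be usable as the final stage of a Pipeline.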