且构网


Is it possible to train a sklearn model (e.g. SVM) incrementally?

Updated: 2023-12-02 17:37:34

There is no need to go to the other extreme and train instance by instance, which would also be inefficient. What you are looking for is called incremental or online learning. In scikit-learn it is available through SGDClassifier, which covers linear SVM and logistic regression and provides a partial_fit method.

Here is a quick example with dummy data:

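import numpy as np
from sklearn import linear_model

# First batch of dummy data: two features, two classes
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3)

# First call: classes must list every label the model will ever see
clf.partial_fit(X, Y, classes=np.unique(Y))

# Later batch: update the same model; classes can now be omitted
X_new = np.array([[-1, -1], [2, 0], [0, 1], [1, 1]])
Y_new = np.array([1, 1, 2, 1])
clf.partial_fit(X_new, Y_new)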
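The same pattern extends to data that arrives in chunks. The following is only a sketch under stated assumptions: the two Gaussian blobs, the chunk size of 100, and the names stream_clf, X_all and y_all are made up for illustration and do not come from the original answer.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic, well-separated two-class data (an assumption for illustration)
rng = np.random.default_rng(0)
n_per_class = 500
X_all = np.vstack([
    rng.normal(loc=-3.0, scale=1.0, size=(n_per_class, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(n_per_class, 2)),
])
y_all = np.array([1] * n_per_class + [2] * n_per_class)

# Shuffle so that every chunk contains a mix of both classes
order = rng.permutation(len(y_all))
X_all, y_all = X_all[order], y_all[order]

stream_clf = SGDClassifier(random_state=0)
all_classes = np.array([1, 2])
chunk_size = 100

# Feed the data chunk by chunk, as if it were arriving over time
for start in range(0, len(y_all), chunk_size):
    X_chunk = X_all[start:start + chunk_size]
    y_chunk = y_all[start:start + chunk_size]
    # Passing the same classes on every call is allowed and keeps the loop simple
    stream_clf.partial_fit(X_chunk, y_chunk, classes=all_classes)

accuracy = stream_clf.score(X_all, y_all)
print(f"training accuracy after streaming: {accuracy:.3f}")
```

In a real setting each chunk would come from a file, a database cursor, or a message queue rather than from slicing an in-memory array.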

The default values of the loss and penalty arguments ('hinge' and 'l2' respectively) give a linear SVM, so the above code essentially fits a linear SVM classifier with L2 regularization incrementally. (LinearSVC uses the same 'l2' penalty but defaults to the squared hinge loss.) These settings can of course be changed; check the docs for more details.

The classes argument is required in the first call and should contain all the classes in your problem, even if some of them do not appear in a particular partial fit. It can be omitted in subsequent calls to partial_fit; again, see the linked documentation for more details.
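A small sketch of this behaviour, using made-up three-class data (the arrays and the names clf3 and fresh are assumptions for illustration): the first batch lacks class 2, yet the model is told about it up front, and a first call without classes is rejected.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Every label the problem can contain, declared once up front
all_classes = np.array([0, 1, 2])

# First batch: class 2 does not appear here
X_first = np.array([[-2.0, -2.0], [-1.5, -2.5], [2.0, 2.0], [2.5, 1.5]])
y_first = np.array([0, 0, 1, 1])

clf3 = SGDClassifier(random_state=0)
clf3.partial_fit(X_first, y_first, classes=all_classes)

# Second batch: class 2 shows up; classes can be omitted now
X_second = np.array([[2.0, -2.0], [2.5, -1.5], [-2.0, -2.0]])
y_second = np.array([2, 2, 0])
clf3.partial_fit(X_second, y_second)

print(clf3.classes_)     # the three declared classes
print(clf3.coef_.shape)  # one weight vector per class

# Omitting classes on the very first call raises a ValueError
fresh = SGDClassifier(random_state=0)
try:
    fresh.partial_fit(X_first, y_first)
    first_call_error = None
except ValueError as exc:
    first_call_error = str(exc)
print(first_call_error)
```

Declaring the full label set at the start is what lets the model allocate a weight vector for a class before it has seen any example of it.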