
XGBoost predict_proba inference is slow

Updated: 2023-12-02 14:38:58

"Why is xgboost so slow?": XGBClassifier() is the scikit-learn API for XGBoost (see e.g. https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier for more details). If you call the function directly (not through an API) it will be faster. To compare the performance of the two functions it makes sense to call each function directly, instead of calling one function directly and one function through an API. Here is an example:

# benchmark_xgboost_vs_sklearn.py
# Adapted from `xgboost_test.py` by Jacob Schreiber 
# (https://gist.github.com/jmschrei/6b447aada61d631544cd)

"""
Benchmarking scripts for XGBoost versus sklearn (time and accuracy)
"""

import time
import random
import numpy as np
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier

random.seed(0)
np.random.seed(0)

def make_dataset(n=500, d=10, c=2, z=2):
    """
    Make a dataset with n samples per class, d dimensions and c classes,
    with the class means separated by z in each dimension, making each
    feature equally informative.
    """

    # Generate our data and our labels
    X = np.concatenate([np.random.randn(n, d) + z*i for i in range(c)])
    y = np.concatenate([np.ones(n) * i for i in range(c)])

    # Generate a random indexing
    idx = np.arange(n*c)
    np.random.shuffle(idx)

    # Randomize the dataset, preserving data-label pairing
    X = X[idx]
    y = y[idx]

    # Return x_train, x_test, y_train, y_test
    return X[::2], X[1::2], y[::2], y[1::2]

def main():
    """
    Run sklearn's GradientBoostingClassifier, then XGBoost through the
    sklearn XGBClassifier wrapper, then XGBoost's native xgb.train interface.
    """

    # Generate the dataset (10 samples per class, well-separated classes)
    X_train, X_test, y_train, y_test = make_dataset(10, z=100)
    n_estimators = 5
    max_depth = 5
    learning_rate = 0.17

    # sklearn first
    tic = time.time()
    clf = GradientBoostingClassifier(n_estimators=n_estimators,
        max_depth=max_depth, learning_rate=learning_rate)
    clf.fit(X_train, y_train)
    print("SKLearn GBClassifier: {}s".format(time.time() - tic))
    print("Acc: {}".format(clf.score(X_test, y_test)))
    print(y_test.sum())          # debug: number of positive test labels
    print(clf.predict(X_test))   # debug: predicted labels

    # Convert the data to DMatrix for xgboost
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest  = xgb.DMatrix(X_test, label=y_test)
    # Loop over multiple thread counts for xgboost
    for threads in (1, 2, 4):
        print("{} threads:".format(threads))

        # XGBoost through the sklearn wrapper
        tic = time.time()
        clf = xgb.XGBClassifier(n_estimators=n_estimators, max_depth=max_depth,
            learning_rate=learning_rate, n_jobs=threads)
        clf.fit(X_train, y_train)
        print("SKLearn XGBoost API Time: {}s".format(time.time() - tic))
        preds = clf.predict(X_test)
        acc = 1. - (np.abs(preds - y_test).sum() / y_test.shape[0])
        print("Acc: {}".format(acc))

        # XGBoost's native interface, on the pre-built DMatrix
        tic = time.time()
        param = {
            'max_depth': max_depth,
            'eta': learning_rate,       # match the wrapper's learning rate
            'verbosity': 0,             # 'silent' was removed in recent XGBoost
            'objective': 'binary:logistic',
            'nthread': threads,
        }
        bst = xgb.train(param, dtrain, n_estimators,
            evals=[(dtest, 'eval'), (dtrain, 'train')])
        print("XGBoost (no wrapper) Time: {}s".format(time.time() - tic))
        preds = np.round(bst.predict(dtest))
        acc = 1. - (np.abs(preds - y_test).sum() / y_test.shape[0])
        print("Acc: {}".format(acc))

if __name__ == '__main__':
    main()

Summary of results:

sklearn.ensemble.GradientBoostingClassifier()

  • Time: 0.003237009048461914s
  • Accuracy: 1.0

sklearn xgboost API wrapper XGBClassifier()

  • Time: 0.3436141014099121s
  • Accuracy: 1.0

XGBoost (no wrapper) xgb.train()

  • Time: 0.0028612613677978516s
  • Accuracy: 1.0