
Predict classes or class probabilities?

Updated: 2023-08-26 20:48:40

In principle & in theory, hard & soft classification (i.e. returning classes & probabilities respectively) are different approaches, each one with its own merits & downsides. Consider for example the following, from the paper Hard or Soft Classification? Large-margin Unified Machines:

Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers while some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on estimated probabilities. In contrast, hard classifiers directly target on the classification decision boundary without producing the probability estimation. These two types of classifiers are based on different philosophies and each has its own merits.

That said, in practice, most of the classifiers used today, including Random Forest (the only exception I can think of is the SVM family) are in fact soft classifiers: what they actually produce underneath is a probability-like measure, which subsequently, combined with an implicit threshold (usually 0.5 by default in the binary case), gives a hard class membership like 0/1 or True/False.
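
To make this concrete, here is a minimal sketch (scikit-learn on synthetic data, my own illustration rather than anything H2O-specific) showing that a Random Forest's hard predictions are just its probabilities passed through the implicit 0.5 threshold:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X, y)

    proba = clf.predict_proba(X)[:, 1]   # probability-like score for class 1
    hard = clf.predict(X)                # hard 0/1 class membership

    # the hard labels are the scores passed through the implicit 0.5 threshold
    # (strict > matches argmax, which resolves an exact 0.5 tie to class 0)
    assert np.array_equal(hard, (proba > 0.5).astype(int))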

What is the right way to get the classified prediction result?

For starters, it is always possible to go from probabilities to hard classes, but the opposite is not true.

Generally speaking, and given the fact that your classifier is in fact a soft one, getting just the end hard classifications (True/False) gives a "black box" flavor to the process, which in principle should be undesirable; handling directly the produced probabilities, and (important!) controlling explicitly the decision threshold should be the preferable way here. According to my experience, these are subtleties that are often lost to new practitioners; consider for example the following, from the Cross Validated thread Classification probability threshold:
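
As a sketch of what handling the probabilities and controlling the threshold explicitly looks like (the 0.35 below is an arbitrary illustrative value, not a recommendation):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    proba = clf.predict_proba(X)[:, 1]   # work with the probabilities directly
    threshold = 0.35                     # an explicit decision-component choice
    hard = (proba >= threshold).astype(int)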

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Apart from "soft" arguments (pun unintended) like the above, there are cases where you need to handle directly the underlying probabilities and thresholds, i.e. cases where the default threshold of 0.5 in binary classification will lead you astray, most notably when your classes are imbalanced; see my answer in High AUC but bad predictions with imbalanced data (and the links therein) for a concrete example of such a case.
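
For a rough feel of the imbalanced case (a sketch on synthetic 95/5 data, not the linked example; the threshold values are arbitrary), notice how precision and recall for the minority class shift as the threshold moves away from 0.5:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score

    # synthetic data with a 95/5 class imbalance
    X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    proba = clf.predict_proba(X)[:, 1]

    for t in (0.5, 0.25, 0.1):
        pred = (proba >= t).astype(int)
        print(t, precision_score(y, pred), recall_score(y, pred))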

To be honest, I am rather surprised by the behavior of H2O you report (I haven't used it personally), i.e. that the kind of output is affected by the representation of the input; this should not be the case, and if it indeed is, we may have an issue of bad design. Compare for example the Random Forest classifier in scikit-learn, which includes two different methods, predict and predict_proba, to get the hard classifications and the underlying probabilities respectively (and, checking the docs, it is apparent that the output of predict is based on the probability estimates, which have already been computed).
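
As a quick sanity check of that last parenthetical (a sketch on synthetic data, not H2O code):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                               random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X, y)

    hard = clf.predict(X)            # method 1: hard class labels
    proba = clf.predict_proba(X)     # method 2: underlying probabilities

    # predict() agrees with taking the most probable class from predict_proba()
    assert np.array_equal(hard, clf.classes_[np.argmax(proba, axis=1)])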

If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?

There is nothing new here in principle, apart from the fact that a simple threshold is no longer meaningful; again, from the Random Forest predict docs in scikit-learn:

the predicted class is the one with highest mean probability estimate

That is, for 3 classes (0, 1, 2), you get an estimate of [p0, p1, p2] (with elements summing up to one, as per the rules of probability), and the predicted class is the one with the highest probability, e.g. class #1 for the case of [0.12, 0.60, 0.28]. Here is a reproducible example with the 3-class iris dataset (it's for the GBM algorithm and in R, but the rationale is the same).
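
Since that example is in R, here is a hedged Python analogue (with scikit-learn's GradientBoostingClassifier standing in for GBM) showing the same rationale:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = load_iris(return_X_y=True)
    clf = GradientBoostingClassifier(random_state=0).fit(X, y)

    proba = clf.predict_proba(X[:3])   # one [p0, p1, p2] row per sample
    print(proba)
    print(proba.sum(axis=1))           # each row sums to one
    print(clf.predict(X[:3]))          # the class with the highest probability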