How to plot a ROC curve using the ROCR package in R, *with only a classification contingency table*

Updated: 2022-11-28 22:57:30

You cannot generate the full ROC curve from a single contingency table, because a contingency table provides only one sensitivity/specificity pair (for whichever prediction cutoff was used to build the table).
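To make the "single pair" point concrete, here is a minimal sketch that computes sensitivity and specificity from one 2x2 table. The counts are copied from the 0.5-cutoff table that appears later in this answer. The names `tab`, `sensitivity` and `specificity` are introduced here for illustration and are not part of ROCR.

```r
# One contingency table: rows = actual outcome, columns = predicted class.
# Counts copied from the 0.5-cutoff table in this answer.
tab <- matrix(c(86, 29, 14, 21), nrow = 2,
              dimnames = list(actual = c("0", "1"),
                              predicted = c("FALSE", "TRUE")))

sensitivity <- tab["1", "TRUE"] / sum(tab["1", ])    # TP / (TP + FN)
specificity <- tab["0", "FALSE"] / sum(tab["0", ])   # TN / (TN + FP)

# A single table yields a single point (FPR, TPR) on the ROC plane.
c(fpr = 1 - specificity, tpr = sensitivity)
```

For these counts the point is (0.14, 0.42). Nothing in the table says where the curve goes on either side of it.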

If you had many contingency tables generated with different cutoffs, you could approximate the ROC curve (essentially as a linear interpolation between the sensitivity/specificity values in your tables). As an example, consider predicting whether a flower is versicolor in the iris dataset using logistic regression:

iris$isv <- as.numeric(iris$Species == "versicolor")
mod <- glm(isv~Sepal.Length+Sepal.Width, data=iris, family="binomial")

We could use the standard ROCR code to compute the ROC curve for this model:

library(ROCR)
pred1 <- prediction(predict(mod), iris$isv)
perf1 <- performance(pred1,"tpr","fpr")
plot(perf1)

Now let's assume that instead of mod, all we have is contingency tables for a number of prediction cutoff values:

# One table per cutoff 0, 0.1, ..., 1: rows = actual outcome, columns = predicted class
tables <- lapply(seq(0, 1, .1), function(x) {
  table(iris$isv, factor(predict(mod, type="response") >= x, levels=c(FALSE, TRUE)))
})

# Predict TRUE if predicted probability at least 0
tables[[1]]
#     FALSE TRUE
#   0     0  100
#   1     0   50

# Predict TRUE if predicted probability at least 0.5
tables[[6]]
#     FALSE TRUE
#   0    86   14
#   1    29   21

# Predict TRUE if predicted probability at least 1
tables[[11]]
#     FALSE TRUE
#   0   100    0
#   1    50    0
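
Each table in the list is one point on the ROC curve, so the points can also be read off directly. The following is a sketch under stated assumptions, not part of the original answer: it rebuilds the tables so that it runs on its own, and it plots with base graphics rather than ROCR. The names `iris2`, `prob`, `cutoffs`, `roc.tables` and `roc.points` are introduced here for illustration.

```r
# Rebuild the model and the per-cutoff tables so this block is self-contained.
iris2 <- iris  # illustrative copy, leaves the built-in dataset untouched
iris2$isv <- as.numeric(iris2$Species == "versicolor")
mod <- glm(isv ~ Sepal.Length + Sepal.Width, data = iris2, family = "binomial")
prob <- predict(mod, type = "response")
cutoffs <- seq(0, 1, 0.1)
roc.tables <- lapply(cutoffs, function(x) {
  table(iris2$isv, factor(prob >= x, levels = c(FALSE, TRUE)))
})

# One (FPR, TPR) point per table: column 2 holds the TRUE predictions.
roc.points <- do.call(rbind, lapply(roc.tables, function(tab) {
  data.frame(fpr = tab[1, 2] / sum(tab[1, ]),   # FP / actual negatives
             tpr = tab[2, 2] / sum(tab[2, ]))   # TP / actual positives
}))
roc.points$cutoff <- cutoffs

# Sort from (0, 0) to (1, 1) and join the points with straight lines.
roc.points <- roc.points[order(roc.points$fpr, roc.points$tpr), ]
plot(roc.points$fpr, roc.points$tpr, type = "b",
     xlab = "False positive rate", ylab = "True positive rate")
```

This should give the same piecewise-linear shape that the fake-prediction approach produces with ROCR, without leaving base R.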

From one table to the next, some predictions change from TRUE to FALSE because the cutoff increased. By comparing column 1 of successive tables, we can determine how many of these are true negative predictions and how many are false negative predictions. Iterating through our ordered list of contingency tables, we can create fake predicted value/outcome pairs to pass to ROCR, ensuring that we match the sensitivity/specificity of each contingency table.

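# For each pair of successive tables, count the observations whose prediction
# flipped from TRUE to FALSE, and emit one fake (prediction, outcome) row for each.
fake.info <- do.call(rbind, lapply(seq_len(length(tables)-1), function(idx) {
  # Growth of column 1 (predicted FALSE) between cutoff idx and cutoff idx+1
  true.neg <- tables[[idx+1]][1,1] - tables[[idx]][1,1]   # actual outcome 0
  false.neg <- tables[[idx+1]][2,1] - tables[[idx]][2,1]  # actual outcome 1
  if (true.neg <= 0 && false.neg <= 0) {
    return(NULL)  # nothing changed between these two cutoffs
  } else {
    # The table index serves as the fake score: a higher index means a higher predicted probability
    return(data.frame(fake.pred=idx,
                      outcome=rep(c(0, 1), times=c(true.neg, false.neg))))
  }
}))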
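
As a sanity check on this construction, the sketch below applies the same differencing logic to only three of the tables shown above (cutoffs 0, 0.5 and 1), typed in by hand, and confirms that thresholding the fake scores reproduces the 0.5-cutoff table. The names `mk`, `small.tables`, `small.fake` and `rebuilt` are introduced here for illustration.

```r
# Helper to type in a 2x2 table column by column (predicted FALSE first).
mk <- function(counts) {
  matrix(counts, nrow = 2, dimnames = list(c("0", "1"), c("FALSE", "TRUE")))
}
small.tables <- list(mk(c(0, 0, 100, 50)),    # cutoff 0
                     mk(c(86, 29, 14, 21)),   # cutoff 0.5
                     mk(c(100, 50, 0, 0)))    # cutoff 1

# Same differencing as above: rows that moved into column 1 get score idx.
small.fake <- do.call(rbind, lapply(seq_len(length(small.tables) - 1), function(idx) {
  true.neg  <- small.tables[[idx + 1]][1, 1] - small.tables[[idx]][1, 1]
  false.neg <- small.tables[[idx + 1]][2, 1] - small.tables[[idx]][2, 1]
  data.frame(fake.pred = rep(idx, true.neg + false.neg),
             outcome = rep(c(0, 1), times = c(true.neg, false.neg)))
}))

# Predicting TRUE when the fake score is at least 2 should give back
# the 0.5-cutoff table.
rebuilt <- table(small.fake$outcome,
                 factor(small.fake$fake.pred >= 2, levels = c(FALSE, TRUE)))
rebuilt
```

All 150 observations are recovered here because the first table predicts TRUE for everything and the last predicts FALSE for everything. If your tables do not span that full range, observations that never change prediction are not represented in the fake pairs.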

Now we can pass the faked predictions to ROCR as usual:

pred2 <- prediction(fake.info$fake.pred, fake.info$outcome)
perf2 <- performance(pred2,"tpr","fpr")
plot(perf2)

Basically, what we have done is linearly interpolate between the points that we do have on the ROC curve. If you had contingency tables for many cutoffs, you could approximate the true ROC curve more closely. If you don't have a wide range of cutoffs, you can't hope to reproduce the full ROC curve accurately.
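
To illustrate what the interpolation implies, here is a small sketch using base R's `approx()` on just three known points, taken from the tables at cutoffs 1, 0.5 and 0 shown above. The names `fpr`, `tpr` and `tpr.at.half` are introduced here for illustration. The interpolated value is an estimate of the curve between cutoffs, not something measured from the model.

```r
# ROC points (FPR, TPR) from the tables at cutoffs 1, 0.5 and 0.
fpr <- c(0, 14 / 100, 1)
tpr <- c(0, 21 / 50, 1)

# approx() interpolates linearly between the known points, which is how
# the reconstructed curve behaves between two cutoffs.
tpr.at.half <- approx(x = fpr, y = tpr, xout = 0.5)$y
tpr.at.half
```

With only these three points, the estimated true positive rate at a false positive rate of 0.5 is about 0.66. Adding tables for more cutoffs replaces such guesses with observed points.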