且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

R:在 glm() 中的逻辑回归中预测 (0,1)

更新时间:2021-12-29 10:01:35

我还没有尝试挖掘我基于 Gelman 和 Hill (2006) 编写的预测代码,我似乎记得使用过模拟.我仍然打算这样做.在我有限的经验中,你的问题的一个方面似乎是独一无二的,因为我习惯于预测单个观察(在这种情况下,单个学生参加单个测试).但是,您似乎想要预测两组预测之间的差异.换句话说,您想预测如果提供 5 门简单考试而不是 5 门硬考试,将会有多少学生通过.

I have not yet tried to dig out my code for prediction that I wrote based on Gelman and Hill (2006) who, I seem to recall used simulation. I still intend to do that. One aspect of your question that seemed unique in my limited experience was that I was accustomed to predicting for a single observation (in this case a single student taking a single test). You, however, seem to want to predict a difference between two sets of predictions. In other words, you want to predict how many more students will pass if given a set of 5 easy exams rather than a set of 5 hard exams.

我不确定 Gelman 和 Hill (2006) 是否涵盖了这一点.您似乎也想用常客方法来做到这一点.

I am not sure whether Gelman and Hill (2006) covered that. You also seem to want to do this with a frequentist approach.

我在想,如果您可以对单个观察进行预测,以便您对每个观察都有一个置信区间,那么也许您可以估计每个组内通过的加权平均概率并减去两个加权平均值.Delta 方法可用于估计加权平均值及其差异的置信区间.

I am thinking that if you can predict for a single observation, so that you have a confidence interval for each observation, then perhaps you can estimate a weighted average probability of passing within each group and subtract the two weighted averages. The delta method could be used to estimate a confidence interval on the weighted averages and on their difference.

可能必须假设预测观测值之间的协方差为 0 才能实施该方法.

Covariance among predicted observations might have to be assumed to be 0 to implement that approach.

如果假设协方差为 0 不令人满意,那么也许贝叶斯方法会更好.同样,我只熟悉预测单个观察.使用贝叶斯方法,我通过包括自变量而不是因变量来预测单个观察,以便预测观察.我想你可以在同一个贝叶斯运行中预测每个观察(预测每个学生的高和低).每组通过测试的加权平均值和加权平均值的差异是派生参数,我怀疑可以直接包含在贝叶斯逻辑回归的代码中.然后,您将对通过每组测试的概率以及通过每组测试的概率差异进行点估计和方差估计.如果您想要通过每组测试的学生人数的差异,也许这也可以作为派生参数包含在贝叶斯代码中.

If assuming a covariance of 0 is not satisfactory then perhaps a Bayesian approach would be better. Again, I am only familiar with predicting for a single observation. With a Bayesian approach I have predicted a single observation by including the independent variables, but not the dependent variable, for the observation to be predicted. I suppose you could predict for every observation in the same Bayesian run (predict each student in HIGH and in LOW). The weighted averages of passing tests for each group and the difference in weighted averages are derived parameters and I suspect could be included directly in the code for the Bayesian logistic regression. Then you would have your point estimate and estimate of variance for probability of passing each group of tests and for the difference in probability of passing each group of tests. If you want the difference in the number of students passing each group of tests, perhaps that could be included in the Bayesian code as a derived parameter also.

我意识到到目前为止,这个答案的对话性比预期的要多.我只是制定了尝试的策略,而没有时间尝试实施这些策略.提供所有 R 和 WinBUGS 代码来实现这两个提议的策略可能需要我几天时间.(WinBUGS 或 OpenBUGS 可以从 R 中调用.)我会在继续的过程中将代码附加到这个答案中.如果有人认为我提出的策略和/或即将发布的代码不正确,我希望他们能随时指出我的错误并提供更正.

I realize this answer, so far, has been more conversational than might be desired. I am simply mapping out strategies to attempt without having had the time yet to try implementing those strategies. Providing all of the R and WinBUGS code to implement both proposed strategies might take me a few days. (WinBUGS or OpenBUGS can be called from within R.) I will append the code to this answer as I go along. If anyone deems my proposed strategies, and/or forthcoming code, incorrect I hope they will feel free to point out my errors and offer corrections.

编辑

以下是生成虚假数据并使用频率论和贝叶斯方法分析该数据的代码.我还没有添加代码来实现上面的预测思路.我会在接下来的 1-2 天内尝试添加贝叶斯预测代码.我只使用了三个测试而不是五个.下面的代码编写方式,您可以将学生人数 n 更改为可以分为 6 个相等整数的任何非零数.

Below is code that generates fake data and analyzes that data using a frequentist and Bayesian approach. I have not yet added the code to implement the above ideas for prediction. I will try to add the Bayesian prediction code in the next 1-2 days. I only used three tests instead of five. The way the code is written below you can change the number of students, n, to any non-zero number that can be divided into 6 equal whole numbers.

# Bayesian_logistic_regression_June2012.r
# June 24, 2012

library(R2WinBUGS)
library(arm)
library(BRugs)

set.seed(3234)


# create fake data for n students and three tests

n <- 1200

# create factors for n/6 students in each of 6 categories

gender <- c(rep(0, (n/2)), rep(1, (n/2)))
test2  <- c(rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)),
            rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)))
test3  <- c(rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)),
            rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)))

# assign slopes to factors

B0      <-  0.4
Bgender <- -0.2
Btest2  <-  0.6
Btest3  <-  1.2

# estimate probability of passing test

p.pass <- (     exp(B0 + Bgender * gender + 
                         Btest2  * test2  + 
                         Btest3  * test3) /
           (1 + exp(B0 + Bgender * gender +
                         Btest2  * test2  + 
                         Btest3  * test3)))

# identify which students passed their test, 0 = fail, 1 = pass

passed   <- rep(0, n)
r.passed <- runif(n,0,1)
passed[r.passed <= p.pass] = 1

# use frequentist approach in R to estimate probability
# of passing test

m.freq <- glm(passed ~ as.factor(gender) +
                       as.factor(test2)  +
                       as.factor(test3)  , 
                       family = binomial)
summary(m.freq)

# predict(m.freq, type = "response")


# use OpenBUGS to analyze same data set

# Define model

sink("Bayesian.logistic.regression.txt")
cat("
model {

# Priors

 alpha ~ dnorm(0,0.01)
 bgender ~ dnorm(0,0.01)
 btest2 ~ dnorm(0,0.01)
 btest3 ~ dnorm(0,0.01)

# Likelihood

 for (i in 1:n) {
    passed[i] ~ dbin(p[i], 1)
    logit(p[i]) <- (alpha + bgender * gender[i] +
                            btest2  * test2[i]  +
                            btest3  * test3[i])
 }

# Derived parameters

 p.g.t1 <- exp(alpha) / (1 + exp(alpha))
 p.b.t1 <- exp(alpha + bgender) / (1 + exp(alpha + bgender))

 p.g.t2 <- (    exp(alpha +           btest2) / 
           (1 + exp(alpha +           btest2)))
 p.b.t2 <- (    exp(alpha + bgender + btest2) / 
           (1 + exp(alpha + bgender + btest2)))

 p.g.t3 <- (    exp(alpha +           btest3) / 
           (1 + exp(alpha +           btest3)))
 p.b.t3 <- (    exp(alpha + bgender + btest3) / 
           (1 + exp(alpha + bgender + btest3)))

}

", fill = TRUE)
sink()

my.data <- list(passed = passed, 
                gender = gender,
                test2  = test2,
                test3  = test3, 
                n      = length(passed))

# Inits function

inits <- function(){ list(alpha   = rlnorm(1), 
                          bgender = rlnorm(1),
                          btest2  = rlnorm(1),
                          btest3  = rlnorm(1)) }

# Parameters to estimate

params <- c("alpha", "bgender", "btest2", "btest3", 
            "p.g.t1", "p.b.t1", "p.g.t2", "p.b.t2",
            "p.g.t3", "p.b.t3")

# MCMC settings

nc <- 3
ni <- 2000
nb <- 500
nt <- 2

# Start Gibbs sampling

out <- bugs(data = my.data, inits = inits,
parameters.to.save = params, 
"c:/users/Mark W Miller/documents/Bayesian.logistic.regression.txt",
program = 'OpenBUGS', 
n.thin = nt, n.chains = nc, 
n.burnin = nb, n.iter = ni, debug = TRUE)

print(out, dig = 5)

在我尝试实施加权平均预测方法之前,我想说服自己它可能有效.所以我编写了以下代码,这似乎表明它可能:

Before I attempt to implement the weighted-average approach to prediction I wanted to convince myself that it might work. So I ginned up the following code, which seems to suggest it may:

# specify number of girls taking each test and
# number of boys taking each test

g.t1 <- rep(0,400)
b.t1 <- rep(0,120)
g.t2 <- rep(0,1200)
b.t2 <- rep(0,50)
g.t3 <- rep(0,1000)
b.t3 <- rep(0,2000)

# specify probability of individuals in each of the
# 6 groups passing their test

p.g1.t1 <- 0.40
p.b1.t1 <- 0.30
p.g1.t2 <- 0.60
p.b1.t2 <- 0.50
p.g1.t3 <- 0.80
p.b1.t3 <- 0.70

# identify which individuals in each group passed their test

g.t1[1:(p.g1.t1 * length(g.t1))] = 1
sum(g.t1)

b.t1[1:(p.b1.t1 * length(b.t1))] = 1
sum(b.t1)

g.t2[1:(p.g1.t2 * length(g.t2))] = 1
sum(g.t2)

b.t2[1:(p.b1.t2 * length(b.t2))] = 1
sum(b.t2)

g.t3[1:(p.g1.t3 * length(g.t3))] = 1
sum(g.t3)

b.t3[1:(p.b1.t3 * length(b.t3))] = 1
sum(b.t3)

# determine the weighted average probability of passing
# on test day for all individuals as a class

wt.ave.p <- ((p.g1.t1 * length(g.t1) + p.b1.t1 * length(b.t1) +
 p.g1.t2 * length(g.t2) + p.b1.t2 * length(b.t2) +
 p.g1.t3 * length(g.t3) + p.b1.t3 * length(b.t3) ) / 

 (length(g.t1) + length(b.t1) + length(g.t2) + 
  length(b.t2) + length(g.t3) + length(b.t3)))

wt.ave.p

# determine the expected number of individuals passing
# their test in the class as a whole

exp.num.pass <- wt.ave.p *  (length(g.t1) + length(b.t1) +
                             length(g.t2) + length(b.t2) +
                             length(g.t3) + length(b.t3))
exp.num.pass

# determine the number of individuals passing

num.passing <- (sum(g.t1) + sum(b.t1) + 
                sum(g.t2) + sum(b.t2) + 
                sum(g.t3) + sum(b.t3) )
num.passing

# the expected number of students passing, exp.num.pass,
# should equal the observed number of students passing,
# num.passing regardless of the number of students in each
# group and regardless of the probability of passing a 
# given test, within rounding error

identical(round(exp.num.pass), round(num.passing)) 

希望在接下来的几天里我可以尝试将预测代码添加到上述贝叶斯代码中.

Hopefully in the next couple of days I can try adding the prediction code to the above Bayesian code.

编辑 - 2012 年 6 月 27 日

EDIT - June 27, 2012

我没有忘记这件事.相反,我遇到了几个问题:

I have not forgotten about this. Rather, I have encountered several problems:

  1. 使用逻辑回归可以预测:a) 给定组中的学生通过测试的概率 p,以及 b) 给定学生参加测试的结果(0 或 1).然后对所有 0 和 1 求平均值.我不确定要使用其中的哪一个.预测 p 的点估计和 SD 与已知测试结果的估计 p 相同.预测的 0 和 1 的平均值的点估计略有不同,平均 0 和 1 的 SD 大得多.我相信我想要 b,预测的 0 和 1 的平均值.但是,我正在尝试检查各种网站和书籍以确定.Collett (1991) 有一个不使用计算机代码的工作示例,但该工作示例包含六个变量,其中包括 2 个交互,我在让我的贝叶斯估计与她的频率论估计相匹配时遇到了一些麻烦.

  1. With logistic regression it is possible to predict: a) the probability, p, that students in a given group pass a test and b) the outcome of a given student taking a test (0 or 1). All of the 0's and 1's are then averaged. I am not sure which of these to use. The point estimate and SD of the predicted p is identical to the estimated p for known test outcomes. The point estimate of the average of the predicted 0's and 1's is a little different and the SD of the averaged 0's and 1's is much larger. I believe I want b, the average of the predicted 0's and 1's. However, I am attempting to examine various websites and books to be sure. Collett (1991) has a worked example that does not employ computer code, but that worked example includes a half-dozen variables including 2 interactions and I am having a little trouble getting my Bayesian estimates to match her frequentist estimates.

由于有大量派生参数,程序需要很长时间才能运行.

With lots of derived parameters the program is taking a long time to run.

显然,即使没有预测代码,OpenBUGS 也经常崩溃,我相信.我不确定这是因为我做错了什么,还是因为最近版本的 R 或最近版本的 R 包的变化,或者因为我试图用 64 位 R 或其他东西运行代码否则.

Apparently OpenBUGS has been crashing frequently, I believe, even without prediction code. I am not sure whether that is because of something I am doing wrong or because of changes in the recent versions of R or changes in recent versions of R packages or maybe because I am trying to run the code with a 64-bit R or something else.

我会尽快发布预测代码,但上述所有问题都让我慢了下来.

I will try to post the prediction code soon, but all of the above issues have slowed me down.