Logistic Regression II


Donner Party: The Data

In 1846, a group of 87 people (calling themselves the Donner Party) were heading to California from Illinois. They became stranded in the Sierra Nevada and remained there through the winter; many died from the cold and lack of food.

x <- read.csv("Donner.csv", as.is = TRUE)
head(x)

Let’s polish up the Gender variable a bit for clarity.

x$Gender <- factor(ifelse(x$Gender == 1, "Male", "Female"))

Modeling Survival

Do Females Have a Higher Rate of Survival?

table(x$Gender, x$Survived)

We might want to study whether females were better able to survive the harsh conditions. How do we answer this question?

m1 <- glm(Survived ~ Gender, data=x, family=binomial)
summary(m1)

We haven’t talked about where our p-values come from, but you can see from the summary that Gender is significant in predicting the probability of survival.

Interpretation of this coefficient:

The odds of survival for males were, on average, $e^{-1.28} \approx 0.28$ times the odds for females. (Exponentiating a logistic regression coefficient gives an odds ratio, not a ratio of probabilities.)
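
As a quick check in R (reusing the m1 fit from above; GenderMale is the dummy-variable name R assigns given the factor levels):

exp(coef(m1)["GenderMale"])  # odds ratio for males vs. females, about 0.28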

This might seem odd. Logistic regression is meant to fit an S-shaped curve
for the probability of survival here, but we have a categorical predictor. How does
this work? Essentially, by passing in a categorical predictor as a factor,
R created a dummy variable that equals 1 when Gender == "Male" and 0
otherwise. So if we were to look at a plot of GenderMale on the x-axis
against Survived on the y-axis, logistic regression is essentially still
fitting an S-shaped curve, but with only two points of interest, when $x=0$ and
when $x=1$.
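
Concretely, evaluating the fitted curve at those two points recovers the two group survival rates (a minimal sketch; plogis() is R's inverse-logit function):

b <- coef(m1)
plogis(b[1])         # fitted Pr(Survived) for females (GenderMale = 0)
plogis(b[1] + b[2])  # fitted Pr(Survived) for males  (GenderMale = 1)

With a single binary predictor the model is saturated, so these fitted probabilities match the raw survival proportions in the table above.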

Is Age Associated with Survival?

Suppose we now consider a different question: can Age help us predict survival? We first need to transform Age into a quantitative variable; it is currently of character type because an asterisk somewhere stands in for a missing value.

class(x$Age)
x$Age
x$Age <- as.numeric(x$Age)  # the "*" entry becomes NA (with a coercion warning)

Your book ignores any assumptions for logistic regression, but that doesn’t mean there aren’t any. The linearity assumption requires that the log-odds be a linear function of the predictor, i.e. $\log\frac{p}{1-p} = \beta_0 + \beta_1 \,\mathrm{Age}$, where $p$ is the probability of survival. To check it, we need to be able to estimate log-odds for various age groups. (It doesn’t make sense to estimate the log-odds at each individual age because we simply don’t have enough data at each age.) To start, we will want to create a categorical version of Age, using six equal-width bins.

x$AgeGrp <- cut(x$Age, 6)  # six equal-width intervals over the range of Age

Let’s practice checking the linearity assumption (although in general we won’t bother).

binned.x <- tapply(x$Age, x$AgeGrp, mean)      # average age per bin
binned.p <- tapply(x$Survived, x$AgeGrp, mean) # rate of survival per bin
plot(log(binned.p/(1-binned.p)) ~ binned.x, main="Empirical Logit Plot",
     pch=16)
plot(binned.p ~ binned.x, main="Pr(Survival) by Age", pch=16)

As we can see, this assumption is hard to check with few observations, because some of the bins will be extremely sparse. With a large number of observations it’s much easier to check the assumption for a single predictor against the response; however, large data sets usually come with many predictors, and checking the assumption for every one of them becomes tedious. As a result, we won’t bother checking this assumption in general.
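
A quick look at the bin counts makes the sparsity concrete (reusing AgeGrp from above):

table(x$AgeGrp)  # number of observations per age bin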

m2 <- glm(Survived ~ Age, data=x, family=binomial)
summary(m2)

You can see again that Age is significant at the $\alpha=0.05$ level. There is also a negative coefficient, suggesting that we would have a downward sloping logit curve. This suggests that older people have lower chances of surviving the extreme conditions.

plot(Survived ~ Age, data=x, pch=16)
tmp <- data.frame(Age=seq(0, 70, by=1))
pred <- predict(m2, newdata=tmp, type="response")
lines(tmp$Age, pred, lwd=2, col="blue")

Both Age + Gender

m3 <- glm(Survived ~ Age + Gender, data=x, family=binomial)
summary(m3)

It appears that both increasing age and being male decrease the odds of survival. The 95% confidence intervals for both coefficients are:

confint(m3, level=0.95)

Undoing the log-transformation yields:

exp(confint(m3, level=0.95))

Between two individuals of the same age but of different genders, the male’s odds of survival are estimated to be between 0.11 and 0.78 times the female’s.

Between two individuals of the same gender whose ages differ by one year, the older individual’s odds of survival are between 0.93 and 1.00 times the younger individual’s.

In this model, we have not interacted Age and Gender, so essentially it is like fitting two parallel lines to the log-odds of survival, one for each gender.
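
We can see this directly by plotting predictions on the link (log-odds) scale instead of the response scale (a minimal sketch using m3 from above):

agetmp <- seq(0, 70, by=1)
eta_f <- predict(m3, newdata=data.frame(Age=agetmp, Gender="Female"), type="link")
eta_m <- predict(m3, newdata=data.frame(Age=agetmp, Gender="Male"), type="link")
plot(agetmp, eta_f, type="l", col="red", ylim=range(eta_f, eta_m),
     xlab="Age", ylab="Log-odds of survival")
lines(agetmp, eta_m, col="blue")

On this scale the two curves are exactly parallel lines; the S-curves plotted below are what they become after the inverse-logit transformation.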

plot(Survived ~ Age, data=x, pch=16, col=ifelse(Gender=="Female", "red", "blue"))
agetmp <- seq(0, 70, by=1)
pred_fem <- predict(m3, newdata=data.frame(Age=agetmp, Gender="Female"),
                    type="response")
lines(agetmp, pred_fem, lwd=2, col="red")
pred_m <- predict(m3, newdata=data.frame(Age=agetmp, Gender="Male"),
                  type="response")
lines(agetmp, pred_m, lwd=2, col="blue")

If we believe that the slopes of age should differ with respect to each gender, then we should interact the two variables:

m4 <- glm(Survived ~ Age * Gender, data=x, family=binomial)

plot(Survived ~ Age, data=x, pch=16, col=ifelse(Gender=="Female", "red", "blue"))
agetmp <- seq(0, 70, by=1)
pred_fem <- predict(m4, newdata=data.frame(Age=agetmp, Gender="Female"),
                    type="response")
lines(agetmp, pred_fem, lwd=2, col="red")
pred_m <- predict(m4, newdata=data.frame(Age=agetmp, Gender="Male"),
                  type="response")
lines(agetmp, pred_m, lwd=2, col="blue")

summary(m4)

Was this a good thing to do? No: now none of the predictors is statistically significant. Part of the problem is that the interaction must be estimated from very little data in some age-gender combinations, as the age distributions below suggest.

par(mfrow=c(1,2))
hist(x$Age[x$Gender == "Male"], xlim=c(0, 70), main="Age of Males",
     breaks=seq(0, 70, by=10), ylim=c(0, 18))
hist(x$Age[x$Gender == "Female"], xlim=c(0, 70), main="Age of Females",
     breaks=seq(0, 70, by=10), ylim=c(0, 18))

Significance of Predictors

Deviance is $-2$ times the log-likelihood of the model, and can be thought of as something like the residual sum of squares in linear regression: a quantity for which a smaller value represents a better fit.

m4$deviance
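
Since Survived is a 0/1 response, the saturated model reproduces every observation exactly and has log-likelihood zero, so the residual deviance should equal $-2$ times the model’s log-likelihood. A quick check:

-2 * as.numeric(logLik(m4))  # should match m4$deviance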

Note that this corresponds to the line in the regression output called residual deviance. The null deviance is the deviance of a model containing no predictors (just the intercept). We can use the drop-in-deviance test for evaluating the significance of individual predictors:

anova(m4, test = "Chisq")

This is an alternative way to test the significance of individual predictors, and it is often considered more reliable than the Wald test, which produces the p-values in the regression summary.
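
To see the two tests side by side for a single predictor (a sketch using the Age-only model m2):

summary(m2)$coefficients["Age", ]  # Wald z-test for Age
anova(m2, test="Chisq")            # drop-in-deviance (likelihood-ratio) test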

We can also test nested models. Recall:

coef(m1)
coef(m4)

anova(m1, m4, test="Chisq")

Why didn’t this work? The two models were fit to different data sets: m1 was fit to all of the rows, while m4 drops the rows where Age is missing, so the two deviances are not comparable.

Let’s try this again, refitting the Gender-only model on the rows where Age is known…

m1a <- glm(Survived ~ Gender, data=x, subset=!is.na(x$Age), family=binomial)
summary(m1a)

When you use anova(model1, model2, test="Chisq"), R performs a chi-square test comparing the two models, i.e. it tests whether the reduction in deviance (the analogue of the reduction in residual sum of squares) is statistically significant. Note that this makes sense only if the two models are nested and fit to the same data.

anova(m1a, m4, test="Chisq")
anova(m1a, m2, test="Chisq") # ?? -- these two models are not nested

anova(m3, m4, test="Chisq")

Confusion Matrices

Another way to compare models is to examine confusion matrices which
compare $y$ to $\hat y$. This can be used even if models are not nested.

table(fitted=(fitted(m1) >= 0.5)*1, actual=x$Survived)
mean((fitted(m1) >= 0.5)*1 != x$Survived)  # misclassification rate

By the way, is this good? Significantly better than random guessing?
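
One simple benchmark (not in the original analysis) is the error rate of always predicting the majority class:

majority <- as.numeric(mean(x$Survived) >= 0.5)  # 1 if more than half survived
mean(majority != x$Survived)                     # error rate of the constant guess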

table(fitted=(fitted(m2) >= 0.5)*1, actual=x$Survived[!is.na(x$Age)])
mean((fitted(m2) >= 0.5)*1 != x$Survived[!is.na(x$Age)])

table(fitted=(fitted(m3) >= 0.5)*1, actual=x$Survived[!is.na(x$Age)])
mean((fitted(m3) >= 0.5)*1 != x$Survived[!is.na(x$Age)])

table(fitted=(fitted(m4) >= 0.5)*1, actual=x$Survived[!is.na(x$Age)])
mean((fitted(m4) >= 0.5)*1 != x$Survived[!is.na(x$Age)])
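
To line the four error rates up in one place (a sketch reusing the fits above; note that m1 is evaluated on all rows, while the others use only rows with a known Age):

ok <- !is.na(x$Age)
c(m1 = mean((fitted(m1) >= 0.5)*1 != x$Survived),
  m2 = mean((fitted(m2) >= 0.5)*1 != x$Survived[ok]),
  m3 = mean((fitted(m3) >= 0.5)*1 != x$Survived[ok]),
  m4 = mean((fitted(m4) >= 0.5)*1 != x$Survived[ok]))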

Permalink: https://konelane.github.io/2018/04/11/new-article/

-- EOF --
