If you need to load the passengers data set, run this:
source("http://ajstamm.github.io/titanic/workshops/regression/import-passengers.R")
Note: source() runs all the code in the file.
Dependent variable (outcome): what we want to predict.
Independent variable (predictor): what we use to predict it.
tbl_surv <- table(passengers$survived)
tbl_surv
0 1
140 110
## table above as percentages
100*prop.table(tbl_surv)
0 1
56 44
More passengers died than survived.
## Create died: 1 if the passenger died, 0 if they survived.
passengers$died <- 1
passengers$died[passengers$survived == 1] <- 0
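Since survived is already coded 0/1, the same thing as a one-liner (a sketch, equivalent to the two lines above):
passengers$died <- 1 - passengers$survived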
## Check if survived and died are opposites.
## head() shows the first 6 rows of your table
head(passengers[, c("survived", "died")])
survived died
1 0 1
2 1 0
3 1 0
4 1 0
5 0 1
6 0 1
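head() only shows six rows. A one-line check of every row (a minimal sketch; each pair should sum to 1, so this should print TRUE):
with(passengers, all(died + survived == 1))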
Independent variable: sex; Dependent variable: died
## Cannot have NAs in our data.
sum(is.na(passengers$sex))
[1] 0
sum(is.na(passengers$died))
[1] 0
## For the plot below we need a numeric 0/1 variable, not a categorical one.
passengers$is_male <- 0
passengers$is_male[passengers$sex == "male"] <- 1
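An equivalent one-liner (a sketch; assumes sex is exactly "male" or "female", and we checked above that it has no NAs):
passengers$is_male <- as.integer(passengers$sex == "male")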
logit_sex <- glm(formula=died~is_male, family=binomial,
data=passengers)
To what extent does is_male explain died?
summary(logit_sex)
Call:
glm(formula = died ~ is_male, family = binomial, data = passengers)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7125 -0.8112 0.7244 0.7244 1.5948
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9426 0.2152 -4.380 1.19e-05 ***
is_male       2.1466     0.2928   7.332 2.27e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 342.96 on 249 degrees of freedom
Residual deviance: 281.47 on 248 degrees of freedom
AIC: 285.47
Number of Fisher Scoring iterations: 4
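The coefficients are on the log-odds scale. A quick sketch turning them into predicted probabilities with plogis(), base R's inverse logit (numbers taken from the summary above):
## P(died) for women (is_male = 0): about 0.28
plogis(-0.9426)
## P(died) for men (is_male = 1): about 0.77
plogis(-0.9426 + 2.1466)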
## x = is_male, y = died, so the fitted curve lines up with the points:
plot(jitter(passengers$is_male, .4),
     jitter(passengers$died, .4))
curve(predict(logit_sex, data.frame(is_male=x), type="response"),
add=TRUE)
Independent variable: age; Dependent variable: died
## Cannot have NAs in our data.
sum(is.na(passengers$age))
[1] 0
sum(is.na(passengers$died))
[1] 0
logit_age <- glm(formula=died~age, family=binomial,
data=passengers)
To what extent does the variable age explain died?
summary(logit_age)
Call:
glm(formula = died~age, family = binomial, data = passengers)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.366 -1.273 1.037 1.089 1.145
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.063888 0.303464 0.211 0.833
age 0.005861 0.009127 0.642 0.521
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 342.96 on 249 degrees of freedom
Residual deviance: 342.55 on 248 degrees of freedom
AIC: 346.55
Number of Fisher Scoring iterations: 4
plot(passengers$age, passengers$died)
curve(predict(logit_age, data.frame(age=x), type="response"), add=TRUE)
Yeah - that nearly flat line really is the fitted curve; age barely changes the predicted probability of dying.
How can we use age to predict died?
1. Explore. 2. Learn new tricks.
## You've seen this:
prop.table(table(passengers$died))
0 1
0.44 0.56
## Indexing / filtering: what proportion of passengers
## five years of age or younger died?
prop.table(table(passengers$died[passengers$age <= 5]))
0 1
0.8181818 0.1818182
Note: I'm TIRED of typing passengers.
Three new tricks: with() evaluates an expression with a data frame's columns in scope; & and | combine logical conditions; round() rounds numbers.
## - Only write passengers once.
## - What proportion of passengers older than 5 and up to age 10 died?
with(passengers,
prop.table(table(died[age > 5 & age <= 10])))
0 1
0.5714286 0.4285714
## Change proportion to percent and round:
with(passengers,
round(prop.table(table(died[age > 5 & age <= 10]))*100))
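Because died is coded 0/1, mean() of a 0/1 vector is the proportion of 1s, so the same answer in one step:
with(passengers, mean(died[age > 5 & age <= 10]))  ## about 0.43, the proportion who died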
## Let's create a new factor:
passengers$age_group <- NA
passengers$age_group[passengers$age <= 10] <- "00-10"
passengers$age_group[passengers$age > 10 &
passengers$age <= 20] <- "11-20"
passengers$age_group[passengers$age > 20 &
passengers$age <= 30] <- "21-30"
passengers$age_group[passengers$age > 30 &
passengers$age <= 40] <- "31-40"
passengers$age_group[passengers$age > 40 &
passengers$age <= 50] <- "41-50"
passengers$age_group[passengers$age > 50 &
passengers$age <= 60] <- "51-60"
passengers$age_group[passengers$age > 60] <- "61+"
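The same grouping in one call with base R's cut(); a sketch that returns a factor rather than character strings, which glm() treats the same way:
passengers$age_group <- cut(passengers$age,
                            breaks = c(0, 10, 20, 30, 40, 50, 60, Inf),
                            labels = c("00-10", "11-20", "21-30", "31-40",
                                       "41-50", "51-60", "61+"),
                            include.lowest = TRUE)  ## ages of exactly 0 land in "00-10"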
## This should be familiar by now:
with(passengers,
round(prop.table(table(age_group, died),1)*100,1))
## But this is nice!
## (Multiply by 100 *after* prop.table; prop.table renormalizes,
## so multiplying the table first would have no effect.)
multi_tbl <- with(passengers,
                  round(prop.table(
                    table(age_group, sex, died), 1) * 100, 1))
ftable(multi_tbl)
died 0 1
age_group sex
00-10 female 27.8 22.2
male 44.4 5.6
11-20 female 24.3 13.5
male 5.4 56.8
21-30 female 31.3 14.5
male 10.8 43.4
31-40 female 41.7 10.0
male 10.0 38.3
41-50 female 21.2 9.1
male 18.2 51.5
51-60 female 40.0 0.0
male 10.0 50.0
61+ female 11.1 0.0
male 11.1 77.8
Is age_group a statistically significant predictor of died? No peeking!
logit_age <- glm(formula=died~age_group, family=binomial,
data=passengers)
summary(logit_age)
Call:
glm(formula = died ~ age_group, family = binomial, data = passengers)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.734 -1.177 0.840 1.047 1.601
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9555 0.5262 -1.816 0.06941 .
age_group11-20 1.8157 0.6374 2.849 0.00439 **
age_group21-30 1.2714 0.5713 2.226 0.02604 *
age_group31-40 0.8888 0.5862 1.516 0.12948
age_group41-50 1.3863 0.6355 2.181 0.02915 *
age_group51-60 0.9555 0.8228 1.161 0.24550
age_group61+ 2.2083 0.9591 2.303 0.02130 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 342.96 on 249 degrees of freedom
Residual deviance: 330.08 on 243 degrees of freedom
AIC: 344.08
Number of Fisher Scoring iterations: 4
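age_group has six dummy coefficients, so to test it as a single predictor one option is a likelihood-ratio test with anova(); it works from the drop in deviance shown above (342.96 - 330.08 = 12.88 on 6 degrees of freedom):
## Likelihood-ratio (chi-squared) test of the whole age_group term
anova(logit_age, test = "Chisq")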
Tweak these groups to reduce the residuals further.
But don't overfit.
How are these two models different?
logit_complex <- glm(formula = died ~ age_group + is_male,
                     family = binomial,
                     data = passengers)
summary(logit_complex)
logit_complex <- glm(formula = died ~ age_group * is_male,
                     family = binomial,
                     data = passengers)
summary(logit_complex)
?formula
## Hint: a + b fits main effects only; a * b expands to
## a + b + a:b, adding the interaction term.
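To compare the two fits directly, a sketch using hypothetical names logit_add and logit_int so both models stay in memory:
## Additive model: separate age_group and is_male effects
logit_add <- glm(died ~ age_group + is_male, family = binomial, data = passengers)
## Interaction model: lets the is_male effect differ by age group
logit_int <- glm(died ~ age_group * is_male, family = binomial, data = passengers)
## Likelihood-ratio test: does the interaction improve the fit?
anova(logit_add, logit_int, test = "Chisq")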
The odds ratio (OR) quantifies how strongly the presence or absence of property A is associated with the presence or absence of property B in a given population.
Question: how is being male associated with dying?
Logistic regression calculates odds ratios (OR), not relative risks (RR).
## Remember: Rows (died), Columns (is_male)
t_male <- with(passengers,
table(died, is_male, dnn=c("died","is_male")))
t_male
is_male
died 0 1
0 77 33
1 30 110
## Those numbers alone tell you a lot.
Odds that a man died: 110/33
Odds that a woman died: 30/77
The ratio of these two odds: (110/33) / (30/77)
or, equivalently: (110*77) / (33*30)
Can you figure out how to do that with R?
It is just some fancy indexing trickery.
## Remember:
t_male
is_male
died 0 1
0 77 33
1 30 110
Thus, our solution . . .
## Remember, you can use R like a giant calculator:
(t_male[2,2] * t_male[1,1]) / (t_male[2,1]*t_male[1,2])
[1] 8.555556
## This does not come with confidence intervals.
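If you want a confidence interval straight from the 2x2 table, one option is base R's fisher.test(); note that its odds-ratio estimate is a conditional maximum-likelihood estimate, so it will be close to, but not exactly equal to, the cross-product ratio above:
fisher.test(t_male)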
What is the relationship between the OR and logistic regression?
log() of the OR == the coefficient from the logit model.
exp() of the coefficient from the logit model == the OR.
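You can check that round trip with the numbers we already have (OR = 8.5556 from the table, coefficient = 2.1466 from the model):
log(8.555556)   ## 2.146581, the is_male coefficient
exp(2.146581)   ## 8.555556, the odds ratio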
## How does this help us?
names(summary(logit_sex))
## Step 1: Access the model coefficients:
summary(logit_sex)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.942608 0.2152215 -4.379712 1.188363e-05
is_male 2.146581 0.2927698 7.331975 2.267860e-13
## Step 2: Calculate the exponent of the coefficients:
exp( summary(logit_sex)$coefficients )
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3896104 1.240137 1.252897e-02 1.000012
is_male 8.5555556 1.340134 1.528397e+03 1.000000
## Step 1: Access the model coefficients:
logit_sex_coef <- summary(logit_sex)$coefficients
## Step 2: Calculate the exponent of the coefficients:
exp( logit_sex_coef )
## - Step 1 returns a matrix and saves it as logit_sex_coef.
## - Step 2 exponentiates each entry of the matrix; only the
##   Estimate column is meaningful as an odds ratio.
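Since logit_sex_coef holds both the estimate and its standard error, here is a minimal sketch of a 95% Wald confidence interval for the OR (exponentiate the coefficient plus or minus 1.96 standard errors):
## 95% Wald CI for the odds ratio: exp(estimate +/- 1.96 * std. error)
est <- logit_sex_coef["is_male", "Estimate"]
se  <- logit_sex_coef["is_male", "Std. Error"]
exp(c(OR = est, lower = est - 1.96 * se, upper = est + 1.96 * se))  ## roughly 4.8 to 15.2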
Using R is like building a spaceship with LEGOS!
Assemble little things to build big things.
The following ad is paid for by our sponsors.