Duration: ~30 Minutes
accident <- read_csv("Data/fars2015nys_accident.csv",
col_types="cnccnccnnnnnnncccccc")
vehicle <- read_csv("Data/fars2015nys_vehicle.csv",
col_types="ccccncnnccnccc")
person <- read_csv("Data/fars2015nys_person.csv",
col_types="cccncccccccccc")
Remove the following extra variables before merging.
Hint: use select()
.
accident <- accident %>%
select( -LATITUDE, -LONGITUD )
The code below is a variation on what you've already learned.
vehicle <- vehicle %>%
select( -(MAKE:VIN), -(TRAV_SP:DEFORMED) )
person <- person %>%
select( -(DRINKING:DOA), -SEAT_POS )
Hint: identify the keys. You can join on more than one variable at a time.
Click the down arrow for code.
my_fulldata <- accident %>%
full_join(vehicle, by = "ST_CASE") %>%
full_join(person, by = c("ST_CASE", "VEH_NO", "ST_CO_FIPS"))
Suppose you only want vehicle occupants. How will you join or filter the datasets to exclude everyone else?
One possible solution is below.
my_peopleincars <- accident %>%
full_join(vehicle, by = "ST_CASE") %>%
inner_join(person, by = c("ST_CASE", "VEH_NO", "ST_CO_FIPS"))
Multiple possible methods:
## method 1
my_pedestrians1 <- my_fulldata %>%
filter( is.na(NUMOCCS) ) # I picked a random variable
# unique to the vehicle dataset.
# Why might this not work?
## method 2
my_pedestrians2 <- person %>%
filter( !PER_TYP %in% c("driver", "passenger") ) %>%
left_join(accident, by = c("ST_CASE", "ST_CO_FIPS"))
What's up with the discrepancy?
Let's learn another useful function.
pedestrian_test <- my_pedestrians2 %>%
anti_join(my_pedestrians1, my_pedestrians2, by=c("ST_CASE", "PER_NO"))
anti_join()
selects all cases that are in only one dataset.
Type ?anti_join
to learn more.
Check if your dataset only contains pedestrians. Which variable could you use?
I will provide two solutions on the next slide, one with base R and one with tidyverse.
You could use base R,
table(my_pedestrians2$VEH_NO)
... or you could use tidyverse.
my_pedestrians2 %>% group_by(VEH_NO) %>% summarize(total = n())
Some possible hypotheses to test:
Let's explore these.
The dataset includes hour and daylight. Your question would dictate
which is more appropriate. My question is,
"Are pedestrians most likely to be hit during the day, at night, or in between?"
To answer this, evaluate the variable LGT_COND
.
Create a summary table with count and proportion of pedestrian deaths by time of day.
my_pedestrians3 %>% group_by(LGT_COND) %>%
summarize(total = n(), percent = 100*total/nrow(.))
How could we change the code to look at time of day by whether the pedestrian died?
my_pedestrians3 %>% group_by(LGT_COND, died = INJ_SEV=="fatal") %>%
summarize(total = n(), percent = 100*total/nrow(.))
Investigate one or more of these variables.
Recode:
Raise your hand if you have questions or need help.
Create a dataset with the following:
Answer these questions using your dataset.
Hint for Question 3: NYS fips codes here
my_countydata <- my_fulldata %>% group_by(ST_CO_FIPS) %>%
summarize(
Total_Deaths = sum(INJ_SEV == "fatal", na.rm=TRUE),
Pedestrian_Deaths = sum(INJ_SEV == "fatal" & VEH_NO == 0,
na.rm=TRUE),
Total_Accidents = length(unique(ST_CASE)),
Avg_Death_Accident = Total_Deaths / Total_Accidents,
Avg_Ped_Death_Accident = Pedestrian_Deaths / Total_Accidents
)
Click down for answers.
# highest avg deaths per accident
my_countydata %>% arrange(-Avg_Death_Accident)
# most deaths
my_countydata %>% arrange(-Total_Deaths)
# check who is missing
# first create a data frame of all county fips codes
allfips <- data.frame(ST_CO_FIPS = seq(36001, 36123, by=2))
# Then check if all fips codes are in our dataset
allfips %>% filter(!ST_CO_FIPS %in% my_countydata$ST_CO_FIPS)