Data Transformation Practice

Duration: ~30 Minutes

Clock made of Legos

Learning Objectives

  • Summarize data
  • Transform your results

Prepare your project

  1. Create a new R script file in your project folder
  2. Add a heading
  3. Load tidyverse
  4. Download the accident dataset
  5. Place the dataset in your data folder


accident <- read_csv("Data/fars2015nys_accident.csv", 
                     col_types="cnccnccnnnnnnncccccc") 

Practice spread()

For this exercise, create two tables.

  1. Total fatalities by county by month
  2. Average vehicles per accident by county by month

For these examples, county will be the rows and month will be the columns

Variables you will need

  • FATALS: number of deaths per accident
  • VE_TOTAL: number of vehicles per accident
  • MONTH: three-letter month abbreviation
  • ST_CO_FIPS: county FIPS code

What should you do first?

This is a complex exercise that will use several things you have learned so far.

Click the down arrow if you need a hint or two.

Hint 1

Create a summary dataset. Remember to group variables first.

Hint 2

Identify your key and value variables.

Create the summary dataset


summarized_accident <- accident %>%
  group_by(ST_CO_FIPS, MONTH) %>%
  summarize(
    total_deaths = sum(FATALS),
    avg_vehicles = mean(VE_TOTAL)
  ) %>% 
  ungroup()
  
    

Table 1

Create total vehicular fatalities by county by month.


spread_deaths <- summarized_accident %>%
  select(-avg_vehicles) %>%
  spread(key = MONTH, value = total_deaths)

    

Note: in this case, missings mean there were no deaths. To replace missings with 0, use this code.


spread_deaths[is.na(spread_deaths)] <- 0

    

Click the down arrow if you need help with the code for vehicles.

Table 2

Create average vehicles per accident by county by month.


spread_vehicles <- summarized_accident %>%
  select(-total_deaths) %>%
  spread(key = MONTH, value = avg_vehicles)

spread_vehicles[is.na(spread_vehicles)] <- 0

    

Practice gather()

For this exercise, combine your two spread tables to recreate summarized_accident.

What should you do first?

This is a complex exercise that will use several things you have learned so far.

Click the down arrow if you need a hint or three.

Hint 1

Convert each spread table from long to wide.

Hint 2

Merge the two tables by county and month.

Hint 3

Remove all rows for which both total fatalities and average vehicles equal 0.

Gather each dataset

Create a long table for the deaths.


gathered_deaths <- spread_deaths %>%
  gather(key = MONTH, value = total_deaths, -ST_CO_FIPS)

    

Click the down arrow in you need help with the code for vehicles.

Create a long table for the vehicles.


gathered_vehicles <- spread_vehicles %>%
  gather(key = MONTH, value = avg_vehicles, -ST_CO_FIPS)

    

Merge the tables


gathered_accident <- gathered_deaths %>%
  left_join(gathered_vehicles, by = c("ST_CO_FIPS", "MONTH"))

    

For this exercise, full_join and right_join would also work. Why?

Bonus content

To test if the tables are equivalent, try these:


nrow(gathered_accident) == nrow(summarized_accident) 

# this function will give either TRUE 
# or at least one way the tables are not equivalent
all.equal(gathered_accident, summarized_accident)

  

Why are they not eqivalent? What can we do to fix that?

Make tables equivalent

Remove extra rows from gathered_accident.


gathered_accident <- gathered_accident %>% 
  filter(avg_vehicles != 0 | total_deaths != 0)

  

To make the tables identical, you would also need to sort row order.

And Now You Know!

Q & A

Enjoy R and please fill out our evaluation.