Exploratory Analysis with Uber Data in R

Import Data

The dataset we will be working with today is Uber's NYC September 2014 dataset. You can find it, along with many others, on Kaggle.

Once you have your data, import it and look at its structure. We're also going to use the tidyverse package for this tutorial.

str(uber)
 $ Date.Time: Factor w/ 42907 levels "9/1/2014 0:00:00",..: 2 2 4 7 12 13 16 17 33 34 ...
 $ Lat      : num  40.2 40.8 40.8 40.7 40.8 ...
 $ Lon      : num  -74 -74 -74 -74 -73.9 ...
 $ Base     : Factor w/ 5 levels "B02512","B02598",..: 1 1 1 1 1 1 1 1 1 1 ...
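The import step described above can be sketched as follows. This is a minimal example, not necessarily how the data was loaded for this post: the three inline rows stand in for the real file, the `Date/Time` header is an assumption about the raw CSV, and `stringsAsFactors = TRUE` is assumed because the str() output shows factor columns.

```r
# Hypothetical three-row extract standing in for the downloaded CSV.
# With the real data you would pass a file path to read.csv() instead.
csv_text <- 'Date/Time,Lat,Lon,Base
"9/1/2014 0:01:00",40.2201,-74.0021,"B02512"
"9/1/2014 0:01:00",40.7500,-74.0027,"B02512"
"9/1/2014 0:03:00",40.7559,-73.9864,"B02512"'

# stringsAsFactors = TRUE reproduces the factor columns seen in str();
# since R 4.0.0 the default is FALSE, which gives character columns instead.
# read.csv() also turns the header "Date/Time" into the valid name Date.Time.
uber <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(uber)
```

On a recent R version with the default settings, Date.Time and Base arrive as chr rather than Factor, so the factor-to-character step later on becomes a no-op.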

Notice that the datetime is stored as a factor and does not follow the ISO 8601 standard (year-month-day, as in 2014-09-01). R reads datetimes most easily in the ISO format. R's datetime classes are called POSIXct and POSIXlt. We don't need to get into the details, but when you run str() you want to see POSIXct or POSIXlt for your datetime column. You will run into a bunch of problems otherwise.

Date.Time          Lat      Lon       Base
9/1/2014 0:01:00   40.2201  -74.0021  B02512
9/1/2014 0:01:00   40.7500  -74.0027  B02512
9/1/2014 0:03:00   40.7559  -73.9864  B02512
9/1/2014 0:06:00   40.7450  -73.9889  B02512
9/1/2014 0:11:00   40.8145  -73.9444  B02512
9/1/2014 0:12:00   40.6735  -73.9918  B02512

Clean Datetime

There are a bunch of different ways to clean up datetimes, but the method is more or less the same:

  • Convert factor to character
uber$Date.Time = as.character(uber$Date.Time)
  • Use a datetime function to convert from chr to POSIXct/POSIXlt (pick one of the two)
uber$Date.Time = strptime(uber$Date.Time, "%m/%d/%Y %H:%M:%S") #base R, returns POSIXlt
uber$Date.Time = parse_date_time(uber$Date.Time, orders = 'mdy HMS') #lubridate package, returns POSIXct
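The two steps above can be tried on a small sample in base R. This sketch swaps in as.POSIXct(), which takes the same format string as strptime() but returns POSIXct. The sample rows are hypothetical, and tz = "UTC" is an assumption made only to keep the result reproducible.

```r
# Hypothetical sample in the dataset's month/day/year layout.
uber <- data.frame(
  Date.Time = factor(c("9/1/2014 0:01:00", "9/1/2014 0:03:00", "9/2/2014 13:45:10")),
  Base = factor(c("B02512", "B02512", "B02598"))
)

# Step 1: factor -> character. Skipping this step risks working with the
# integer level codes rather than the text of each datetime.
uber$Date.Time <- as.character(uber$Date.Time)

# Step 2: character -> POSIXct, using the same format codes as strptime().
# POSIXct is a plain numeric vector underneath, which tends to behave better
# as a data frame column than the list-based POSIXlt.
uber$Date.Time <- as.POSIXct(uber$Date.Time,
                             format = "%m/%d/%Y %H:%M:%S",
                             tz = "UTC")

class(uber$Date.Time)
format(uber$Date.Time[3], "%Y-%m-%d %H:%M:%S")
```

Any string that does not match the format comes back as NA, so checking anyNA(uber$Date.Time) after the conversion is a cheap sanity test.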

strptime() is very fast but is not flexible with unusual date formats. It will work fine for this dataset. The chart below lists the format codes you will encounter and how to use them to describe your datetime when you convert it with strptime().

Datetime formats

R Format  Meaning                          R Format  Meaning
%a        Abbreviated weekday              %A        Full weekday
%b        Abbreviated month                %B        Full month
%c        Locale-specific date and time    %d        Decimal day of the month
%H        Decimal hours (24 hour)          %I        Decimal hours (12 hour)
%j        Decimal day of the year          %m        Decimal month
%M        Decimal minute                   %p        Locale-specific AM/PM
%S        Decimal second                   %U        Decimal week of the year (starting on Sunday)
%w        Decimal weekday (0=Sunday)       %W        Decimal week of the year (starting on Monday)
%x        Locale-specific date             %X        Locale-specific time
%y        2-digit year                     %Y        4-digit year
%z        Offset from GMT                  %Z        Time zone (character)
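To see what a few of these codes produce, the sketch below formats one fixed timestamp and then parses a differently laid out string with the same codes. The timestamp, the second string, and the tz = "UTC" setting are all illustrative assumptions. Only numeric codes are used, since %a, %b, %p and the other locale-specific codes vary by machine.

```r
# One fixed timestamp; tz = "UTC" keeps the output independent of the
# machine's time zone.
x <- as.POSIXct("2014-09-01 13:05:09", tz = "UTC")

# Apply each numeric code on its own to see what it extracts.
codes <- c("%Y", "%y", "%m", "%d", "%H", "%I", "%M", "%S", "%j", "%w")
parts <- vapply(codes, function(code) format(x, code), character(1))
print(parts)

# The same codes describe the layout when parsing in the other direction.
# Literal characters in the format (".", " ", "h") must match the input.
parsed <- strptime("01.09.14 13h05", "%d.%m.%y %Hh%M", tz = "UTC")
print(parsed)
```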

parse_date_time() from the lubridate package is a more forgiving datetime function. It accepts a wider range of datetime formats, and you only describe the order of the components (for example 'mdy HMS') instead of building an exact format string from the chart above. However, this flexibility makes the function a lot slower to work with. If your datetimes follow a uniform pattern, strptime() will work just fine.
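The difference shows up when a column mixes layouts. The sketch below is an illustration on two hypothetical strings, with tz = "UTC" assumed for reproducibility. It is wrapped in a requireNamespace() check so it simply does nothing when lubridate is not installed.

```r
if (requireNamespace("lubridate", quietly = TRUE)) {
  # Two hypothetical strings in different layouts.
  mixed <- c("9/1/2014 0:01:00", "2014-09-01 00:03:00")

  # strptime() applies one format to every element, so the second string,
  # which does not match month/day/year, comes back as NA.
  strict <- strptime(mixed, "%m/%d/%Y %H:%M:%S", tz = "UTC")
  print(is.na(strict))

  # parse_date_time() accepts several candidate orders and tries them
  # against each element.
  flexible <- lubridate::parse_date_time(mixed,
                                         orders = c("mdy HMS", "ymd HMS"),
                                         tz = "UTC")
  print(flexible)
}
```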

Aggregate Data

Let’s look at trip volume per day by base.

  • Extract date from datetime:
    • uber$date = substr(uber$Date.Time, 1,10)
  • date is now a chr, convert it to POSIXlt:
    • uber$date = strptime(uber$date, "%Y-%m-%d")
  • Aggregate the data:
    • table(uber$date,uber$Base)
Date        B02512  B02598  B02617  B02682  B02764
2014-09-01     638    4626    7940    3677    3080
2014-09-02    1188    6970   11642    5729    3302
2014-09-03    1284    8079   13019    6462    3787
2014-09-04    1513    9412   15185    7670    4580
2014-09-05    1808   10036   16472    8660    5343
2014-09-06    1580    9848   15573    8132    5387
2014-09-07    1124    7122   11639    5983    4266
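The aggregation steps above can be tried on a small sample. This sketch swaps in as.Date(), which drops the time of day in a single call, in place of the substr() plus strptime() pair. The four rows are hypothetical, and tz = "UTC" is an assumption made for reproducibility.

```r
# Hypothetical sample with the datetime already converted to POSIXct.
uber <- data.frame(
  Date.Time = as.POSIXct(c("2014-09-01 00:01:00", "2014-09-01 08:30:00",
                           "2014-09-02 09:15:00", "2014-09-02 23:59:00"),
                         tz = "UTC"),
  Base = c("B02512", "B02598", "B02512", "B02512")
)

# as.Date() keeps only the calendar day. Passing tz explicitly avoids a day
# shift for trips close to midnight.
uber$date <- as.Date(uber$Date.Time, tz = "UTC")

# Cross-tabulate: one row per day, one column per base, cells are trip counts.
daily <- table(uber$date, uber$Base)
print(daily)
```

A Date column also sorts and plots chronologically, which is what the per-day chart later on relies on.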

We might want this to be in a data frame, so we can play around with it.

  • trips = data.frame(table(uber$date,uber$Base))
    • Notice that our data is saved in a long format. It’s not as nicely formatted as the chart above. We’ll revisit this later when we graph our data.
  • trips = spread(trips,Var2,Freq) This reformats our data nicely.

Now our data frame should look something like the chart above, but the date column is named Var1. colnames(trips)[1] <- "date" That should fix it!
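The reshaping and renaming steps can be followed end to end on a small sample. The five rows below are hypothetical. The code is wrapped in a requireNamespace() check so it is skipped when tidyr is not installed.

```r
if (requireNamespace("tidyr", quietly = TRUE)) {
  # Hypothetical sample: one row per trip.
  uber <- data.frame(
    date = c("2014-09-01", "2014-09-01", "2014-09-02", "2014-09-02", "2014-09-02"),
    Base = c("B02512", "B02598", "B02512", "B02512", "B02598")
  )

  # data.frame(table(...)) is long: one row per date/base pair, with the
  # default column names Var1, Var2 and Freq.
  trips <- data.frame(table(uber$date, uber$Base))
  print(trips)

  # spread() turns each value of Var2 into its own column filled from Freq.
  trips <- tidyr::spread(trips, Var2, Freq)

  # Rename the first column. The index goes outside the parentheses so that
  # the names of the data frame itself are modified.
  colnames(trips)[1] <- "date"
  print(trips)
}
```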

Graph

Now it's time to graph. You want your data in a long format, like what trips = data.frame(table(uber$date,uber$Base)) gave you. You can just re-run this code, or you can use the gather() function to bring our data frame back to its original state:

graph = gather(trips,'B02512','B02598', 'B02682', 'B02764', 'B02617', key = "Base", value = "Trips")
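As a side note, tidyr now documents gather() as superseded by pivot_longer(). The sketch below does the same wide-to-long step with that function on a two-day, two-base sample whose counts are taken from the table earlier in the post. It is a swapped-in alternative rather than the method used in this tutorial, and it is skipped when tidyr is not installed.

```r
if (requireNamespace("tidyr", quietly = TRUE)) {
  # Small wide table: one row per day, one column per base.
  trips <- data.frame(
    date   = c("2014-09-01", "2014-09-02"),
    B02512 = c(638, 1188),
    B02598 = c(4626, 6970)
  )

  # Every column except date is stacked: its name goes to Base and its
  # value goes to Trips.
  graph <- tidyr::pivot_longer(trips,
                               cols = -date,
                               names_to = "Base",
                               values_to = "Trips")
  print(graph)
}
```

The rows come out in a different order than with gather() (grouped by date rather than by base), which does not matter for plotting.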

date        Base    Trips
2014-09-01  B02512    638
2014-09-02  B02512   1188
2014-09-03  B02512   1284
2014-09-04  B02512   1513
2014-09-05  B02512   1808
2014-09-06  B02512   1580

It looks like our dates have turned back into factors. No worries: since they are already in the ISO 8601 format, converting is a breeze.

graph$date = as.POSIXct(as.character(graph$date))

After all our hard work, we can finally graph our data!

ggplot(graph, aes(x = date, y = Trips, color = as.factor(Base))) + geom_line()
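For a version that runs without the full dataset, the sketch below rebuilds a small long-format data frame from counts in the table earlier in the post, converts the date column, and draws the same kind of line chart. The axis labels and tz = "UTC" are assumptions added for illustration. The code is skipped when ggplot2 is not installed.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Three days for two bases, with the dates stored as a factor to mimic
  # the state of the data after reshaping.
  graph <- data.frame(
    date  = factor(rep(c("2014-09-01", "2014-09-02", "2014-09-03"), times = 2)),
    Base  = rep(c("B02512", "B02598"), each = 3),
    Trips = c(638, 1188, 1284, 4626, 6970, 8079)
  )

  # factor -> character -> POSIXct, so the x axis is a real time scale.
  graph$date <- as.POSIXct(as.character(graph$date), tz = "UTC")

  # One line per base; the column names must match the data frame exactly,
  # including case.
  p <- ggplot(graph, aes(x = date, y = Trips, color = Base)) +
    geom_line() +
    labs(x = "Date", y = "Trips per day", color = "Base")
  print(p)
}
```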

Jacky Lam
Senior Data Engineer

Data engineer focused on analytics platforms and data products in Snowflake and dbt. Works across marketing, product, and analytics — turning messy, multi-source event streams into models teams can confidently make decisions with.