Recently I have been using RStudio’s dplyr package more and more. I started using the package mostly because of the convenience of having all the manipulations I want to perform on a data set in one place. I had also started using “pipes” with the ggvis package, so I liked the way the code looked as well. Frankly, I have gotten quite addicted to writing piped code. Anyway, I started to notice that dplyr plowed through the work much faster than the base R functions did. I have been told this is because dplyr leverages an optimized C++ backend and other speed-up approaches. So I thought I would test it out and see what the difference really is. I have a new MacBook Pro, so all my results are from that machine. I also use the hflights data set, the same one used in the dplyr examples.
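To show what I mean by piped code, here is a small sketch (assuming dplyr and hflights are loaded, as they are below) of the same operation written first as a nested call and then as a pipeline:
# Nested style: read inside-out
head(filter(hflights, Distance > 1500))
# Piped style: read top to bottom
hflights %>%
    filter(Distance > 1500) %>%
    head()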
First I need to require the dplyr and hflights packages. The hflights data set is reasonably large.
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(hflights)
## Loading required package: hflights
dim(hflights)
## [1] 227496 21
The base R command subset() is similar to filter() in dplyr, so let’s compare their performance.
ptm1<-proc.time()
a<-subset(hflights, Distance>1500)
ptm1<-proc.time()-ptm1
ptm2<-proc.time()
b<-hflights %>%
filter(Distance>1500)
ptm2<-proc.time()-ptm2
print(ptm1)
## user system elapsed
## 0.023 0.007 0.030
print(ptm2)
## user system elapsed
## 0.009 0.001 0.011
That is roughly a threefold decrease in elapsed time!
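A single proc.time() measurement is noisy, though, so the ratio can bounce around from run to run. A more robust comparison (a sketch, assuming the microbenchmark package is installed) repeats each expression many times and summarizes the distribution of timings:
require(microbenchmark)
# Run each expression 100 times and report summary statistics
microbenchmark(
    base  = subset(hflights, Distance>1500),
    dplyr = hflights %>% filter(Distance>1500),
    times = 100
)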