Big Computing: July 2015

Friday, July 17, 2015

Performance comparison of subset() in base R to filter() in the dplyr package

Recently I have been using RStudio's dplyr package more and more. I started using the package mostly because of the convenience of having all the manipulations I want to perform on a data set in one place. I had also started to use "pipes" with the ggvis package, so I liked the way the code looked as well. Frankly, I have gotten quite addicted to writing piped code. Anyway, I started to notice that dplyr plowed through the work much faster than the base R functions. I have been told this is because dplyr's core verbs are implemented in optimized C++ under the hood. So I thought I would test it out and see what the difference really is. I have a new MacBook, so all my results are off of that. I also use the hflights data set, the same one used in the dplyr examples.
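As a quick illustration of what I mean by piped code, here is a minimal sketch using the built-in mtcars data set (mtcars and the 25 mpg threshold are just illustrative choices, not part of the benchmark below):

```r
library(dplyr)

# Nested base-R style: you read it inside out
frugal_base <- head(mtcars[mtcars$mpg > 25, ], 3)

# Piped style: you read it left to right, top to bottom
frugal_piped <- mtcars %>%
  filter(mpg > 25) %>%
  head(3)
```

Both produce the same three rows; the piped version just states the steps in the order they happen.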
First I need to require the dplyr and hflights packages. The hflights data set is reasonably large.
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(hflights)
## Loading required package: hflights
dim(hflights)
## [1] 227496     21
The command subset() is similar to filter() in dplyr, so let's compare their performance.
# Time base R's subset()
ptm1 <- proc.time()
a <- subset(hflights, Distance > 1500)
ptm1 <- proc.time() - ptm1

# Time dplyr's filter() via the pipe
ptm2 <- proc.time()
b <- hflights %>%
  filter(Distance > 1500)
ptm2 <- proc.time() - ptm2

print(ptm1)
##    user  system elapsed 
##   0.023   0.007   0.030
print(ptm2)
##    user  system elapsed 
##   0.009   0.001   0.011
That is roughly a threefold decrease in elapsed time!
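A caveat: a single proc.time() measurement on a call this fast is noisy, so a fairer test would average over many repetitions (the microbenchmark package is the usual tool for this). Here is a base-R sketch; time_it is just an illustrative helper written for this post, not part of dplyr:

```r
# Average elapsed time of f over reps runs; a single measurement is noisy.
# time_it is an illustrative helper, not a standard function.
time_it <- function(f, reps = 50) {
  elapsed <- system.time(for (i in seq_len(reps)) f())["elapsed"]
  unname(elapsed) / reps
}

# Example usage (after loading dplyr and hflights as above):
# time_it(function() subset(hflights, Distance > 1500))
# time_it(function() hflights %>% filter(Distance > 1500))
```

If the averaged numbers still show the same gap, the speedup is real and not a fluke of one run.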