Recently I have been using RStudio’s dplyr package more and more. I started using the package mostly because of the convenience of having all the manipulations I want to use on a data set in one place. I had also started to use “pipes” with the ggvis package, so I also liked the way the code looked. Frankly, I have gotten quite addicted to writing piped code. Anyway, I started to notice that dplyr really plowed through the work much faster than the base R functions. I have been told this is because dplyr leverages data.table and other speed-up approaches. So I thought I would test it out and see what the difference really is. I have a new Mac powerbook, so all my results come from that machine. I also use the hflights data set that is used in the dplyr examples.

First I need to require the dplyr and hflights packages. The hflights data set is reasonably large.

`require(dplyr)`

```
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

`require(hflights)`

`## Loading required package: hflights`

`dim(hflights)`

`## [1] 227496 21`

The base R function subset() is similar to filter() in dplyr, so let's compare their performance.

```
# Time base R subset()
ptm1 <- proc.time()
a <- subset(hflights, Distance > 1500)
ptm1 <- proc.time() - ptm1

# Time dplyr filter() on the same condition
ptm2 <- proc.time()
b <- hflights %>%
  filter(Distance > 1500)
ptm2 <- proc.time() - ptm2

print(ptm1)
```

```
## user system elapsed
## 0.023 0.007 0.030
```

`print(ptm2)`

```
## user system elapsed
## 0.009 0.001 0.011
```

That is roughly a threefold decrease in elapsed time (0.030 s for subset() versus 0.011 s for filter())!
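A single run that takes a few hundredths of a second is easy to distort with caching or background load, so it is worth repeating the measurement. Below is a minimal sketch of how that could be done, not something I ran for the numbers above. The helper `time_runs()` and the choice of 20 repetitions are my own assumptions for illustration; only `system.time()`, `subset()`, and `filter()` are existing functions.

```r
library(dplyr)
library(hflights)

# Hypothetical helper (not part of dplyr or base R): call `expr_fn` n times
# and return the elapsed seconds of each call.
time_runs <- function(expr_fn, n = 20) {
  vapply(
    seq_len(n),
    function(i) system.time(expr_fn())[["elapsed"]],
    numeric(1)
  )
}

base_times  <- time_runs(function() subset(hflights, Distance > 1500))
dplyr_times <- time_runs(function() hflights %>% filter(Distance > 1500))

# The median is less sensitive to one unusually slow run than a single timing.
print(median(base_times))
print(median(dplyr_times))

# Sanity check: both approaches should keep the same number of rows.
same_rows <- nrow(subset(hflights, Distance > 1500)) ==
  nrow(filter(hflights, Distance > 1500))
print(same_rows)
```

The exact timings will differ from machine to machine, so the ratio between the two medians is more informative than either number on its own.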