Big Computing: 2015

Wednesday, August 5, 2015

A Controversy Over the Method Used to Select Candidates for the First Republican Presidential Debate

What a surprise that after the two groups were announced for the first Republican Presidential Debate there is controversy over the method of selection. The only thing that does ring true to me is that, with a margin of error of +/- 3.7% in these polls, the only candidates we can place in the top ten with any confidence are Trump, Walker, Bush and "I don't know". In fact "I don't know" is in a statistical tie with everyone except Trump. Here is the latest Quinnipiac poll. Admittedly "I don't know" may be doing so well because he/she is such an unknown with no public record.
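The "statistical tie" reasoning can be sketched in R as a rule of thumb: two shares count as tied when their margin-of-error intervals overlap, which happens when they differ by no more than twice the margin. The shares below are hypothetical placeholders, not the Quinnipiac numbers, and is_tie is a helper name made up for this illustration. A formal comparison of two candidates in the same poll would test the difference directly, so treat this as a rough sketch only.

```r
# Margin of error in percentage points, as quoted in the post.
moe <- 3.7

# Hypothetical shares for illustration only; NOT the Quinnipiac results.
shares <- c(A = 20, B = 13, C = 10, Undecided = 12)

# Rule of thumb: the +/- moe intervals overlap when the gap between
# two shares is at most 2 * moe.
is_tie <- function(x, y, moe) abs(x - y) <= 2 * moe

# Compare every entry against the "Undecided" share.
ties_with_undecided <- sapply(shares, is_tie,
                              y = shares[["Undecided"]], moe = moe)
print(ties_with_undecided)
```

With these made-up numbers only candidate A sits clearly apart from "Undecided"; everyone else falls inside the overlap.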

All kidding aside, I did not think any method could be chosen without someone crying foul or unfair. This is because there really is no best way to do this. You have to pick a method and go with it. Now for me they did not choose the most interesting way to do this. I mean presidential debates are like the NFL or NBA drafts of the 1950s. No one watches and no one cares except to read the news opinions the next morning about who won and who lost. I do not think this is a good situation, so the aim should be to increase interest in and viewership of the Presidential Debates. The NFL and NBA have a model for how to do this. Have a draft and/or a lottery.

Imagine if Fox had held this event the Sunday before the debate. It would be new, different and certainly buzzworthy. All the candidates waiting on stage to see when their name gets drawn and which group they are in. It would be awesome! That would be the option if the names were simply drawn and the candidates were randomly assigned and randomly selected. I think having an actual lottery would be even more interesting. Each candidate would get to select which group they would like to be in. The selection order could be random, or they could select in order of ranking in the latest polls. Now there is strategy involved, and strategy is a rich and engaging area for discussion and commentary. Imagine if Trump selects the prime time debate first. If you select second, do you choose to be in the same debate with Trump and all his showmanship, or do you select the other group to avoid him? As the groups fill up, do you select the group with candidates similar to you or the group most different? What a rich environment for the political pundits! Also the post-debate analysis would include not only how the candidates did in the debate, but how their selection strategy worked out.

Friday, July 17, 2015

Performance comparison of subset in R to filter in the dplyr package

Recently I have been using RStudio’s dplyr package more and more. I started using the package mostly because of the convenience of having all the manipulations I want to use on the data set in one place. I had also started to use the “pipes” with the ggvis package, so I liked the way the code looked as well. Frankly I have gotten quite addicted to writing piped code. Anyway, I started to notice that dplyr really plowed through the work much faster than the base R functions. I have been told this is because dplyr leverages data.table and other speed-up approaches, although for plain data frames the gain comes mainly from dplyr's compiled C++ code. So I thought I would test it out and see what the difference really is. I have a new Mac powerbook so all my results are off of that. I also use the hflights data set that is used in the dplyr examples.
First I need to require the dplyr and hflights packages. The hflights data set is reasonably large.
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(hflights)
## Loading required package: hflights
dim(hflights)
## [1] 227496     21
The command subset() is similar to filter() in dplyr. So let's compare their performance.
ptm1<-proc.time()
a<-subset(hflights, Distance>1500)
ptm1<-proc.time()-ptm1

ptm2<-proc.time()
b<-hflights %>%
  filter(Distance>1500)
ptm2<-proc.time() -ptm2

print(ptm1)
##    user  system elapsed 
##   0.023   0.007   0.030
print(ptm2)
##    user  system elapsed 
##   0.009   0.001   0.011
That is roughly a threefold decrease in elapsed time (0.030 s for subset() versus 0.011 s for filter())!
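A single proc.time() difference on a call this fast can be dominated by timer noise, so it may be worth repeating each call and averaging. Here is a minimal sketch of that idea, assuming a synthetic data frame with a Distance column in place of hflights so that it runs without the data package. The repetition count n_reps and the name flights are arbitrary choices for illustration, and the dplyr timing only runs if dplyr is installed.

```r
# Synthetic stand-in for hflights (an assumption, not the real data).
set.seed(1)
flights <- data.frame(Distance = sample(100:4000, 2e5, replace = TRUE))

n_reps <- 20  # arbitrary repetition count chosen for illustration

# Time base R subset() over n_reps repetitions.
time_subset <- system.time(
  for (i in seq_len(n_reps)) a <- subset(flights, Distance > 1500)
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # Time dplyr::filter() over the same number of repetitions.
  time_filter <- system.time(
    for (i in seq_len(n_reps)) b <- dplyr::filter(flights, Distance > 1500)
  )
  # Both approaches should keep the same number of rows.
  stopifnot(nrow(a) == nrow(b))
  # Average elapsed seconds per call for each approach.
  elapsed <- c(subset = time_subset[["elapsed"]],
               filter = time_filter[["elapsed"]]) / n_reps
  print(elapsed)
}
```

The numbers will vary by machine and package version, so the ratio is more informative than the absolute times.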