Big Computing: October 2014

Tuesday, October 28, 2014

An example of using Random Forest in Caret with R.

Here is an example of using Random Forest in the Caret package with R.
First, load the required packages:
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Read in the training and test sets:
training_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(training_URL,na.strings=c("NA",""))
test<-read.csv(test_URL,na.strings=c("NA",""))
Then I got rid of the columns that are simply an index, timestamp, or username:
training<-training[,7:160]
test<-test[,7:160]
Next, remove the columns that are mostly NAs. They could be useful in the model, but it is easier to cut the data frame down and see if the smaller set still gives good results.
# keep only the columns with more than 19,621 non-missing values (the training set has 19,622 rows)
mostly_data<-apply(!is.na(training),2,sum)>19621
training<-training[,mostly_data]
test<-test[,mostly_data]
dim(training)
## [1] 19622    54
I partitioned the training set into a smaller set called training1, mainly to speed up the running of the model:
InTrain<-createDataPartition(y=training$classe,p=0.3,list=FALSE)
training1<-training[InTrain,]
Then I used caret to fit a random forest with 5-fold cross-validation:
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
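Note that allowParallel=TRUE only takes effect if a parallel backend has been registered first. Here is a minimal sketch of doing that with the doParallel package (assuming it is installed; this is not part of the original post):
# Hedged sketch: register a parallel backend so caret's allowParallel=TRUE can use it
require(doParallel)
cl<-makeCluster(2)      # adjust the number of workers to your machine
registerDoParallel(cl)
# ... run the train() call above ...
stopCluster(cl)
Printing the fitted model shows the cross-validation results: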
print(rf_model)
## Random Forest 
## 
## 5889 samples
##   53 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008   
##   27    1         1      0.005        0.006   
##   53    1         1      0.006        0.007   
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
print(rf_model$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,      allowParallel = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.88%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554
That is an amazingly good model: roughly 99% accuracy (an OOB error rate of 0.88%)! In my real work I am usually happy with anything above 0.7.
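The test set that was read in above is never actually used in this post. As a rough sketch (mine, not part of the original analysis), the fitted model could be applied to it with caret's predict method:
# Hedged sketch: predict the class of the held-out test cases with the fitted model
test_predictions<-predict(rf_model,newdata=test)
test_predictions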

Friday, October 24, 2014

Where does the Boston Bruins' Bar Bill rank among the biggest bar bills ever?

In 2011 the Boston Bruins went to a bar at the MGM Grand Casino in Connecticut after they won the Stanley Cup and spent $156K. That seemed to me like a crazily massive bill, although the bulk of it was made up of one massively expensive bottle of wine. I wrote about that back in 2011. The funny thing is that this bill ranks only sixth among the largest bar bills ever! Most of the other bills are from Russian billionaires and investment bankers, but there is a bill from LeBron James in Las Vegas that is larger than the Bruins' bill by about $30K. I do not think LeBron will top that bill back in Cleveland, because they just do not have that kind of expensive booze in Ohio. Here is the list of the most expensive bills ever.

The funny thing about all these events is that a major portion of the bill is made up of a ridiculously expensive large-format bottle of wine (usually champagne). Without these parties those bottles would never sell. After the party the bottle is usually signed and put on display at the club, which then uses the party to advertise itself, as the MGM does with the Bruins' bottle. It makes you wonder whether these people ever really paid for these parties or whether there was some side advertising deal. Party on...

Thursday, October 23, 2014

A Project using the data from the Boston Subway System - MBTA

I was in Boston yesterday, and the group I was in ended up talking about the Boston subway system, the MBTA. We ended up talking about projects that have used the recently opened-up data on the MBTA system. One of the best visualizations I have seen of the system is by Mike Barry and Brian Card on GitHub. I provide the link here for all those who want to enjoy their work.

Tuesday, October 21, 2014

Coursera Data Mining Class uses the Caret Package by Max Kuhn

I have been taking the Coursera Data Science track for fun over the last couple of months. Each class is about a month long, and it is all in R, which is great. Although the classes are fairly basic, I have found them enjoyable to do, and some of their examples have given me nicer ways to do things than how I have done those operations in the past. The eighth class in the series is called Practical Machine Learning. So far it has been a great ride through Max Kuhn's Caret package. I have been using this package since 2008, and I have always believed that Caret has become the de facto machine learning package in R. Part of the reason for this is that it contains something like 187 different models. The main reason for me is that the unified interface to those models makes it easy to try models that you are not an expert in. This makes the modeling process better, because it used to be that people only used the models they knew well and used often, which might not be the best models for the data they are working on. Caret lowers the barrier to trying different models and opens the door to better and more robust prediction.
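To give a flavor of that unified interface, here is a hedged sketch (mine, not from the class) that switches between two very different models by changing only the method argument. It uses the Sonar data from the mlbench package purely as a stand-in example, and assumes the mlbench, randomForest, and gbm packages are installed.
# Hedged sketch: the same train() interface fits very different models
require(caret)
require(mlbench)
data(Sonar)
ctrl<-trainControl(method="cv",number=5)
fit_rf<-train(Class~.,data=Sonar,method="rf",trControl=ctrl)
fit_gbm<-train(Class~.,data=Sonar,method="gbm",trControl=ctrl,verbose=FALSE)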

If you have never used the Caret package, you should try it in the Coursera class. If you have and want to learn more, here is the website. There is also Max's book that uses Caret, Applied Predictive Modeling.

Sunday, October 19, 2014

Boston R Meetup with Bryan Lewis, R, Shiny, RStudio and SciDB

Bryan Lewis is Paradigm4's Chief Data Scientist and a partner at Big Computing. He is the author of a number of R packages including SciDB-R, and has worked on many performance-related aspects of R. Bryan is an applied mathematician, kayaker, and avid amateur mycologist. 

For years Bryan has been excited about the potential of computational databases like SciDB to provide fast analytics on truly massive data sets. I have seen Bryan give this talk, and it is well worth your time. In addition, he has added RStudio and Shiny to the talk, which makes this a must-attend meetup in Boston on Wednesday night. Shiny has really enhanced the ability to communicate results to non-experts through interactive visualizations that are easy to create and modify.

Here is the abstract to his talk:


R is a powerful system for computation and visualization widely used in biostatistics and analyses of genomic data. An active and engaged research community continuously expands R's capabilities through thousands of available packages on CRAN and Bioconductor. 
SciDB is a scalable open-source database used in large omics workloads like the NCBI 1000 genomes project and other genomic variant applications, applications derived from the cancer genome atlas, and more. 
The `scidb` package for R available on CRAN lets R researchers use SciDB on very large datasets directly from R without learning a new database query language. Bryan will use RStudio, Shiny, and SciDB to demonstrate a few common, real-world genomics analysis workflows in use today by SciDB-R users. We'll see basic techniques like enrichment problems using fast parallel Fisher tests and also more challenging problems like large-scale correlation and network analysis. 
Here is the link to the meetup.

Friday, October 17, 2014

R Statistical Programming Training and the Shift in R Consulting.

At Big Computing we have been R consultants since 2009. We have provided consulting in a variety of verticals to build, improve, or speed up our customers' algorithms, and we have provided R training. Over the last few years there has been a shift in the data science consulting business that no one is talking about.

In 2009 working on algorithms represented 90% of our business, training 10% and visualization 0%. This distribution was similar for all the R consulting groups that I knew of at the time and held that way for a number of years.

Starting in 2012 and continuing to today, algorithm building has become a smaller and smaller part of the business, while the demand for training and interactive visualization tools has exploded.
How big has that shift been? I know of one consulting group that has stopped doing anything but training and had to triple in size to meet the demand.

We have experienced a similar shift; our business is now made up mostly of training and visualization work, although Big Computing has not abandoned algorithmic work. Why? Because we believe that after the people we have trained develop their skills, they will want and need to do more, and we can only help them do that if we are already doing that more sophisticated work with the most current developments.


Underneath this change, I feel there is also a shift in the data science world away from the most accurate models toward more standard, robust, and interpretable modeling approaches. These models tend to be easy to visualize interactively and to share with non-data-scientists, not just as results but in a genuinely informative way. This is a powerful shift, because the simpler approaches can be used effectively by a much larger group of people. For example, if I use a simple model in R, I can typically fit it with one line of code, versus the many lines of code needed for some type of ensemble method. Anyone can write one line of code; all they need is a little bit of training, which is exactly what we are giving them.
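As a hedged illustration of that one-line point (my example, using R's built-in mtcars data, not a client problem):
# A simple, interpretable model really is one line of R
fit<-lm(mpg~wt+hp,data=mtcars)
summary(fit)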

Wednesday, October 15, 2014

Interesting Article on Making Money in Fantasy Sports.

The New York Times posted a story about a guy who has made a reasonable profit on fantasy sports sites. Here is a link to the site.

I have done work in this area both for fun and for profit. Years ago Drew Conway started a machine learning fantasy football league, which I played in. I think it is fair to say that the results of that league were unimpressive, even though it included some of the most successful predictive analytics people in the country. Later we did some work on picking out inefficiencies in the sports betting market, which was successful: it made a little bit of money week after week. It was an interesting project, but it simply was not profitable enough to justify the work involved.

Over time we learned a little about which sports are easier to predict. The hardest of the pro sports by far was the NFL; the relatively small number of games is a big reason for this. However, the NFL is not equally difficult throughout the season: the first three and last three weeks are the hardest to predict, with the middle weeks being reasonable. Interestingly, we found the NBA, with its numerous games, small number of players, and other factors, was consistently the best data to work with, either at the player level for fantasy or at the team level for sports betting. We also found that the sports betting market for the NBA was efficient compared to other sports, so any edge needed to be taken advantage of quickly. I found it interesting that the NBA is the sport the New York Times article focused on, because it is where I would focus as well.


Sunday, October 12, 2014

Interactive map of the Spread of ISIS

I ran across this map on the international security website. It is a nice, clean interactive map, and it also shows how massive an area ISIS is trying to control. They will need a lot of people to do that. Here is the link to the map. LINK

Friday, October 10, 2014

Using the Pairs Plot Function in Base R

The pairs plot compares each variable in the selected data with every other variable on a one-to-one basis. It is a great first pass to see if a relationship between two variables jumps out at you. To show this I will create four variables. The first is drawn from a uniform distribution. The second depends on the first in a linear relationship. The third has a squared relationship with the first. Finally, the fourth has a cubic relationship with the first.
indep<-runif(300,-25,25)
linear<-2*indep+rnorm(300,0,3)
exp2<-.05*indep^2+rnorm(300,0,2)
exp3<-indep^3/1000+rnorm(300,0,1)
samples<-as.data.frame(cbind(indep,linear,exp2,exp3))
Now plot them using pairs() and see what it produces:
pairs(~indep+linear+exp2+exp3,data=samples)
[Figure: pairs plot of indep, linear, exp2, and exp3]
The plots in this case do a pretty good job of showing us what is going on. As an interesting aside, if we just call the plot() function on the data frame samples, R produces the same pairs plot for us.
plot(samples)
[Figure: the same pairs plot produced by plot(samples)]
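As a small extension not in the original post, pairs() also accepts panel functions, which can make the relationships stand out even more. A hedged sketch that overlays a lowess smoother on the lower panels:
# Hedged sketch: add a smoother to each lower-panel scatterplot
pairs(samples,lower.panel=panel.smooth,upper.panel=points)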