Big Computing: 2014

Tuesday, December 9, 2014

R for Everyone by Jared Lander

This last week I have been re-reading R for Everyone by Jared Lander. I got a copy from Jared when the book first came out. It fills a need in the R community: getting new R users up to speed quickly so they can start doing real work as soon as possible.

Historically the learning curve for R has been a real impediment to developing new users. R has a steep learning curve because it sits in an environment between statistical programming and computer programming, which puts it a little outside the comfort zone of the typical people who use it.

Lately that has begun to change. The introduction of RStudio to the R world has made using R much easier by automating all the support tasks needed to work productively in R. Gone are the days of using a mashup of text editors, command lines, bash scripts and multiple open windows to do an R project. Now it is all clean and organized through RStudio, so all the R user has to worry about is the data and the programming in R.

Coursera has done a good job of bringing completely new users up to speed in R with their Data Science Specialization. If you are starting from ground zero it is the way to go. It will get you started in a solid and robust way. The only issue is that the Specialization is nine one-month classes. If you have some experience in this area, nine months is too long to get up to speed, and may be painful review rather than true learning.

The solution is Jared's book. If you have some basic programming or statistical background, you can start doing real work right after reading this book. That is a week at most, not nine months. No, Jared's book does not cover everything a user might do in R, but it covers the 20% of R that users work with 99% of the time. There is nothing exotic or special here. It is just the basics to get rolling, and that is exactly what new users need.

So if you have not done much work in R and want to get off to a quick start with some solid basics, I would strongly recommend R for Everyone.

Monday, December 1, 2014

Best New Board Game

I do not usually blog about board games, but games and game theory are an important part of data science. In fact, I would say that a good game player makes a good data scientist.

Recently the blog BoardGameGeek posted an article about a game called Nika. Nika comes out of a new generation of game designers who are returning board games to a simpler aesthetic. Gone are the stacks of cards, piles of dice and the rest of the entourage of pieces that came with a board game of my generation. However, do not let the simplicity of the design fool you into believing this is an easy game easily mastered. Nika is as complex and challenging a game as the players wish to make it. The game is also a 2013 Ion Award winner for best strategy game.

The game itself was thought up by designer Josh Raab. Josh is currently a Masters student studying game design at NYU. He designed the game and originally put it on Kickstarter where it was easily funded. You can buy the game from the eaglegames website.

I hope you enjoy the game and the time away from the computer. If you want to know more about the game designer here is a link to his blog.


Tuesday, November 18, 2014

An example of a Shiny App using a rCharts scatter plot

I was finally able to do an example of a Shiny App using the rCharts scatter plot. Switching this example from a standard R scatter plot to rCharts took me far longer than I expected. Now that I have finished it, I am not exactly sure why it was so challenging, because the final code looks pretty simple and clean. I also changed my code from the original app so that instead of two R code files (ui.R and server.R) there is only one (app.R). Even that took a little while because I messed up the commas. For this example I used the mtcars data set and allow the user to select the Y variable, the X variable, the color variable and a variable for faceting into multiple plots.

Here is a link to the shiny app with embedded rChart


Also here is the code to create this shiny App:

require(shiny)
require(rCharts)
require(datasets)

server<-function(input,output){
  output$myChart<-renderChart({
    p1<-rPlot(input$x,input$y, data=mtcars,type="point",color=input$color,facet=input$facet)
    p1$addParams(dom="myChart")
    return(p1)
  })
}

ui<-pageWithSidebar(
  headerPanel("Motor Trend Cars data with rCharts"),
  sidebarPanel(
    selectInput(inputId="y",
                label="Y Variable",
                choices=names(mtcars)),
    selectInput(inputId="x",
                label="X Variable",
                choices=names(mtcars)),
    selectInput(inputId="color",
                label="Color by Variable",
                choices=names(mtcars[,c(2,8,9,10,11)])),
    selectInput(inputId="facet",
                label="Facet by Variable",
                choices=names(mtcars[,c(2,8,9,10,11)]))
  ),
  mainPanel(
    showOutput("myChart","polycharts")
  )
)

shinyApp(ui=ui,server=server)



Hopefully you can use my code as a template to speed the time it takes you to learn how to do this.

Thursday, November 13, 2014

An example of a scatter plot in R using rCharts

Recently at the Connecticut R Users Meetup, Aaron Goldenberg gave a talk with examples of using rCharts to do plots in R. rCharts is one of a group of R packages that leverage javascript graphics libraries to make them available in an easy way for R users. R has great graphics capability built in, but that graphics capability is static rather than interactive. Development in the last few years has created a number of ways to add that interactivity. rCharts wraps some of the better javascript graphics libraries, and RStudio's Shiny is a good example of making those graphics truly interactive. Combining the two produces powerful and engaging graphics, and I encourage anyone who is thinking about doing that to try it. The results are impressive and fairly easy to create.
I am going to do a few quick examples to get you started using this great package.
First, you will need to load the rCharts package. This is a little different because rCharts is not on CRAN, but it is on GitHub. In order to install packages from GitHub you will need to first install the devtools package from CRAN on your computer.
install.packages("devtools")
library(devtools)
Now we have the ability to download the rCharts package from GitHub.
install_github("rCharts","ramnathv")
library(rCharts)
Now we are ready to do some plots. Let's start with the old favorite mtcars data set that comes with base R and do a simple scatter plot.
require(rCharts)
rPlot(mpg~wt, data=mtcars,type="point")



It looks just like a normal scatter plot in R, but now when you move the cursor over a point in the plot, the data for that point pops up in a little box. This is so sweet. You can also make this scatter plot faceted:
rPlot(mpg~wt|am, data=mtcars,type="point")



So now we have created two side-by-side graphs where the data is further divided by whether the car is automatic or manual. We can divide the data one step further by making the colors of the points represent how many cylinders the car has.
rPlot(mpg~wt|cyl, data=mtcars,color="cyl",type="point")

Pretty powerful stuff. In basically one line of code I am able to create an interactive scatter plot showing four dimensions of a data set!

I am sorry that the screenshots are a little weak, but I wanted to show how the hover function works with the cursor, and that was the best I could do in the amount of time I had. I will try to do a better job in future posts.

Wednesday, November 12, 2014

Trend in Predictive Analytics

When I started working at Revolution Computing (now called Revolution Analytics), the trend in analytics was to create ever more sophisticated models. The needs of the day were for faster computers and parallel computing to provide the processing power for all the computation these models required. The data itself was relatively small at the time, as was the number of people with the skill to do this kind of work.

The trend continued until 2013. Everything was about more complex models and putting them together using ensemble methods to increase the predictive power of the model. Think about it: the Netflix Prize was won by a group that basically put together 107 models in one massive hairball. Its predictive power was better than anything else, but it was completely opaque as to how it worked, and it took forever to run. Netflix claims they never implemented the winning model, basically because the engineering effort to implement the algorithm did not justify the performance improvement over a simpler algorithm.

Here is the write up from their blog:

In 2006 we announced the Netflix Prize, a machine learning and data mining competition for movie rating prediction. We offered $1 million to whoever improved the accuracy of our existing system called Cinematch by 10%. We conducted this competition to find new ways to improve the recommendations we provide to our members, which is a key part of our business. However, we had to come up with a proxy question that was easier to evaluate and quantify: the root mean squared error (RMSE) of the predicted rating. The race was on to beat our RMSE of 0.9525 with the finish line of reducing it to 0.8572 or less.
A year into the competition, the Korbell team won the first Progress Prize with an 8.43% improvement. They reported more than 2000 hours of work in order to come up with the final combination of 107 algorithms that gave them this prize. And, they gave us the source code. We looked at the two underlying algorithms with the best performance in the ensemble: Matrix Factorization (which the community generally called SVD, Singular Value Decomposition) and Restricted Boltzmann Machines (RBM). SVD by itself provided a 0.8914 RMSE, while RBM alone provided a competitive but slightly worse 0.8990 RMSE. A linear blend of these two reduced the error to 0.88. To put these algorithms to use, we had to work to overcome some limitations, for instance that they were built to handle 100 million ratings, instead of the more than 5 billion that we have, and that they were not built to adapt as members added more ratings. But once we overcame those challenges, we put the two algorithms into production, where they are still used as part of our recommendation engine.

If you followed the Prize competition, you might be wondering what happened with the final Grand Prize ensemble that won the $1M two years later. This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then. In the remainder of this post we will explain how and why it has shifted.
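The "linear blend" in the quoted post is worth making concrete: given two models' predictions, you search for the mixing weight that minimizes RMSE. Here is a toy sketch in R; the data and error levels are made up for illustration, not Netflix's.

```r
set.seed(42)
actual   <- runif(200, min = 1, max = 5)        # true ratings
pred_svd <- actual + rnorm(200, sd = 0.40)      # stand-in for an SVD model's predictions
pred_rbm <- actual + rnorm(200, sd = 0.50)      # stand-in for an RBM model's predictions

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Try blends w*svd + (1-w)*rbm over a grid of weights
w <- seq(0, 1, by = 0.01)
blend_rmse <- sapply(w, function(wi)
  rmse(wi * pred_svd + (1 - wi) * pred_rbm, actual))

w[which.min(blend_rmse)]                   # best mixing weight on this toy data
min(blend_rmse) <= rmse(pred_svd, actual)  # blend is at least as good as either model alone
```

Because the grid includes w = 0 and w = 1, the blend can never do worse than the better single model, which is the whole appeal of even this simplest form of ensembling.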

That type of response signaled the beginning of a shift away from the "best" model toward simpler models that are easier to implement and easier to understand. That process is continuing today. The added side benefit is that now the modeling can be done by a much larger group of people. This change has helped address the growth in the size of the data and the lack of data scientists available to do the work.


Tuesday, November 11, 2014

My first Shiny App published to ShinyApps.io

Today I published my first Shiny App on ShinyApps.io. It is a very simple app that uses the built-in mtcars data set and plots two of the variables against each other. It also uses a third variable to color the points of the scatter plot. I added a plot of the linear regression fit for fun. I really enjoy creating Shiny Apps, and there has been some demand in the consulting business for them. I sometimes find the stricter format of Shiny to be frustrating. It looks like R, but it just does not let everything through like R does. As a non-programmer, one of the things I love about R is that it is so flexible and forgiving.
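The regression-fit overlay I mention is nothing exotic; outside of Shiny it is just a couple of lines of base R on the same mtcars data:

```r
# Scatter plot of two mtcars variables with a fitted regression line on top
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Fit the linear model and draw its line over the points
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red", lwd = 2)

coef(fit)   # intercept and slope of the fitted line
```

Inside a Shiny app the same plot/abline pair simply goes inside a renderPlot() call, with the variable names swapped for the widget inputs.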

Anyway, here is a link to my Shiny App on ShinyApps.io.

Shiny is a web framework for R created by the folks at RStudio. It basically allows you to deploy interactive R code through a webpage. The output can be text or graphs. The most common use is to deploy interactive graphs that let other people control the parameters of the plot using things called widgets. Widgets come in all kinds of varieties, but basically they are elements like check boxes and sliders that change what is being plotted. There is the additional option of using javascript plotting packages within the Shiny framework so that you can interact directly with some features of the plot itself. A good example of this is the rCharts package. I first started using rCharts because it lets the user hover directly over a plotted point and have a window open up with information about that point. That is a very useful feature. Although writing code for Shiny looks like R, I will warn you that Shiny is a little pickier about format than I am used to in R. That pickiness about things like commas caused me a little pain in the beginning, but once you get used to it, it is no big deal.
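A minimal sketch of the widget idea, assuming the shiny package is installed: one slider driving one reactive plot. The launch line is left commented so the app only starts when you want it to.

```r
library(shiny)

# One widget (a slider) feeding one reactive plot
ui <- fluidPage(
  sliderInput("n", "Number of points", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    # Re-runs automatically whenever the slider value changes
    plot(rnorm(input$n), rnorm(input$n),
         xlab = "x", ylab = "y", main = paste(input$n, "random points"))
  })
}

# shinyApp(ui, server)   # uncomment to launch in an interactive session
```

Every widget follows this pattern: an input function in the UI with an inputId, and a matching input$<id> read inside a render function on the server side.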

The Shiny website is an excellent source of information on Shiny. It provides a great tutorial to get you started. It also provides a gallery of existing Shiny apps that you can view and get ideas from. The gallery is also great because one of the best ways to learn Shiny is to use the existing examples as templates for your own app. This was very helpful to me. I just copied a few of these examples into my RStudio session and started modifying the code to change things: use a different data set, change the variables, add a regression line, etc.

Here is the link to Shiny

As you can see from the link to my app, the folks at RStudio have also created something called ShinyApps.io. ShinyApps.io is a place to deploy your Shiny apps on the web so anyone can view and interact with them. It is an excellent setup and ridiculously easy to use. You simply deploy your Shiny app from your RStudio environment, and there it is, up and running.

Monday, November 10, 2014

Revolution R Open

Recently Revolution Analytics released Revolution R Open. This is their open-source version of R that has some enhancements over basic open-source R. To me the most significant enhancement is the use of Intel's MKL libraries, which will speed up your computations, especially if you use Windows as your operating system. Here is the link to Revolution R Open where you can read more about it and download the software.



For those of you too lazy to click a link here is the text from Revolution Analytics blog announcing the release:

What is Revolution R Open

Revolution R Open is the enhanced distribution of R from Revolution Analytics.
Revolution R Open is based on the statistical software R (version 3.1.1) and adds additional capabilities for performance, reproducibility and platform support.
Just like R, Revolution R Open is open source and free to download, use, and share.
Revolution R Open includes:
Get Revolution R Open today! You can download and install Revolution R Open, free of charge.

R: A Complete Environment

R is a complete environment for data scientists.
Revolution R Open is built on R 3.1.1 from the R Foundation for Statistical Computing. R is the most widely-used language for statistics and data science, and is ranked the 9th most popular of all data science languages by the IEEE. R is used by leading companies around the world as part of data-driven applications in industries including finance, healthcare, technology, scientific research, media, government and academia.
The R language includes every data type, data manipulation, statistical model, and chart that the modern data scientist could ever need. Learn more about R here.

Total Compatibility

R developers have contributed thousands of free add-on packages for R, to further extend its capabilities for data handling, data visualization, statistical analysis and much more. Learn more about R packages here. Almost 6000 packages are available in CRAN (the Comprehensive R Archive Network), and you can browse packages by name or by topic area at MRAN. Even more packages can be found at GitHub (including the RHadoop packages to integrate R and Hadoop) or in the Bioconductor repository. All packages that support R 3.1.1 are compatible with Revolution R Open.
Revolution R Open is also compatible with R user interfaces including RStudio, which we recommend as an excellent IDE for R developers. Applications that include the capability to call out to R are also compatible with Revolution R Open. If you would like to integrate R into your own application, DeployR Open is designed to work with Revolution R Open.

Multithreaded Performance (MKL)

From the very beginning, R from the R Foundation was designed to use only a single thread (processor) at a time. Even today, R still works that way unless linked with multi-threaded BLAS/LAPACK libraries.
The machines of today offer so much more in terms of processing power. To take advantage of this, Revolution R Open includes by default the Intel Math Kernel Library (MKL), which provides these BLAS and LAPACK library functions used by R. Intel MKL is multi-threaded and makes it possible for so many common R operations, such as matrix multiply/inverse, matrix decomposition, and some higher-level matrix operations, to use all of the processing power available.
Our tests show that linking to MKL improves the performance of your R code, especially where many vector/matrix operations are used. See these benchmarks. Performance improves with additional cores, meaning you can expect better results on a four-core laptop than on a two-core laptop--even on non-Intel hardware.
MKL's default behavior is to use as many parallel threads as there are available cores. There’s nothing you need to do to benefit from this performance improvement--not a single change to your R script is required. Learn how to control or restrict the number of threads.
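You can get a rough sense of your own BLAS performance with a one-line benchmark. This sketch uses only base R; the absolute time depends entirely on your hardware and on which BLAS your R is linked against (reference BLAS, MKL, OpenBLAS, etc.).

```r
n <- 500
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)

# %*% is handed off to whatever BLAS R is linked to,
# so this line is where MKL's multithreading shows up
elapsed <- system.time(C <- A %*% B)["elapsed"]
print(elapsed)

# Sanity check one entry against a plain R dot product
stopifnot(all.equal(C[1, 1], sum(A[1, ] * B[, 1])))
```

Running the same script under stock R and under a build linked to MKL makes the difference concrete for your own machine.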

Reliable R code (RRT)

Most R scripts rely on one or more CRAN packages, but packages on CRAN change daily. It can be difficult to write a script in R and then share it with others, or even run it on another system, and get the same results. Changes in package versions can result in your code generating errors or, even worse, generating incorrect results without warning.
Revolution R Open includes the Reproducible R Toolkit. The MRAN server archives the entire contents of CRAN on a daily basis, and the checkpoint function makes it easy to install the package versions required to reproduce your results reliably.

Platform Support

Supported Platforms. Revolution R Open is built and tested on the following 64-bit platforms:
  • Ubuntu 12.04, 14.04
  • CentOS / Red Hat Enterprise Linux 5.8, 6.5, 7.0
  • OS X Mavericks (10.9)
  • Windows® 7.0 (SP 1), 8.0, 8.1, Windows Server® 2008 R2 (SP1) and 2012
Experimental Platforms. Revolution R Open is also available for these Experimental platforms. While we expect it to work, RRO hasn’t been completely tested on these platforms. Let us know if you encounter problems.
  • OpenSUSE 13.1
  • OS X Yosemite (10.10)

To learn about other system requirements, read more in our installation guide.

Help and Resources

Revolution R Open provides detailed installation instructions and learning resources to help you get started with R.
Visit the Revolution R Open Google Group for discussions with other users.
Technical support and a limited warranty for Revolution R Open is available with a subscription to Revolution R Plus from Revolution Analytics.

A simple video on Decision Trees with an Example

I found this nice video example to explain how decision trees work. I think it is a good, easy example for understanding what is going on.
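If you want to try a decision tree yourself after watching, a minimal sketch in R uses the rpart package (one of the recommended packages that ships with R) on the built-in iris data:

```r
library(rpart)

# Fit a classification tree predicting species from the four measurements
fit <- rpart(Species ~ ., data = iris)

# The printed tree shows each split rule and the class at each leaf
print(fit)

# A tree predicts by routing each row down the splits to a leaf
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)   # training accuracy
```

The printed splits (on petal length and width) are exactly the kind of yes/no questions the video walks through.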


Saturday, November 8, 2014

Funny Common Core Math Video

Common Core math is simply new math under a new name. It is still a bad way to teach math, developed by a group of people who should not be teaching math. When I was in school, new math was all the rage. It was a terrible way to teach math to children, and it failed a generation of students. Tom Lehrer ridiculed the method in song. Eventually the new math concept was abandoned. Sadly, new math has reappeared under the name Common Core. So it is time to bring out Tom Lehrer's old song and make fun of Common Core. It really is pretty funny.

Friday, November 7, 2014

Linear Regression in R example video


Here is an example of Linear Regression in R. I have done a post that showed how to do this, but sometimes a video example is better.



Random Forest video Tutorial

Here is a pretty good and short tutorial on Random Forests. Sometimes I pick things up quickly when I watch a video demo. That was the case with decision trees for me. I just was not understanding what was going on when I was reading about them.




Thursday, November 6, 2014

The SVD Song - Too Funny!

While searching through some tutorial videos on YouTube I ran across this video of the SVD song. It is totally hilarious. I would also say it does a pretty good job of explaining what a Singular Value Decomposition (SVD) is.
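For anyone who wants to see the decomposition the song describes, base R's svd() shows it in a few lines (a toy sketch with a small made-up matrix):

```r
# Any matrix M factors as U %*% diag(d) %*% t(V),
# with orthonormal columns in U and V and non-negative singular values d
M <- matrix(c(3, 1, 1, 3, 2, 0), nrow = 2)
s <- svd(M)

s$d                                  # singular values, largest first
recon <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(recon, M)                  # the factors rebuild the original matrix
```

The singular values play the starring role in applications like low-rank approximation: truncating the smallest ones gives the best approximation of M at that rank.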






Rstudio publishes new Shiny Cheat Sheet - Perfect for the Coursera Developing Data Products Class

Today RStudio published a Cheat Sheet for its Shiny application. This is a great reference for new Shiny users, especially those who are taking online classes. The Coursera Data Science classes make extensive use of Shiny, and this Cheat Sheet makes it much easier to get started. I would also recommend doing the Shiny tutorial on the RStudio website, as it is excellent.








Data Scientist Max Kuhn presents his Caret Package

Here is the video of an excellent interview with Max Kuhn, the creator of the Caret Package. Caret is a key package for R users that incorporates about 180 predictive models behind a single interface. He is also the author of Applied Predictive Modeling.


An example of a Bitcoin exchange

Using R and SciDB, Bryan Lewis gave this talk working through an example of analyzing a Bitcoin exchange with graph algorithms.


Tuesday, November 4, 2014

Predicting Injuries to NFL Players for Fantasy Football

With the number of people using player predictions for fantasy football, I was surprised to find that few if any of these predictions include a factor for the chance an NFL player gets injured. This is a critical factor in the decision process, because a less durable player can cost you a week if he is injured in a game, or a season if you draft a guy and he is out for the year. I have always found the NFL the hardest to predict of the professional sports in the US because of the short season and the limited number of events in each game. Basically it becomes, for me, a rare-event problem. I wondered if anyone had looked at this and come up with a solution. I learned long ago as an R programmer that if you want to do something, always look to see if someone has already built a package for it, because they usually have. There are a few companies that have in fact worked on this problem and are selling their results to interested parties. One such company is Sports Injury Predictor. They claim a 60% accuracy rate in prediction, but they do not define that in terms of the time period over which it is accurate or its potential impact on fantasy team results, which is actually the outcome we are concerned with.

Monday, November 3, 2014

An Example of Parallel Computing Performance using doMC for a Random Forest Model in Caret in R.

I was working on a project using Random Forest from the Caret Package in R. Since it was not a tiny data set and Random Forest tends to run fairly slowly, I was utilizing the built-in parallel capability of Caret, which uses foreach and has a number of parallel backends available to it. Since I have a four-core laptop I use the doMC backend and set the number of cores equal to 4. I know it sounds odd, but it got me wondering if 4 was the optimal number of workers (remember, the number you register is not really the number of cores you have on your machine, but the number of workers you create to receive jobs). I mean, would it be better if I registered five workers, so that when one worker finished a job there would always be a worker ready for the core that just opened up? On the other hand, would I be better off with 3 workers, which would leave one core free to act as a head node (I have heard this improves performance when you use MPI, but I have never actually used this approach)? So I decided to do some tests to find out.
First load in the required packages.
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
require(doMC)
## Loading required package: doMC
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
Then I read in the data set that I was playing with.
training_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(training_URL,na.strings=c("NA",""))
test<-read.csv(test_URL,na.strings=c("NA",""))
Then I got rid of the columns that are simply an index, timestamp or username.
training<-training[,7:160]
test<-test[,7:160]
Remove the columns that are mostly NAs. I wanted this to run in a reasonable amount of time even if it was on a single core.
mostly_data<-apply(!is.na(training),2,sum)>19621
training<-training[,mostly_data]
test<-test[,mostly_data]
dim(training)
## [1] 19622 54
I partitioned the training set into a smaller set called training1, really to get the model to run in a reasonable amount of time.
InTrain<-createDataPartition(y=training$classe,p=0.3,list=FALSE)
training1<-training[InTrain,]
To establish a baseline I ran the model on a single worker.
registerDoMC(cores=1)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time1<-proc.time()-ptm
print(time1)
##    user  system elapsed 
## 737.271   5.038 742.307
Next try it with three workers.
registerDoMC(cores=3)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time3<-proc.time()-ptm
print(time3)
##   user system elapsed 
## 323.22   2.58  209.38
Now four workers.
registerDoMC(cores=4)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time4<-proc.time()-ptm
print(time4)
##    user  system elapsed 
## 556.600   4.688 178.345
And finally 5 workers.
registerDoMC(cores=5)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time5<-proc.time()-ptm
print(time5)
##    user  system elapsed 
## 503.992   4.991 158.250
I also wanted to check 4 workers with four-fold cross validation.
registerDoMC(cores=4)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=4),
                prox=TRUE,allowParallel=TRUE)
time6<-proc.time()-ptm
print(time6)
##   user system elapsed 
## 528.90   5.08  132.72
I actually ran this analysis a number of times, and consistently setting the number of workers to 5 on my 4-core machine yielded the best performance.
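To put those timings in perspective, the speedup relative to the single-worker run can be computed directly from the elapsed times above:

```r
# Elapsed times (seconds) from the 5-fold runs above, keyed by worker count
elapsed <- c(`1` = 742.307, `3` = 209.38, `4` = 178.345, `5` = 158.250)

# Speedup relative to the single-worker baseline
speedup <- elapsed[1] / elapsed
round(speedup, 2)
```

Five workers on the 4-core machine gives roughly a 4.7x speedup over one worker, which is consistent with the oversubscription argument: a fifth worker keeps a core busy while another worker is between jobs.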

Tuesday, October 28, 2014

An example of using Random Forest in Caret with R.

Here is an example of using Random Forest in the Caret Package with R.
First Load in the required packages
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Read in the Training and Test Set.
training_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(training_URL,na.strings=c("NA",""))
test<-read.csv(test_URL,na.strings=c("NA",""))
Then I got rid of the columns that are simply an index, timestamp or username.
training<-training[,7:160]
test<-test[,7:160]
Remove the columns that are mostly NAs. They could be useful in the model, but it is easier to cut the data.frame down and see if it gives good results
mostly_data<-apply(!is.na(training),2,sum)>19621
training<-training[,mostly_data]
test<-test[,mostly_data]
dim(training)
## [1] 19622    54
I partitioned the training set into a smaller set called training1, really to speed up the running of the model.
InTrain<-createDataPartition(y=training$classe,p=0.3,list=FALSE)
training1<-training[InTrain,]
So I used caret with random forest as my model with 5-fold cross validation.
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
print(rf_model)
## Random Forest 
## 
## 5889 samples
##   53 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008   
##   27    1         1      0.005        0.006   
##   53    1         1      0.006        0.007   
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
print(rf_model$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,      allowParallel = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.88%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554
That is a pretty amazingly good model: 0.987 accuracy! I usually hope for something in the >0.7 range in my real work.
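As a sanity check, the out-of-bag accuracy can be recomputed straight from the printed confusion matrix (base R only, with the counts copied from the output above):

```r
# OOB confusion matrix from the final model (rows = true class A..E)
conf <- matrix(c(1674,    0,    0,   0,    0,
                   11, 1119,    9,   1,    0,
                    0,   11, 1015,   1,    0,
                    0,    2,   10, 952,    1,
                    0,    1,    0,   5, 1077),
               nrow = 5, byrow = TRUE)

oob_acc <- sum(diag(conf)) / sum(conf)
round(oob_acc, 4)   # 0.9912, matching the 0.88% OOB error rate printed above
```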

Friday, October 24, 2014

Where does the Boston Bruins' bar bill rank among the biggest bar bills ever?

In 2011 the Boston Bruins went to a bar at the MGM Grand Casino in Connecticut after they won the Stanley Cup and spent $156K. That seemed to me to be a crazy massive bill, although the bulk of it was made up of one massively expensive bottle of wine. I wrote about that back in 2011. The funny thing is that this bill is only ranked the sixth largest bar bill ever! Now most of the other bills are from Russian billionaires and investment bankers, but there is a bill from LeBron James in Las Vegas that is larger than the Bruins' bill by about $30K. I do not think LeBron will top that bill back in Cleveland, because they just do not have that kind of expensive booze in Ohio. Here is the list of the most expensive bills ever.

The funny thing about all these events is that a major portion of each bill is made up of a ridiculously expensive large-format bottle of wine (usually champagne). Without these parties those bottles would never sell. After the party the bottles are usually signed and put on display at the club, which continues to use the party to advertise itself, as the MGM does with the Bruins' bottle. It makes you wonder if these people ever really paid for these parties or if there was some side advertising deal. Party on...

Thursday, October 23, 2014

A Project using the data from the Boston Subway System - the MBTA

I was in Boston yesterday, and the group I was with ended up talking about Boston's subway system, which is run by the MBTA. We talked about projects that have used the recently opened-up data on the MBTA system. One of the best visualizations of the system I have seen is by Mike Barry and Brian Card on GitHub. I provide the link here for all those who want to enjoy their work.

Tuesday, October 21, 2014

Coursera Data Mining Class uses the Caret Package by Max Kuhn

I have been taking the Coursera Data Science track for fun over the last couple of months. Each class is about a month long, and it is all in R, which is great. Although the classes are fairly basic, I have found them enjoyable to do, and some of their examples have given me nicer ways to do things than how I have done those operations in the past. The eighth class in the series is called Practical Machine Learning. So far it has been a great ride through Max Kuhn's Caret package. I have been using this package since 2008, and I have always believed that Caret has become the de facto machine learning package in R. Part of the reason for this is that it supports something like 187 different models. The main reason for me is that the unified interface to those models makes it easy to try models that you are not an expert in. This makes the modeling process better, because it used to be that people only used the models they knew well and used often, which might not be the best models for the data they are working on. Caret lowers the barrier to trying new models and opens the door to better and more robust prediction.

If you have never used the Caret package, you should try it in the Coursera class. If you have and want to learn more, here is the website. There is also Max's book that uses Caret, called Applied Predictive Modeling.
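To illustrate the unified interface, here is a small sketch: the same train() call fits two very different model types just by changing the method argument, and resamples() then compares them on a common footing. The iris data is only a convenient built-in stand-in.

```r
library(caret)   # rpart ships with R; method = "knn" is built into caret

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)

# Identical calls, different models: only `method` changes
tree_fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)
knn_fit  <- train(Species ~ ., data = iris, method = "knn",   trControl = ctrl)

# Compare both models on the same cross-validation folds
summary(resamples(list(CART = tree_fit, kNN = knn_fit)))
```

Swapping in any of Caret's other supported methods works the same way, which is what makes it so cheap to try a model you are not an expert in.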

Sunday, October 19, 2014

Boston R Meetup with Bryan Lewis, R, Shiny, RStudio and SciDB

Bryan Lewis is Paradigm4's Chief Data Scientist and a partner at Big Computing. He is the author of a number of R packages including SciDB-R, and has worked on many performance-related aspects of R. Bryan is an applied mathematician, kayaker, and avid amateur mycologist. 

For years Bryan has been excited about the potential of computational databases like SciDB to provide fast analytics on truly massive data sets. I have seen Bryan give this talk, and it is well worth your time. In addition, he has added RStudio and Shiny to his talk, which makes this a must-attend meetup in Boston on Wednesday night. Shiny has really enhanced the ability to communicate results to non-experts through interactive visualizations that are easy to create and modify.
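As a taste of why Shiny makes interactive visualization so easy, here is a minimal self-contained sketch (not from Bryan's talk): a slider that redraws a histogram whenever the input changes.

```r
library(shiny)

ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  # Re-runs automatically whenever the slider value changes
  output$hist <- renderPlot(hist(rnorm(input$n), main = "Random sample"))
}

app <- shinyApp(ui, server)
# runApp(app)  # opens the interactive app in a browser
```

That is the entire app: no HTML, no JavaScript, just a UI definition and a reactive server function.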

Here is the abstract to his talk:


R is a powerful system for computation and visualization widely used in biostatistics and analyses of genomic data. An active and engaged research community continuously expands R's capabilities through thousands of available packages on CRAN and Bioconductor. 
SciDB is a scalable open-source database used in large omics workloads like the NCBI 1000 genomes project and other genomic variant applications, applications derived from the cancer genome atlas, and more. 
The `scidb` package for R available on CRAN lets R researchers use SciDB on very large datasets directly from R without learning a new database query language. Bryan will use RStudio, Shiny, and SciDB to demonstrate a few common, real-world genomics analysis workflows in use today by SciDB-R users. We'll see basic techniques like enrichment problems using fast parallel Fisher tests and also more challenging problems like large-scale correlation and network analysis. 
Here is the link to the meetup.

Friday, October 17, 2014

R Statistical Programming Training and the Shift in R Consulting.

At Bigcomputing we have been R consultants since 2009. We have provided consulting in a variety of verticals to build, improve, or speed up our customers' algorithms, and we have provided R training. Over the last few years there has been a shift in the data science consulting business that no one is talking about.

In 2009 working on algorithms represented 90% of our business, training 10% and visualization 0%. This distribution was similar for all the R consulting groups that I knew of at the time and held that way for a number of years.

Starting in 2012 and continuing to today, algorithm building has become a smaller and smaller part of the business, while demand for training and interactive visualization tools has exploded.
How big has that shift been? I know of one consulting group that has stopped doing anything but training and had to triple in size to meet the demand.

We have experienced a similar shift. Now our business is made up mostly of training and visualization, although Bigcomputing has not abandoned algorithmic work. Why? Because we believe that after the people we have trained develop their skills, they will want and need to do more. We can only help them do that if we are already doing that more sophisticated work with the most current developments.


Underneath this change I feel there is also a shift in the data science world away from the most accurate models toward more standard, robust, and interpretable modeling approaches. These models tend to be easy to visualize in an interactive way that can be shared with non-data scientists, not just as a result but as information. This is a powerful shift, because the simpler approaches can be used effectively by a much larger group of people. For example, a simple model in R can typically be fit with one line of code, versus the many lines of code needed for some type of ensemble method. Anyone can write one line of code; all they need is a little bit of training, which is exactly what we are giving them.
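To make the one-line claim concrete, here is a base R sketch on the built-in mtcars data (the particular variables chosen are arbitrary):

```r
# A simple, interpretable model really is one line of code:
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)  # coefficients map directly to an interpretable story
```

Each coefficient has a plain reading (expected change in mpg per unit change in weight or horsepower), which is exactly the kind of output that is easy to put in front of a non-expert.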

Wednesday, October 15, 2014

Interesting Article on Making Money in Fantasy Sports.

The New York Times posted a story about a guy who has made a reasonable profit on fantasy sports sites. Here is a link to the site.

I have done work in this area, both for fun and for profit. Years ago Drew Conway started a machine learning fantasy football league, which I played in. I think it is fair to say that the results of that league were unimpressive, even though it included some of the most successful predictive analytics guys in the country. Later we did some work on picking inefficiencies in the sports betting market, which was successful. It made a little bit of money week after week. It was an interesting project, but it simply was not profitable enough to justify the work involved.

Over time we learned a little about which sports are easier to do prediction in. The hardest by far of the pro sports was the NFL. The relatively few games in the NFL are a big reason for this. However, the NFL is not equally difficult throughout the season: the first three and last three weeks are the ones that are the hardest to predict, with the middle weeks being reasonable. Interestingly enough, we found the NBA, with its numerous games, small number of players, and other factors, was consistently the best data to work with, either on a player level for fantasy or on a team level for sports betting. We also found that the sports betting market for the NBA was efficient compared to other sports, so any inefficiency needed to be taken advantage of quickly. I found it interesting that the NBA is the sport the New York Times article focused on, because it is where I would focus as well.