Big Computing: November 2014

Tuesday, November 18, 2014

An example of a Shiny App using an rCharts scatter plot

I was finally able to do an example of a Shiny App using the rCharts scatter plot. Switching this example from a standard R scatter plot to rCharts took me far longer than I expected. Now that I have finished it, I am not exactly sure why it was so challenging, because the final code looks pretty simple and clean. I also changed my code from the original app so that instead of two R code files (ui.R and server.R) there is only one (app.R). Even that took a little while because I messed up the commas. For this example I used the mtcars data set and let the user select the Y variable, the X variable, the color variable, and a variable to facet the plot into multiple panels.

Here is a link to the shiny app with embedded rChart


Also here is the code to create this shiny App:

require(shiny)
require(rCharts)
require(datasets)

# Server: build the rCharts scatter plot from the user's selections
server <- function(input, output) {
  output$myChart <- renderChart({
    p1 <- rPlot(input$x, input$y, data = mtcars, type = "point",
                color = input$color, facet = input$facet)
    p1$addParams(dom = "myChart")   # must match the showOutput() id below
    return(p1)
  })
}

# UI: four drop-downs drive the y, x, color, and facet variables
ui <- pageWithSidebar(
  headerPanel("Motor Trend Cars data with rCharts"),
  sidebarPanel(
    selectInput(inputId = "y",
                label = "Y Variable",
                choices = names(mtcars)),
    selectInput(inputId = "x",
                label = "X Variable",
                choices = names(mtcars)),
    selectInput(inputId = "color",
                label = "Color by Variable",
                choices = names(mtcars[, c(2, 8, 9, 10, 11)])),
    selectInput(inputId = "facet",
                label = "Facet by Variable",
                choices = names(mtcars[, c(2, 8, 9, 10, 11)]))
  ),
  mainPanel(
    showOutput("myChart", "polycharts")
  )
)

shinyApp(ui = ui, server = server)
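
If you want to run this locally before publishing, one way (a minimal sketch; the directory name is a placeholder) is to save the code above as app.R in its own folder and launch it:

# Launch the app from the folder that contains app.R
shiny::runApp("my-rcharts-app")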



Hopefully you can use my code as a template to shorten the time it takes you to learn how to do this.

Thursday, November 13, 2014

An example of a scatter plot using rCharts in R

Recently at the Connecticut R Users Meetup, Aaron Goldenberg gave a talk with examples of using rCharts to do plots in R. rCharts is one of a group of R packages that leverage javascript graphics libraries to make them easily available to R users. R has great graphics capability built in, but that capability is static rather than interactive. Development over the last few years has created a number of ways to add that interactivity. rCharts wraps some of the better javascript graphics libraries, and RStudio's shiny is a good example of making those graphics truly interactive. Combining the two produces powerful and engaging graphics, and I encourage anyone who is thinking about it to try it. The results are impressive and fairly easy to create.
I am going to do a few quick examples to get you started using this great package.
First, you will need to load the rCharts package. This is a little different because rCharts is not on CRAN, but it is on Github. In order to install packages from Github you will need to first install the package devtools from CRAN on your computer.
install.packages("devtools")
library(devtools)
Now we have the ability to download the rCharts package from Github.
install_github("rCharts","ramnathv")
library(rCharts)
Now we are ready to do some plots. Let's start with the old favorite mtcars data set that comes with base R and do a simple scatter plot.
require(rCharts)
rPlot(mpg~wt, data=mtcars,type="point")



It looks just like a normal scatter plot in R, but now when you move the cursor over a point in the plot, the data for that point pops up in a little box. This is so sweet. You can also make this scatter plot faceted:
rPlot(mpg~wt|am, data=mtcars,type="point")



So now we have created two side-by-side graphs where the data is further divided by whether the car is automatic or manual. We can further divide this data by making the colors of the points represent how many cylinders the car has.
rPlot(mpg~wt|cyl, data=mtcars,color="cyl",type="point")

Pretty powerful stuff. In basically one line of code I am able to create an interactive scatter plot showing four dimensions of a data set!

I am sorry that the screenshots are a little weak, but I wanted to show how the hover function with the cursor works, and that was the best I could do in the amount of time I had. I will try to do a better job in future posts.

Wednesday, November 12, 2014

Trend in Predictive Analytics

When I started working at Revolution Computing (now called Revolution Analytics), the trend in analytics was to create ever more sophisticated models. The need of the day was for faster computers and parallel computing to provide the processing power for all the computation these models required. The data itself was relatively small at the time, as was the number of people with the skill to do this kind of work.

The trend continued until 2013. Everything was about more complex models and putting them together using ensemble methods to increase the predictive power of the model. Think about it. The Netflix Prize was won by a group that basically put together 107 models in one massive hairball. Its predictive power was better than anything else, but it was completely opaque as to how it worked, and it took forever to run. Netflix claims they never implemented the winning model, basically because the engineering effort to implement the algorithm did not justify the performance improvement over a simpler algorithm.

Here is the write up from their blog:

In 2006 we announced the Netflix Prize, a machine learning and data mining competition for movie rating prediction. We offered $1 million to whoever improved the accuracy of our existing system called Cinematch by 10%. We conducted this competition to find new ways to improve the recommendations we provide to our members, which is a key part of our business. However, we had to come up with a proxy question that was easier to evaluate and quantify: the root mean squared error (RMSE) of the predicted rating. The race was on to beat our RMSE of 0.9525 with the finish line of reducing it to 0.8572 or less.
A year into the competition, the Korbell team won the first Progress Prize with an 8.43% improvement. They reported more than 2000 hours of work in order to come up with the final combination of 107 algorithms that gave them this prize. And, they gave us the source code. We looked at the two underlying algorithms with the best performance in the ensemble: Matrix Factorization (which the community generally called SVD, Singular Value Decomposition) and Restricted Boltzmann Machines (RBM). SVD by itself provided a 0.8914 RMSE, while RBM alone provided a competitive but slightly worse 0.8990 RMSE. A linear blend of these two reduced the error to 0.88. To put these algorithms to use, we had to work to overcome some limitations, for instance that they were built to handle 100 million ratings, instead of the more than 5 billion that we have, and that they were not built to adapt as members added more ratings. But once we overcame those challenges, we put the two algorithms into production, where they are still used as part of our recommendation engine.

If you followed the Prize competition, you might be wondering what happened with the final Grand Prize ensemble that won the $1M two years later. This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then. In the remainder of this post we will explain how and why it has shifted.

That type of response signaled the beginning of a shift away from the "best" model to simpler models that were easier to implement and easier to understand. That process is continuing today. The added side benefit is that the modeling can now be done by a much larger group of people. This change has helped address the growth in the size of the data and the lack of data scientists available to do the work.
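
To make the blending idea from the Netflix write-up concrete, here is a toy sketch in R using simulated ratings; the numbers are illustrative stand-ins, not Netflix's data:

# Toy sketch: blend two imperfect predictors and compare RMSEs
set.seed(1)
truth <- rnorm(1000, mean = 3.5, sd = 1)          # "true" ratings
pred_svd <- truth + rnorm(1000, sd = 0.90)        # stand-in for an SVD model
pred_rbm <- truth + rnorm(1000, sd = 0.95)        # stand-in for an RBM model
rmse <- function(a, b) sqrt(mean((a - b)^2))
rmse(pred_svd, truth)                             # error of model 1 alone
rmse(pred_rbm, truth)                             # error of model 2 alone
rmse(0.5 * pred_svd + 0.5 * pred_rbm, truth)      # the linear blend beats either alone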


Tuesday, November 11, 2014

My first Shiny App published to ShinyApps.io

Today I published my first Shiny App on ShinyApps.io. It is a very simple app that uses the built-in mtcars data set and plots two of the variables against each other. It also uses a third variable to color the points of the scatter plot. I added a plot of the linear regression fit for fun. I really enjoy creating Shiny Apps, and there has been some demand in the consulting business for them. I sometimes find the stricter format of Shiny frustrating. It looks like R, but it does not let everything through the way R does. As a non-programmer, one of the things I love about R is that it is so flexible and forgiving.

Anyway, here is a link to my Shiny App on ShinyApps.io.

Shiny is a web framework for R. It was created by the folks at RStudio. It basically allows you to deploy interactive R code through a webpage. The output can be in the form of text or graphs. The most common use is to deploy interactive graphs that let other people control the parameters of the plot using things called widgets. Widgets come in all kinds of varieties, but basically they are things like check boxes and sliders that change the elements being plotted. There is the additional option of using javascript plotting packages within the shiny framework so that you can interact directly with some of the features of the plot. A good example of this is the rCharts package. I first started using the rCharts package because it allows the user to hover directly over a plotted point and have a window open up with information about that point. That is a very useful feature. Although writing code for shiny looks like R, I will warn you that shiny is a little pickier about format than I am used to in R. That pickiness about things like commas caused me a little pain in the beginning, but once you get used to it, it is no big deal.
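
For instance, here is a minimal sketch of a widget in action (my own toy example, not one of my published apps): a slider controls the number of histogram bins.

require(shiny)

# A slider widget drives the number of bins in a histogram
ui <- fluidPage(
  sliderInput(inputId = "bins", label = "Number of bins",
              min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(mtcars$mpg, breaks = input$bins, main = "MPG histogram")
  })
}

shinyApp(ui = ui, server = server)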

The Shiny website is an excellent source of information on Shiny. It provides a great tutorial to get you started. It also provides a gallery of existing shiny apps that you can view and get ideas from. The gallery is also great because one of the best ways to learn shiny is to use the existing examples in the gallery to build your own shiny app. This was very helpful to me. I just copied a few of these examples into my RStudio session and started modifying the code to change things: use a different data set, change the variables, add a regression line, etc.

Here is the link to Shiny

As you can see from the link to my app, the folks at RStudio have also created something called ShinyApps.io. ShinyApps.io is a place to deploy your shiny apps on the web so anyone can view and interact with them. It is an excellent setup and ridiculously easy to use. You simply deploy your shiny app from your RStudio environment, and there it is, up and running.
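
For reference, deploying from R looks roughly like this (a sketch assuming the shinyapps package and an account token from the ShinyApps.io dashboard; the names below are placeholders):

require(shinyapps)
# One-time account setup with the token and secret from the dashboard:
# shinyapps::setAccountInfo(name = "youraccount", token = "TOKEN", secret = "SECRET")
deployApp("my-shiny-app")   # the folder containing app.R or ui.R/server.R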

Monday, November 10, 2014

Revolution R Open

Recently Revolution Analytics released Revolution R Open. This is their open source version of R that has some enhancements over basic open source R. To me the most significant enhancement is the use of Intel's MKL libraries, which will speed up your computations, especially if you use Windows as your operating system. Here is the link to Revolution R Open where you can read more about it and download the software.



For those of you too lazy to click a link, here is the text from the Revolution Analytics blog announcing the release:

What is Revolution R Open

Revolution R Open is the enhanced distribution of R from Revolution Analytics.
Revolution R Open is based on the statistical software R (version 3.1.1) and adds additional capabilities for performance, reproducibility and platform support.
Just like R, Revolution R Open is open source and free to download, use, and share.
Revolution R Open includes: R 3.1.1, compatibility with all R 3.1.1 packages, multithreaded math performance via the Intel MKL, and the Reproducible R Toolkit (each described in the sections below).
Get Revolution R Open today! You can download and install Revolution R Open, free of charge.

R: A Complete Environment

R is a complete environment for data scientists.
Revolution R Open is built on R 3.1.1 from the R Foundation for Statistical Computing. R is the most widely-used language for statistics and data science, and is ranked the 9th most popular of all data science languages by the IEEE. R is used by leading companies around the world as part of data-driven applications in industries including finance, healthcare, technology, scientific research, media, government and academia.
The R language includes every data type, data manipulation, statistical model, and chart that the modern data scientist could ever need. Learn more about R here.

Total Compatibility

R developers have contributed thousands of free add-on packages for R, to further extend its capabilities for data handling, data visualization, statistical analysis and much more. Learn more about R packages here. Almost 6000 packages are available in CRAN (the Comprehensive R Archive Network), and you can browse packages by name or by topic area at MRAN. Even more packages can be found at GitHub (including the RHadoop packages to integrate R and Hadoop) or in the Bioconductor repository. All packages that support R 3.1.1 are compatible with Revolution R Open.
Revolution R Open is also compatible with R user interfaces including RStudio, which we recommend as an excellent IDE for R developers. Applications that include the capability to call out to R are also compatible with Revolution R Open. If you would like to integrate R into your own application, DeployR Open is designed to work with Revolution R Open.

Multithreaded Performance (MKL)

From the very beginning, R from the R Foundation was designed to use only a single thread (processor) at a time. Even today, R still works that way unless linked with multi-threaded BLAS/LAPACK libraries.
The machines of today offer so much more in terms of processing power. To take advantage of this, Revolution R Open includes by default the Intel Math Kernel Library (MKL), which provides these BLAS and LAPACK library functions used by R. Intel MKL is multi-threaded and makes it possible for so many common R operations, such as matrix multiply/inverse, matrix decomposition, and some higher-level matrix operations, to use all of the processing power available.
Our tests show that linking to MKL improves the performance of your R code, especially where many vector/matrix operations are used. See these benchmarks. Performance improves with additional cores, meaning you can expect better results on a four-core laptop than on a two-core laptop--even on non-Intel hardware.
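
As a rough way to see the effect on your own machine, you can time a BLAS-heavy operation under base R and under Revolution R Open and compare (my own informal sketch, not one of the official benchmarks):

# Time a large matrix multiply; MKL accelerates this kind of operation
set.seed(1)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(crossprod(m))   # t(m) %*% m, a BLAS-bound computation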
MKL's default behavior is to use as many parallel threads as there are available cores. There’s nothing you need to do to benefit from this performance improvement--not a single change to your R script is required. Learn how to control or restrict the number of threads.
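
In Revolution R Open, thread control is exposed through the RevoUtilsMath package (a sketch; check the linked documentation for the exact functions in your version):

# Query and restrict the number of MKL threads
require(RevoUtilsMath)
getMKLthreads()    # how many threads MKL is currently using
setMKLthreads(2)   # e.g., limit MKL to two threads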

Reliable R code (RRT)

Most R scripts rely on one or more CRAN packages, but packages on CRAN change daily. It can be difficult to write a script in R and then share it with others, or even run it on another system, and get the same results. Changes in package versions can result in your code generating errors or, even worse, generating incorrect results without warning.
Revolution R Open includes the Reproducible R Toolkit. The MRAN server archives the entire contents of CRAN on a daily basis, and the checkpoint function makes it easy to install the package versions required to reproduce your results reliably.
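
In practice the workflow is a one-liner at the top of your script (a minimal sketch; the snapshot date here is arbitrary):

# Pin this script's packages to the CRAN snapshot from a given date
library(checkpoint)
checkpoint("2014-11-10")
# Packages used in this project are now installed from, and loaded
# against, that daily MRAN snapshot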

Platform Support

Supported Platforms. Revolution R Open is built and tested on the following 64-bit platforms:
  • Ubuntu 12.04, 14.04
  • CentOS / Red Hat Enterprise Linux 5.8, 6.5, 7.0
  • OS X Mavericks (10.9)
  • Windows® 7.0 (SP 1), 8.0, 8.1, Windows Server® 2008 R2 (SP1) and 2012
Experimental Platforms. Revolution R Open is also available for these Experimental platforms. While we expect it to work, RRO hasn’t been completely tested on these platforms. Let us know if you encounter problems.
  • OpenSUSE 13.1
  • OS X Yosemite (10.10)

To learn about other system requirements, read more in our installation guide.

Help and Resources

Revolution R Open provides detailed installation instructions and learning resources to help you get started with R.
Visit the Revolution R Open Google Group for discussions with other users.
Technical support and a limited warranty for Revolution R Open are available with a subscription to Revolution R Plus from Revolution Analytics.

A simple video on Decision Trees with an Example

I found this nice video example to explain how decision trees work. I think it is a good, easy example for understanding what is going on.
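
If you want to try one yourself after watching, here is a minimal sketch using the rpart package (my own toy example, not from the video):

# Fit and display a small classification tree on the iris data
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)             # text view of the splits
plot(fit); text(fit)   # quick-and-dirty plot of the tree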


Saturday, November 8, 2014

Funny Common Core Math Video

Common Core Math is simply new math under a new name. It is still a bad way to teach math, developed by a group of people who should not teach math. When I was in school, new math was all the rage. It was a terrible way to teach math to children that failed a generation of students. Tom Lehrer ridiculed the method in song. Eventually the new math concept was abandoned. Sadly, new math has reappeared under the new name Common Core. So it is time to bring out Tom Lehrer's old song and make fun of Common Core. It really is pretty funny.

Friday, November 7, 2014

Linear Regression in R example video


Here is an example of linear regression in R. I have done a post that showed how to do this, but sometimes a video example is better.
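
And if you prefer code to video, the core of it is only a few lines (using the mtcars data, as elsewhere on this blog):

# Fit a simple linear regression and inspect it
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                    # coefficients, R-squared, p-values
plot(mpg ~ wt, data = mtcars)   # scatter plot of the data
abline(fit)                     # overlay the fitted line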



Random Forest video Tutorial

Here is a pretty good, short tutorial on Random Forests. Sometimes I pick things up more quickly when I watch a video demo. That was the case with decision trees for me. I just was not understanding what was going on when I was reading about them.
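
As a code companion to the video, here is a minimal sketch with the randomForest package (my own toy example):

# Fit a small random forest classifier on the iris data
library(randomForest)
set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)
print(fit)        # OOB error estimate and confusion matrix
importance(fit)   # which variables drive the splits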




Thursday, November 6, 2014

The SVD Song - Too Funny!

While searching through some tutorial videos on youtube I ran across this video of the SVD song. It is totally hilarious. I would also say it does a pretty good job of explaining what a Singular Value Decomposition (SVD) is.
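
For anyone who wants to see it in code rather than song, R's built-in svd() shows the idea in a few lines:

# Factor a matrix X into U D V' and reconstruct it
set.seed(1)
X <- matrix(rnorm(12), nrow = 4)
s <- svd(X)                                # s$u, s$d (singular values), s$v
X_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(X, X_rebuilt)                    # TRUE up to numerical error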






RStudio publishes new Shiny Cheat Sheet - Perfect for the Coursera Developing Data Products Class

Today RStudio published a Cheat Sheet for its Shiny application. This is a great reference for new shiny users, especially those who are taking online classes. The Coursera Data Science classes make extensive use of shiny, and this Cheat Sheet makes it much easier to get started with shiny. I would also recommend doing the shiny tutorial on the RStudio website, as it is excellent.








Data Scientist Max Kuhn presents his Caret Package

Here is the video of an excellent interview with Max Kuhn, the creator of the Caret package. Caret is a key package for R users that brings about 180 predictive models together under a single interface. He is also the author of Applied Predictive Modeling.


An example of Bitcoin exchange analysis

Using R and SciDB, Bryan Lewis gave this talk working through an example of analyzing Bitcoin exchange data with graph algorithms.


Tuesday, November 4, 2014

Predicting Injuries to NFL Players for Fantasy Football

With the number of people using player predictions for fantasy football, I was surprised to find that few if any of these predictions include a factor for the chance that an NFL player gets injured. This is a critical factor in the decision process, because a less durable player can cost you a week if he is injured in a game, or a season if you draft a guy and he is out for the year. I have always found the NFL the hardest to predict of the professional sports in the US because of the short season and the limited number of events in each game; basically, it becomes a rare-event problem for me. I wondered if anyone had looked at this and come up with a solution. I learned long ago as an R programmer that if you want to do something, always look to see if someone has already built a package for it, because they usually have. There are a few companies that have in fact worked on this problem and are selling their results to interested parties. One such company is Sports Injury Predictor. They claim a 60% accuracy rate in prediction, but they do not define that in terms of the time period over which it is accurate or its potential impact on fantasy team results, which is actually the outcome we care about.
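
As a back-of-the-envelope illustration, here is a toy sketch of folding injury risk into a projection (all numbers are hypothetical, not from Sports Injury Predictor):

# Toy sketch: discount a player's projected points by injury risk,
# assuming an injury ends his season when it occurs
weekly_points <- 15      # projected points per healthy week (made up)
p_injury     <- 0.04     # assumed per-game chance of a season-ending injury
weeks        <- 16
# Probability of playing week k is surviving the k-1 prior games:
expected_weeks  <- sum((1 - p_injury)^(0:(weeks - 1)))
expected_points <- weekly_points * expected_weeks
expected_points          # compare against a more durable, lower-scoring player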

Monday, November 3, 2014

An Example of Parallel Computing Performance Using doMC for a Random Forest Model in Caret in R

I was working on a project using random forest from the caret package in R. Since it was not a tiny data set and random forest tends to run fairly slowly, I was utilizing the built-in parallel capability of caret, which uses foreach and has a number of parallel backends available to it. Since I have a four-core laptop, I use the doMC backend and set the number of cores equal to 4. I know it sounds odd, but it got me wondering whether 4 was the optimal number of workers (remember, the number you register is not really the number of cores on your machine, but the number of workers you create to receive jobs). Would it be better if I registered five workers, so that when one worker finished a job there would always be a worker ready for the core that just opened up? On the other hand, would I be better off with 3 workers, which would leave one core free to act as a head node? (I have heard this improves performance when you use MPI, but I have never actually used this approach.) So I decided to run some tests to find out.

First, load the required packages:

require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
require(doMC)
## Loading required package: doMC
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

Then I read in the data set I was playing with:

training_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(training_URL,na.strings=c("NA",""))
test<-read.csv(test_URL,na.strings=c("NA",""))

Then I got rid of the columns that are simply an index, timestamp, or username:

training<-training[,7:160]
test<-test[,7:160]

Remove the columns that are mostly NAs. I wanted this to run in a reasonable amount of time even on a single core.

mostly_data<-apply(!is.na(training),2,sum)>19621
training<-training[,mostly_data]
test<-test[,mostly_data]
dim(training)
## [1] 19622 54

I partitioned the training set into a smaller set called training1, really to speed up the running of the model and get it to finish in a reasonable amount of time:

InTrain<-createDataPartition(y=training$classe,p=0.3,list=FALSE)
training1<-training[InTrain,]

To establish a baseline, I ran the model on a single worker:

registerDoMC(cores=1)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time1<-proc.time()-ptm
print(time1)
##    user  system elapsed
## 737.271   5.038 742.307

Next, try it with three workers:

registerDoMC(cores=3)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time3<-proc.time()-ptm
print(time3)
##   user system elapsed
## 323.22   2.58  209.38

Now four workers:

registerDoMC(cores=4)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time4<-proc.time()-ptm
print(time4)
##    user  system elapsed
## 556.600   4.688 178.345

And finally five workers:

registerDoMC(cores=5)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time5<-proc.time()-ptm
print(time5)
##    user  system elapsed
## 503.992   4.991 158.250

I also wanted to check four workers with four cross-validation folds:

registerDoMC(cores=4)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=4),
                prox=TRUE,allowParallel=TRUE)
time6<-proc.time()-ptm
print(time6)
##   user system elapsed
## 528.90   5.08  132.72

Comparing the elapsed times:

print(time1)
##    user  system elapsed
## 737.271   5.038 742.307
print(time3)
##   user system elapsed
## 323.22   2.58  209.38
print(time4)
##    user  system elapsed
## 556.600   4.688 178.345
print(time5)
##    user  system elapsed
## 503.992   4.991 158.250
print(time6)
##   user system elapsed
## 528.90   5.08  132.72

I actually ran this analysis a number of times, and consistently setting the number of workers to 5 on my 4-core machine yielded the best performance.