Big Computing

Tuesday, July 14, 2015

An example of using Cubist in R for prediction

Recently I was given a moderate-sized data set and asked to do a quick prediction on it. I did not have a lot of time. When I pulled up the data set, it had 5000 rows and 254 predictors: 250 of the predictors were continuous and 4 were categorical. Each column had hundreds of NAs, and I really did not feel like going through and using imputation. The outcome variable was continuous. I decided to use Quinlan’s Cubist in R, which is available as a package maintained by Max Kuhn.
I was impressed by how quickly it ran and how good the results were.
Below is the code showing how I did the work using the caret and Cubist packages in R.
First, I loaded the caret package and the Cubist package:
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(Cubist)
## Loading required package: Cubist
Then I read in the data set and checked its dimensions:
predictors <- read.csv("trainPredictors.csv")
# drop the first column of the predictors file
predictors <- predictors[, -1]
outcomes <- read.csv("trainOutcomes.csv")
# drop the first column of the outcomes file
outcomes <- outcomes[, -1]
dim(predictors)
## [1] 5000  254
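Since every column had hundreds of missing values, it can help to look at the NA counts per column before modeling. The snippet below is only a minimal sketch on a small made-up data frame (toy and its columns are hypothetical, not the data from this post):

```r
# Toy data frame standing in for the real predictors (hypothetical values)
toy <- data.frame(
  a = c(1.5, NA, 3.2, NA),
  b = c("x", "y", NA, "z"),
  c = c(10, 20, 30, 40)
)

# Count the missing values in each column
naCounts <- colSums(is.na(toy))
print(naCounts)

# Proportion of rows that have at least one missing value
propIncomplete <- mean(!complete.cases(toy))
print(propIncomplete)
```

On the real data, the same two calls could be pointed at predictors instead of toy.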
I used caret to split the data into a training set and a test set, using an 80/20 split. I also kept the outcomes separate from the predictors in both the training and test sets:
inTrain <- createDataPartition(y = outcomes, p = .80)
inTrain <- unlist(inTrain)
trainpredictors <- predictors[inTrain, ]
trainoutcomes <- outcomes[inTrain]
testpredictors <- predictors[-inTrain, ]
testoutcomes <- outcomes[-inTrain]
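For readers without caret installed, the idea of an 80/20 split can be sketched in base R. Note that this is a plain random split with sample(), not what createDataPartition() does (caret tries to keep the outcome distribution similar in both sets). The seed and the simulated outcome vector are assumptions for illustration only:

```r
# Hypothetical seed so the split can be reproduced
set.seed(123)

# Simulated outcome vector, same length as the data in this post
toyOutcomes <- rnorm(5000)
n <- length(toyOutcomes)

# Randomly pick 80% of the row indices for training
inTrainToy <- sample(seq_len(n), size = floor(0.80 * n))

# Split the outcome into training and test parts
trainToy <- toyOutcomes[inTrainToy]
testToy <- toyOutcomes[-inTrainToy]

length(trainToy)
length(testToy)
```

Setting a seed before the split is a design choice worth considering in the real workflow too, since the R^2 on the test set will vary a little from one random split to the next.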
Then I simply ran the model. It ran very quickly:
modelTree <- cubist(x = trainpredictors, y = trainoutcomes)
Next I used that model to predict on the test set:
mtPred <- predict(modelTree, testpredictors)
Finally I computed R^2, the squared correlation between the predictions and the observed outcomes, to see how it did:
cor(mtPred, testoutcomes)^2
## [1] 0.840342
This is a great result for not much effort!
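One caveat about measuring fit with the squared correlation: it does not penalize predictions that are systematically shifted or scaled, so it can be worth checking RMSE alongside it. The sketch below uses made-up numbers (obs and predBiased are hypothetical, not output from the model above) to show the difference:

```r
# Hypothetical observed values and deliberately biased predictions
obs <- c(1, 2, 3, 4, 5)
predBiased <- 2 * obs + 1

# Squared correlation: the predictions are perfectly linear in obs
r2 <- cor(predBiased, obs)^2

# Root mean squared error: the predictions are still far from obs
rmse <- sqrt(mean((predBiased - obs)^2))

print(r2)
print(rmse)
```

Here the squared correlation comes out as 1 even though every prediction is off, while the RMSE is sqrt(18), a little over 4.2. caret's postResample() reports both kinds of measure in one call.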

Monday, July 13, 2015

Fastest way to read a CSV file into R

So I thought it would be really helpful to see just what the difference is between the two methods, read.csv and fread. For this example I am still using a relatively small data set. It is a little over five and a half million rows by six columns.
First, the read.csv function built into R:
## Start timer
ptm<-proc.time()
test1<-read.csv("baby_data.csv")
## Stop timer and print time
ptm<-proc.time()-ptm
dim(test1)
## [1] 5674089       6
print(ptm)
##    user  system elapsed 
##  33.427   0.495  33.945

Next, the fread function in the data.table package:
## Start timer
ptm<-proc.time()
require(data.table)
## Loading required package: data.table
test2<-fread("baby_data.csv")
## 
Read 64.9% of 5674089 rows
Read 86.2% of 5674089 rows
Read 5674089 rows and 6 (of 6) columns from 0.187 GB file in 00:00:05
ptm<-proc.time()-ptm
print(ptm)
##    user  system elapsed 
##   4.027   0.190   4.224

As you can see, fread() is about 8 times faster than read.csv() on this data set (4.2 seconds elapsed versus 33.9, a ratio of roughly 8.0), and the fread timing even includes loading the data.table package. That is pretty amazing. There is also a package called readr by Hadley Wickham that is a little slower than data.table but has some nice added features.
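The timing pattern used above can also be written with system.time(), which wraps the proc.time() bookkeeping in one call. This is a minimal sketch on a small temporary file created on the fly (the file and its contents are made up for illustration, so the timing will not be comparable to the numbers above):

```r
# Write a small made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
toyData <- data.frame(id = seq_len(1e5), value = rnorm(1e5))
write.csv(toyData, tmp, row.names = FALSE)

# Time the read; system.time() returns user, system and elapsed seconds
timing <- system.time(toyRead <- read.csv(tmp))
print(timing["elapsed"])

# Check that everything was read back, then clean up
dim(toyRead)
unlink(tmp)
```

Swapping read.csv(tmp) for data.table's fread(tmp) inside system.time() would time the other method the same way.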

Thursday, July 2, 2015

The R Consortium

One of the news stories to come out of useR! 2015 is the creation of the R Consortium. The Consortium is made up of the R Foundation and a number of corporate sponsors, including Microsoft, RStudio, TIBCO, Google, HP, Oracle, Alteryx, Mango Solutions and Ketchum Trading. The press release describes how it will be run: "The open governance model for R Consortium includes an infrastructure steering committee that will direct technical decisions and oversee working group projects and a board of directors to guide business decisions."

So this has me a bit confused, because up until now the technical decisions for core R have been handled by the R Foundation alone. Does this mean that the R Foundation and R Core have given up this function to the R Consortium? Also, what are the business decisions of R? It is an open source project that benefits from the contributions of its community. There are businesses that have been built around R, including one I helped found, Revolution Analytics. However, the development of R was always independent of the business decisions.

These changes could be good for R. It is just hard to tell from the limited information in the press releases. Open source projects have long depended on, and developed with, the support of companies (Java, Hadoop and even R). I think the concern arises if the development of R is usurped or redirected away from the needs and desires of the community toward the needs and desires of the Consortium members. For example, new versions of R that are designed to run faster on the Microsoft cloud with HP servers using TIBCO graphics through RStudio. I know that sounds as extreme as it is unlikely, but so far there are precious few details about the Consortium out there to lessen the concern.

I hope these worries are unfounded and that the Consortium has truly altruistic intentions, doing the time-consuming work needed to guide R to reach its full potential and remain a vibrant project that evolves and expands. Only time will tell.

Press Release