Tuesday, July 14, 2015

An example of using Cubist in R for prediction


Recently I was given a moderate-sized data set and asked to do a quick prediction on it. I did not have a lot of time. When I pulled up the data set, it had 5000 rows and 254 predictors: 250 continuous and 4 categorical. Each column had hundreds of NAs, and I really did not feel like going through and imputing them. The outcome variable was continuous. I decided to use Quinlan's Cubist, which is available in R as the Cubist package maintained by Max Kuhn.
I was impressed by how quickly it ran and how good the results were.
Below is the code showing how I did the work using the caret and Cubist packages in R.
First, I loaded the caret and Cubist packages:

require(caret)

## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2

require(Cubist)

## Loading required package: Cubist

Then I read in the data set and checked its dimensions:

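# Read the predictors and drop the first column (presumably a row index)
predictors <- read.csv("trainPredictors.csv")
predictors <- predictors[, -1]
# Read the outcomes and drop the first column, leaving the outcome as a vector
outcomes <- read.csv("trainOutcomes.csv")
outcomes <- outcomes[, -1]
dim(predictors)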

## [1] 5000  254
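
Before deciding to skip imputation, it can help to see how the missing values are spread across the columns. Here is a minimal sketch on a small made-up data frame (the name toy and its values are hypothetical, not the data from this post); on the real data the same calls would be applied to predictors.

```r
# Toy stand-in for the real predictors (hypothetical values)
toy <- data.frame(
  x1  = c(1.2, NA, 3.4, 4.1, NA),
  x2  = c(NA, 2.0, 2.5, NA, NA),
  grp = factor(c("a", "b", NA, "a", "b"))
)

# Number of missing values in each column
naPerColumn <- colSums(is.na(toy))
print(naPerColumn)

# Share of rows that have no missing values at all
completeShare <- mean(complete.cases(toy))
print(completeShare)
```

In this toy example every row has at least one NA, so simply dropping incomplete rows would leave nothing to model. That is the kind of situation where a method that accepts missing values saves a lot of work.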

I used caret to split the data into a training set and a test set, using an 80/20 split. I also separated the outcomes from the predictors in both the training and test sets:

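# Sample 80% of the row indices for training
inTrain <- createDataPartition(y = outcomes, p = 0.80)
inTrain <- unlist(inTrain)
# Split predictors and outcomes with the same indices
trainpredictors <- predictors[inTrain, ]
trainoutcomes <- outcomes[inTrain]
testpredictors <- predictors[-inTrain, ]
testoutcomes <- outcomes[-inTrain]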
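
The split is random, so the rows that land in the training set (and therefore the final score) will change from run to run unless a seed is set. The sketch below shows one way to make the partition repeatable; the seed values and the toyOutcome vector are arbitrary assumptions for illustration, not part of my original run.

```r
library(caret)

# Hypothetical continuous outcome standing in for the real one
set.seed(123)
toyOutcome <- rnorm(100)

# Setting the seed right before the split makes the partition repeatable
set.seed(42)
idxA <- createDataPartition(y = toyOutcome, p = 0.80, list = FALSE)[, 1]
set.seed(42)
idxB <- createDataPartition(y = toyOutcome, p = 0.80, list = FALSE)[, 1]

identical(idxA, idxB)   # the same rows are selected both times
length(idxA)            # roughly 80% of the 100 rows
```

Using list = FALSE returns the indices as a one-column matrix, which avoids the unlist() step.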

Then I simply fit the model. It ran very quickly:

modelTree <- cubist(x = trainpredictors, y = trainoutcomes)

Next I used that model to predict on the test set:

mtPred <- predict(modelTree, testpredictors)

Finally I computed R^2 (the squared correlation between predicted and actual values) on the test set to see how it did:

cor(mtPred, testoutcomes)^2

## [1] 0.840342
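
The squared correlation shows how closely the predictions track the outcome, but it says nothing about the size of the errors in the outcome's own units. A root mean squared error is a common companion measure. Here is a minimal sketch on made-up numbers (the observed and predicted vectors are hypothetical, not results from the model above):

```r
# Hypothetical observed and predicted values (not from the post's data)
observed  <- c(10, 12, 15, 11, 14)
predicted <- c(11, 12, 14, 10, 15)

# Squared correlation, the same measure used above
rSquared <- cor(predicted, observed)^2

# Root mean squared error, in the units of the outcome
rmse <- sqrt(mean((predicted - observed)^2))

print(c(R2 = rSquared, RMSE = rmse))
```

With the real objects, the same two lines would take mtPred and testoutcomes. caret also provides postResample(pred, obs), which reports RMSE and R-squared together.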

This is a great result for not much effort!

Monday, July 13, 2015

Fastest way to read a CSV file into R

I was never really concerned about the quickest way to read a CSV into R. The reason is that most of the data sets I deal with are very small, so the time to read the file in is usually not very important. However, I recently had a project that required reading not just one CSV into R, but a whole series of CSVs. Using the standard read.csv() function that is built into R just took forever. So I switched to fread() from the data.table package. What a difference! I understand that data.table has been around for a while, but for the newer R user it is a really good package to know about once you get beyond toy data sets.

So I thought it would be really helpful to see just what the difference is between the two methods. For this example I am still using a relatively small data set. It is a little over five and a half million rows by six columns.

First, the read.csv() function built into R:

## Start timer
ptm<-proc.time()
test1<-read.csv("baby_data.csv")
## Stop timer and print time
ptm<-proc.time()-ptm
dim(test1)

## [1] 5674089       6

print(ptm)

##    user  system elapsed 
##  33.427   0.495  33.945

And now the fread() function in the data.table package:

## Start timer
ptm<-proc.time()
require(data.table)

## Loading required package: data.table

test2<-fread("baby_data.csv")

## 
Read 64.9% of 5674089 rows
Read 86.2% of 5674089 rows
Read 5674089 rows and 6 (of 6) columns from 0.187 GB file in 00:00:05

ptm<-proc.time()-ptm
print(ptm)

##    user  system elapsed 
##   4.027   0.190   4.224

As you can see, fread() is about 8 times faster than read.csv() on this data set (4.2 seconds elapsed versus 33.9 seconds). That is pretty amazing. There is also a package called readr by Hadley Wickham that is a little slower than data.table but has some nice added features.
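
Since the original problem was reading a whole series of CSVs, here is a minimal sketch of that pattern. The file names and contents are hypothetical and written to a temporary directory so the example is self-contained. It uses system.time(), which wraps the proc.time() bookkeeping used above, and falls back to read.csv() if data.table is not installed.

```r
# Write three small hypothetical CSV files to a temporary directory
tmpDir <- tempfile("csvs")
dir.create(tmpDir)
for (i in 1:3) {
  write.csv(data.frame(id = 1:4, value = i * (1:4)),
            file.path(tmpDir, paste0("part", i, ".csv")),
            row.names = FALSE)
}

files <- list.files(tmpDir, pattern = "\\.csv$", full.names = TRUE)

# Time the whole read-and-combine step in one call
timing <- system.time({
  if (requireNamespace("data.table", quietly = TRUE)) {
    # fread() each file, then stack the pieces with rbindlist()
    combined <- data.table::rbindlist(lapply(files, data.table::fread))
  } else {
    # Base R fallback when data.table is not installed
    combined <- do.call(rbind, lapply(files, read.csv))
  }
})

dim(combined)
print(timing)
```

On files this small the timing is not meaningful; the point is the shape of the loop, which stays the same when the files are large.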

Thursday, July 2, 2015

The R Consortium

One of the news stories to come out of useR! 2015 is the creation of the R Consortium. The Consortium is made up of the R Foundation and a number of corporate sponsors. These include Microsoft, RStudio, TIBCO, Google, HP, Oracle, Alteryx, Mango Solutions and Ketchum Trading. The press release describes its structure this way: "The open governance model for R Consortium includes an infrastructure steering committee that will direct technical decisions and oversee working group projects and a board of directors to guide business decisions."

So this has me a bit confused, because up till now the technical decisions for core R have been handled by the R Foundation alone. Does this mean that the R Foundation and R Core have given up this function to the R Consortium? Also, what are the business decisions of R? It is an open source project that benefits from the contributions of its community. There are businesses that have been built around R, including one I helped found, Revolution Analytics. However, the development of R was always independent of those business decisions.

These changes could be good for R. It is just hard to tell from the limited information in the press releases. Open source projects have depended on and developed with the support of companies (Java, Hadoop and even R). I think the concern comes if the development of R is usurped or redirected away from the needs and desires of the community toward the needs and desires of the Consortium members. For example, new versions of R that are designed to run faster on the Microsoft cloud with HP servers using TIBCO graphics through RStudio. I know that sounds as extreme as it is unlikely, but so far there are precious few details about the Consortium out there to lessen the concern.

I hope these worries are unfounded and the Consortium has truly altruistic intentions, doing the time-consuming work needed to guide R to reach its full potential and remain a vibrant project that evolves and expands. Only time will tell.

Press Release

Big Computing

Kirk Mettler
