Your verification ID is: guDlT7MCuIOFFHSbB3jPFN5QLaQ Big Computing: An example of using Random Forest in Caret with R.

Tuesday, October 28, 2014

An example of using Random Forest in Caret with R.

Here is an example of using Random Forest in the Caret Package with R.
First Load in the required packages
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Read in the Training and Test Set.
training_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(training_URL,na.strings=c("NA",""))
test<-read.csv(test_URL,na.strings=c("NA",""))
Then I got rid of the columns that is simply an index, timestamp or username.
training<-training[,7:160]
test<-test[,7:160]
Remove the columns that are mostly NAs. They could be useful in the model, but it is easier to cut the data.frame down and see if it gives good results
mostly_data<-apply(!is.na(training),2,sum)>19621
training<-training[,mostly_data]
test<-test[,mostly_data]
dim(training)
## [1] 19622    54
I partitioned the training set into a smaller set called training1 really to speed up the running of the model
InTrain<-createDataPartition(y=training$classe,p=0.3,list=FALSE)
training1<-training[InTrain,]
So I used caret with random forest as my model with 5 fold cross validation
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
print(rf_model)
## Random Forest 
## 
## 5889 samples
##   53 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008   
##   27    1         1      0.005        0.006   
##   53    1         1      0.006        0.007   
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
print(rf_model$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,      allowParallel = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.88%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554
That is a pretty amazingly good model! .987 accuracy! I usually hope for something in the >.7 in my real work.

8 comments:

  1. Hi, what is the result for your test set?

    ReplyDelete
  2. Using caret for random forests is so slow on my laptop, compared to using the random forest package.
    I tried to find some information on running R in parallel. I installed the multicore package and ran the following before train():

    library(doMC)
    registerDoMC(5)

    That seems to help.

    ReplyDelete
  3. Nate, you are correct you need to add a Do package otherwise there is no parallel backend. usually those libraries come across as dependancies when you load the caret package. without them. remember caret is doing a lot of other work beside just running the random forest depending on your actual call. Also try the ranger random forest package in R. It is much faster than andy's package.

    ReplyDelete
  4. Hi NPHard,
    I tried the ranger package but some functions were not visible, such ad train and createDataPartition.
    what are their substitute in ranger?

    Thanks,

    ReplyDelete
  5. @Tita you can continue using caret with method="ranger" to build the model using ranger.

    ReplyDelete
  6. Very helpful! But still I don't really understand what mtry is doing. Is it a number of trees we are building?

    ReplyDelete
    Replies
    1. It's the number of variables tried at each node. The standard value is n/3 for regression and sqrt(n) for classification (n is the total number of variables).

      Delete