Big Computing: An example of using Random Forest in Caret with R.

Tuesday, October 28, 2014

An example of using Random Forest in Caret with R.

Here is an example of using Random Forest in the caret package with R.
First, load the required packages.
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Read in the training and test sets.
Then I got rid of the columns that are simply an index, timestamp, or username.
Remove the columns that are mostly NAs. They could be useful in the model, but it is easier to cut the data.frame down and see if it gives good results.
## [1] 19622    54
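The cleaning steps above can be sketched in base R. The original code is not shown in the post, so the data frame `raw`, its column names, and the 50% NA cutoff below are all hypothetical stand-ins.

```r
# Toy stand-in for the training data; the real column names are not shown in the post.
raw <- data.frame(
  X         = 1:6,                                  # row index
  user_name = c("a", "b", "c", "a", "b", "c"),      # username
  timestamp = 1414500000 + 1:6,                     # timestamp
  roll      = c(1.2, 0.8, 1.1, 0.9, 1.3, 1.0),      # usable predictor
  kurtosis  = c(NA, NA, NA, NA, NA, 0.4),           # mostly NA
  classe    = factor(c("A", "B", "A", "C", "B", "A"))
)

# Step 1: drop bookkeeping columns by name (assumed names).
id_cols <- c("X", "user_name", "timestamp")
cleaned <- raw[, !(names(raw) %in% id_cols), drop = FALSE]

# Step 2: drop columns where more than half the values are NA (assumed cutoff).
na_share <- colMeans(is.na(cleaned))
cleaned  <- cleaned[, na_share <= 0.5, drop = FALSE]

dim(cleaned)
```

On the toy data this leaves only `roll` and `classe`. The post's own result after cleaning was 19622 rows and 54 columns.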
I partitioned the training set into a smaller set called training1, really just to speed up the running of the model.
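A partition like this is commonly made with caret's `createDataPartition()`. The sketch below uses the built-in `iris` data as a stand-in. The value `p = 0.3` is a guess based on the 5889 of 19622 rows reported in the output; the seed and the name `holdout` are assumptions.

```r
library(caret)

set.seed(2014)  # arbitrary seed so the split is reproducible

# createDataPartition() samples within each class, so class proportions are kept.
in_train  <- createDataPartition(iris$Species, p = 0.3, list = FALSE)
training1 <- iris[in_train, ]
holdout   <- iris[-in_train, ]   # hypothetical name for the rows left out

table(training1$Species)
```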
So I used caret with random forest as my model, with 5-fold cross-validation.
## Random Forest 
## 5889 samples
##   53 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712 
## Resampling results across tuning parameters:
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008   
##   27    1         1      0.005        0.006   
##   53    1         1      0.006        0.007   
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,      allowParallel = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
##         OOB estimate of  error rate: 0.88%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554
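Output of this shape is consistent with a `train()` call along the following lines. This is a minimal sketch on `iris`, not the original code: the formula, data, and seed are assumptions, and the extra `proximity` argument visible in the printed call is left out.

```r
library(caret)
library(randomForest)

set.seed(2014)  # arbitrary seed

# 5-fold cross-validation, matching the resampling summary in the output.
ctrl <- trainControl(method = "cv", number = 5)

# method = "rf" wraps randomForest(); by default caret tries a small grid of mtry values.
fit <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)

print(fit)             # resampled accuracy and kappa for each mtry
print(fit$finalModel)  # OOB error estimate and confusion matrix
```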
That is a pretty amazingly good model! .987 accuracy! I usually hope for something above .7 in my real work.
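One way to back up a figure like this is to score the model on rows it never saw during fitting. The sketch below again uses `iris` as a stand-in; the 70/30 split, the seed, and the object names are assumptions, and it says nothing about the result on the post's own test set.

```r
library(caret)
library(randomForest)

set.seed(2014)  # arbitrary seed
in_train <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_df <- iris[in_train, ]
test_df  <- iris[-in_train, ]

fit <- train(Species ~ ., data = train_df, method = "rf",
             trControl = trainControl(method = "cv", number = 5))

# Predict on the held-out rows and compare with the true labels.
pred <- predict(fit, newdata = test_df)
cm   <- confusionMatrix(pred, test_df$Species)
cm$overall["Accuracy"]
```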


  1. Hi, what is the result for your test set?

  2. Using caret for random forests is so slow on my laptop, compared to using the random forest package.
    I tried to find some information on running R in parallel. I installed the multicore package and ran the following before train():


    That seems to help.

  3. Nate, you are correct: you need to add a "do" package (for example doMC or doParallel), otherwise there is no parallel backend. Usually those libraries come across as dependencies when you load the caret package. Remember, caret is doing a lot of other work besides just running the random forest, depending on your actual call. Also try the ranger random forest package in R. It is much faster than Andy's package.

  4. Hi NPHard,
    I tried the ranger package, but some functions were not visible, such as train and createDataPartition.
    What are their substitutes in ranger?


  5. @Tita you can continue using caret with method="ranger" to build the model using ranger.

  6. Very helpful! But I still don't really understand what mtry is doing. Is it the number of trees we are building?

    1. It's the number of variables tried at each node. The standard value is n/3 for regression and sqrt(n) for classification (n is the total number of variables).

  12. Thank you for the informative post! Is there any way to visualize a random forest like the plots for CART? Thank you!

  14. Your conclusion that the model is amazing is likely false, as the model seems to be overfitting. The assessment of a model should never be based on training data but on a separate validation set. Since the training data was used to create the model, it is a given that it fits well on the same data.
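On the speed question raised in the comments: the code from that comment did not survive, so the sketch below swaps in doParallel as the backend and `method = "ranger"` as the model, as suggested in the replies. The worker count, seed, and use of `iris` are assumptions, and the doParallel and ranger packages must be installed.

```r
library(caret)       # train() and createDataPartition() still come from caret
library(doParallel)

# Start a small cluster; 2 workers is an arbitrary choice for illustration.
cl <- makeCluster(2)
registerDoParallel(cl)

# allowParallel = TRUE is already the trainControl() default; shown for clarity.
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

set.seed(2014)  # arbitrary seed
fit_ranger <- train(Species ~ ., data = iris, method = "ranger", trControl = ctrl)

stopCluster(cl)
registerDoSEQ()  # return foreach to sequential execution

fit_ranger$bestTune
```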
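On the mtry question in the comments: the rule of thumb given in the reply can be worked out for the 53 predictors in this post. The rounding below follows the defaults in the randomForest package. In the post's output, caret tuned over 2, 27, and 53 instead and selected 27.

```r
p <- 53  # number of predictors reported in the post's output

mtry_classification <- floor(sqrt(p))        # randomForest default for classification
mtry_regression     <- max(floor(p / 3), 1)  # randomForest default for regression

mtry_classification  # 7
mtry_regression      # 17
```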
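On the visualization question in the comments: a forest has no single tree diagram like CART, but the randomForest package offers a few summaries. The sketch below is one possible approach on `iris`, with an arbitrary seed.

```r
library(randomForest)

set.seed(2014)  # arbitrary seed
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

varImpPlot(rf)  # dot chart of variable importance
plot(rf)        # error rate versus number of trees

# Split table for the first tree in the forest, with variable names as labels.
first_tree <- getTree(rf, k = 1, labelVar = TRUE)
head(first_tree)
```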