Big Computing: An example of using Random Forest in Caret with R.

Tuesday, October 28, 2014

An example of using Random Forest in Caret with R.

Here is an example of using Random Forest in the caret package with R.
First, load the required packages:
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Read in the training and test sets.
training_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(training_URL,na.strings=c("NA",""))
test<-read.csv(test_URL,na.strings=c("NA",""))
Then I got rid of the columns that are simply an index, timestamp or username.
training<-training[,7:160]
test<-test[,7:160]
Remove the columns that are mostly NAs. They could be useful in the model, but it is easier to cut the data.frame down and see if it gives good results. The test below keeps only the columns that have a value in every one of the 19,622 rows.
mostly_data<-apply(!is.na(training),2,sum)>19621
training<-training[,mostly_data]
test<-test[,mostly_data]
dim(training)
## [1] 19622    54
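A variant of this filter with an explicit threshold on the fraction of missing values can be sketched as follows. This is only an illustration on a small made-up data frame; `toy`, `max_na_fraction` and the 0.5 cutoff are assumptions, not part of the original analysis.

```r
# Toy data frame standing in for the training set (values are made up).
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 4),   # 75% missing
  c = c(1, NA, 3, 4),     # 25% missing
  d = c("x", "y", "x", "y")
)

# Fraction of missing values in each column.
na_fraction <- colMeans(is.na(toy))

# Keep columns whose NA fraction is at or below the chosen threshold.
max_na_fraction <- 0.5   # assumed cutoff for "mostly NAs"
keep <- na_fraction <= max_na_fraction

toy_kept <- toy[, keep, drop = FALSE]
print(names(toy_kept))
```

Setting the threshold to 0 gives the stricter rule used in the post. As in the post, the same `keep` vector should be applied to the test set so both data frames end up with the same columns.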
I partitioned the training set into a smaller set called training1, really just to speed up the running of the model.
InTrain<-createDataPartition(y=training$classe,p=0.3,list=FALSE)
training1<-training[InTrain,]
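createDataPartition samples within each class, so the subset keeps the class balance of the full data, and the rows that were not selected remain available as a hold-out set. A minimal sketch of that behaviour on the built-in iris data (used here as a stand-in so the example runs without a download; the seed and object names are assumptions):

```r
library(caret)

set.seed(123)  # assumed seed, only so the split is reproducible

# Stratified 30% sample of row indices, as in the post but on iris.
in_train <- createDataPartition(y = iris$Species, p = 0.3, list = FALSE)

train_part <- iris[in_train, ]
holdout_part <- iris[-in_train, ]   # the rows not used for fitting

# Class counts in each piece; the proportions match the full data.
print(table(train_part$Species))
print(table(holdout_part$Species))
```

In the post's setting the hold-out rows would be `training[-InTrain,]`.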
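I then used caret with random forest as my model, with 5-fold cross-validation.
# prox and allowParallel are passed through to randomForest(); to control
# parallel resampling in caret, set allowParallel in trainControl() instead
# (it defaults to TRUE there).
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)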
print(rf_model)
## Random Forest 
## 
## 5889 samples
##   53 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008   
##   27    1         1      0.005        0.006   
##   53    1         1      0.006        0.007   
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
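The mtry column in this output is the number of predictors considered at each split (the "No. of variables tried at each split" in the randomForest printout). For reference, here is a small sketch of the defaults the randomForest package falls back to when mtry is not supplied, with `p = 53` taken from the output above:

```r
p <- 53  # number of predictors reported by print(rf_model)

# randomForest defaults when mtry is not given:
mtry_classification <- floor(sqrt(p))      # classification
mtry_regression <- max(floor(p / 3), 1)    # regression

print(c(classification = mtry_classification,
        regression = mtry_regression))
```

caret's train instead tries a small grid of values (2, 27 and 53 here) and keeps the one with the best cross-validated accuracy. A specific grid can be requested through the tuneGrid argument, for example `tuneGrid = expand.grid(mtry = c(5, 7, 10))`; those three values are only an illustration.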
print(rf_model$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,      allowParallel = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.88%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554
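The overall accuracy implied by this confusion matrix can be recomputed by hand. The sketch below simply re-enters the counts printed above (rows are the actual classes, columns the out-of-bag predictions) and divides the correctly classified cases by the total:

```r
# Confusion matrix copied from the printout above.
classes <- c("A", "B", "C", "D", "E")
conf <- matrix(
  c(1674,    0,    0,   0,    0,
      11, 1119,    9,   1,    0,
       0,   11, 1015,   1,    0,
       0,    2,   10, 952,    1,
       0,    1,    0,   5, 1077),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

oob_accuracy <- sum(diag(conf)) / sum(conf)     # correct / total
class_error <- 1 - diag(conf) / rowSums(conf)   # per-class error, as in class.error

print(round(oob_accuracy, 4))
print(round(class_error, 5))
```

The 52 misclassified cases out of 5889 are the 0.88% OOB error rate reported by randomForest.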
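That is a pretty amazingly good model! The OOB error estimate of 0.88% shown above works out to about 0.991 accuracy, and the cross-validated accuracy rounds to 1. I usually hope for something above 0.7 in my real work.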
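The figures above come from cross-validation and the out-of-bag estimate, both computed on training1. Since only 30% of training was used for fitting, the remaining 70% is available as an independent check. A minimal sketch of such a check on the built-in iris data (a stand-in so the example runs without downloading anything; the seed, the 70/30 split and the object names are assumptions):

```r
library(caret)
library(randomForest)

set.seed(42)  # assumed seed for reproducibility

# Split into a fitting set and a hold-out set.
in_train <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
fit_set <- iris[in_train, ]
holdout <- iris[-in_train, ]

# Same kind of model as in the post: random forest with 5-fold CV.
fit <- train(Species ~ ., data = fit_set, method = "rf",
             trControl = trainControl(method = "cv", number = 5))

# Predict on rows the model has never seen and compare with the truth.
pred <- predict(fit, newdata = holdout)
cm <- confusionMatrix(pred, holdout$Species)
print(cm$overall["Accuracy"])
```

In the post's setting the equivalent call would predict on `training[-InTrain,]` and compare against its classe column.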

36 comments:

  1. Hi, what is the result for your test set?

  2. Using caret for random forests is so slow on my laptop, compared to using the random forest package.
    I tried to find some information on running R in parallel. I installed the multicore package and ran the following before train():

    library(doMC)
    registerDoMC(5)

    That seems to help.

  3. Nate, you are correct: you need to add a Do package, otherwise there is no parallel backend. Usually those libraries come across as dependencies when you load the caret package. Remember that caret is doing a lot of other work besides just running the random forest, depending on your actual call. Also try the ranger random forest package in R. It is much faster than Andy's package (randomForest).

  4. Hi NPHard,
    I tried the ranger package but some functions were not visible, such as train and createDataPartition.
    What are their substitutes in ranger?

    Thanks,

  5. @Tita you can continue using caret with method="ranger" to build the model using ranger.

  6. Very helpful! But I still don't really understand what mtry is doing. Is it the number of trees we are building?

    Replies
    1. It's the number of variables tried at each node. The standard value is n/3 for regression and sqrt(n) for classification (n is the total number of variables).

  12. Thank you for the informative post! Is there any way to visualize a random forest like those for CART? Thank you!

  14. Your conclusion that the model is amazing is likely false, as the model seems to be overfitting. The assessment of a model should never be based on training data but on a separate validation set. Since the training data was used to create the model, it is a given that it fits well on the same data.
