Here is an example of using a random forest with the caret package in R.
First, load the required packages.
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Read in the training and test sets.
training_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(training_URL,na.strings=c("NA",""))
test<-read.csv(test_URL,na.strings=c("NA",""))
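Before trimming anything, it is worth a quick look at the leading columns to confirm they really are just metadata (column names as I recall them for this dataset; verify on your own copy):
# the first few columns are a row index, the user name, and timestamp/window fields
names(training)[1:7]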
Then I got rid of the columns that are simply an index, a timestamp, or a username.
training<-training[,7:160]
test<-test[,7:160]
Remove the columns that are mostly NAs. They could be useful in the model, but it is easier to cut the data.frame down first and see if the smaller set still gives good results. Note that the threshold below actually keeps only the columns with no missing values at all (more than 19621 of the 19622 rows non-NA).
mostly_data<-apply(!is.na(training),2,sum)>19621 # TRUE only for columns with no NAs
training<-training[,mostly_data]
test<-test[,mostly_data]
dim(training)
## [1] 19622 54
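As an aside on why such a blunt cut is safe here: the columns in this dataset appear to be either complete or almost entirely NA, so the hard threshold discards very little partial information. You can check this on the freshly loaded data, before the filtering step above:
# tabulate the number of NAs per column (run before the NA filter)
table(colSums(is.na(training)))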
I partitioned the training set into a smaller set called training1, mainly to speed up the running of the model.
InTrain<-createDataPartition(y=training$classe,p=0.3,list=FALSE)
training1<-training[InTrain,]
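Since createDataPartition returns row indices, the 70% that was not sampled makes a natural hold-out set for an out-of-sample check later (see the check after the confusion matrix below). A sketch, where validation is my own name for it:
# for reproducibility, a seed would ideally be set before createDataPartition,
# e.g. set.seed(1234)
validation<-training[-InTrain,] # the rows not used for training
dim(validation)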
I then used caret to fit a random forest model with 5-fold cross-validation.
rf_model<-train(classe~.,data=training1,method="rf",
trControl=trainControl(method="cv",number=5),
prox=TRUE,allowParallel=TRUE)
print(rf_model)
## Random Forest
##
## 5889 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 1 1 0.006 0.008
## 27 1 1 0.005 0.006
## 53 1 1 0.006 0.007
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
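A note on allowParallel=TRUE: caret farms the cross-validation folds out via foreach, so this only helps if a parallel backend has been registered first. A minimal sketch using the doParallel package (not loaded in the session above, so treat this as an assumption about your setup):
require(doParallel)
cl<-makeCluster(detectCores()-1) # leave one core free
registerDoParallel(cl)
# ... run train() as above ...
stopCluster(cl) # release the workers when done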
print(rf_model$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE, allowParallel = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.88%
## Confusion matrix:
## A B C D E class.error
## A 1674 0 0 0 0 0.00000
## B 11 1119 9 1 0 0.01842
## C 0 11 1015 1 0 0.01168
## D 0 2 10 952 1 0.01347
## E 0 1 0 5 1077 0.00554
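Because the model only saw 30% of the data, the held-out 70% (the validation frame sketched earlier) gives an honest out-of-sample check. A hedged sketch using caret's confusionMatrix:
# predict on rows the model never saw
val_pred<-predict(rf_model,newdata=validation)
# compare predictions against the true classes
confusionMatrix(val_pred,factor(validation$classe))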
That is an amazingly good model! The OOB error estimate of 0.88% works out to roughly 99% accuracy; in my real work I am usually happy with anything above 0.7.
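The last step for the assignment would be scoring the 20-case test set. Since test was cut down to the same columns as training, predict() on the train object should work directly (a sketch, not re-run here):
# predict the class of each of the 20 test cases
answers<-predict(rf_model,newdata=test)
answers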