I blog about the world of Data Science with Visualization, Big Data, Analytics, Sabermetrics, Predictive HealthCare, Quant Finance, and Marketing Analytics using the R language.
Monday, November 3, 2014
An Example of Parallel Computing Performance Using doMC for a Random Forest Model in caret in R
Parallel R
I was working on a project using random forest from the caret package in R. Since the data set was not tiny and random forest tends to run fairly slowly, I was using caret's built-in parallel capability, which relies on foreach and supports a number of parallel backends. Since I have a four-core laptop, I use the doMC backend and set the number of cores to 4. It got me wondering whether 4 was really the optimal number of workers (remember, the number you register is not the number of cores on your machine, but the number of workers you create to receive jobs). Would it be better to register five workers, so that whenever one worker finished a job there would always be another ready for the core that just opened up? On the other hand, would I be better off with 3 workers, leaving one core free to act as a head node? (I have heard this improves performance with MPI, but I have never actually tried that approach.) So I decided to run some tests to find out.
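For reference, registering the backend looks like this (a minimal sketch; doMC works by forking, so it runs on Linux and macOS but not Windows):

```r
# Register a doMC backend with 4 workers and confirm what foreach sees.
library(doMC)
library(parallel)

registerDoMC(cores = 4)   # the number of workers to create, not physical cores
getDoParWorkers()         # 4: the registered worker count
detectCores()             # how many cores the machine actually reports
```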
First Load in the required packages
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
require(doMC)
## Loading required package: doMC
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
Then I read in the data set that I was playing with.
training_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(training_URL,na.strings=c("NA",""))
test<-read.csv(test_URL,na.strings=c("NA",""))
Then I got rid of the columns that are simply an index, a timestamp, or a username.
training<-training[,7:160]
test<-test[,7:160]
Next, remove the columns that are mostly NAs. (The training set has 19,622 rows, so keeping only columns with more than 19,621 non-NA values retains the fully populated columns.) I wanted this to run in a reasonable amount of time even on a single core.
mostly_data<-apply(!is.na(training),2,sum)>19621
training<-training[,mostly_data]
test<-test[,mostly_data]
dim(training)
## [1] 19622 54
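As a toy illustration of that column filter (a hypothetical 3-row data frame, not the pml data): `apply(!is.na(df), 2, sum)` counts the non-NA values per column, and the comparison keeps only the complete ones.

```r
# Keep columns whose non-NA count exceeds nrow - 1, i.e. fully populated columns.
df <- data.frame(a = 1:3, b = c(1, NA, 3), c = c(NA, NA, NA))
mostly_data <- apply(!is.na(df), 2, sum) > 2
df[, mostly_data, drop = FALSE]   # keeps only column `a`
```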
I partitioned the training set into a smaller set called training1, mainly to get the model to run in a reasonable amount of time.
InTrain<-createDataPartition(y=training$classe,p=0.3,list=FALSE)
training1<-training[InTrain,]
To establish a baseline, I ran the model on a single worker.
registerDoMC(cores=1)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
trControl=trainControl(method="cv",number=5),
prox=TRUE,allowParallel=TRUE)
time1<-proc.time()-ptm
print(time1)
## user system elapsed
## 737.271 5.038 742.307
Next, try it with three workers.
registerDoMC(cores=3)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
trControl=trainControl(method="cv",number=5),
prox=TRUE,allowParallel=TRUE)
time3<-proc.time()-ptm
print(time3)
## user system elapsed
## 323.22 2.58 209.38
Now four workers
registerDoMC(cores=4)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
trControl=trainControl(method="cv",number=5),
prox=TRUE,allowParallel=TRUE)
time4<-proc.time()-ptm
print(time4)
## user system elapsed
## 556.600 4.688 178.345
And finally 5 workers
registerDoMC(cores=5)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
trControl=trainControl(method="cv",number=5),
prox=TRUE,allowParallel=TRUE)
time5<-proc.time()-ptm
print(time5)
## user system elapsed
## 503.992 4.991 158.250
I also wanted to check 4 workers with four-fold cross-validation.
registerDoMC(cores=4)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
trControl=trainControl(method="cv",number=4),
prox=TRUE,allowParallel=TRUE)
time6<-proc.time()-ptm
print(time6)
## user system elapsed
## 528.90 5.08 132.72
print(time1)
## user system elapsed
## 737.271 5.038 742.307
print(time3)
## user system elapsed
## 323.22 2.58 209.38
print(time4)
## user system elapsed
## 556.600 4.688 178.345
print(time5)
## user system elapsed
## 503.992 4.991 158.250
print(time6)
## user system elapsed
## 528.90 5.08 132.72
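To put the runs side by side, here is a quick speedup calculation from the elapsed times printed above (five-fold CV runs only; the four-fold run is left out since it changes the workload):

```r
# Speedup of each worker count over the single-worker baseline,
# using the elapsed times (in seconds) from the runs above.
elapsed <- c(w1 = 742.307, w3 = 209.38, w4 = 178.345, w5 = 158.250)
speedup <- round(elapsed[["w1"]] / elapsed, 2)
print(speedup)   # w1 = 1.00, w3 = 3.55, w4 = 4.16, w5 = 4.69
```

The 5-worker ratio comes out above 4 on a 4-core machine, which likely reflects run-to-run variation in the timings rather than true superlinear scaling.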
I actually ran this analysis a number of times, and setting the number of workers to 5 on my 4-core machine consistently yielded the best performance.