Big Computing

Monday, November 3, 2014

An Example of Parallel Computing Performance using doMC for a Random Forest Model in Caret in R.

Parallel R

I was working on a project using random forest from the caret package in R. Since it was not a tiny data set, and random forest tends to run fairly slowly, I was using caret's built-in parallel capability, which runs on foreach and supports a number of parallel backends. Since I have a four-core laptop, I use the doMC backend and set the number of cores equal to 4. It may sound odd, but that got me wondering whether 4 was really the optimal number of workers (remember, the number you register is not the number of cores on your machine, but the number of workers you create to receive jobs). Would it be better to register five workers, so that whenever one worker finished a job there would always be another ready for the core that just opened up? On the other hand, would I be better off with 3 workers, leaving one core free to act as a head node? (I have heard this improves performance when you use MPI, but I have never actually used that approach.) So I decided to run some tests to find out.

First, load the required packages:

require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
require(doMC)
## Loading required package: doMC
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

Then read in the data set I was playing with:

training_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_URL<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(training_URL,na.strings=c("NA",""))
test<-read.csv(test_URL,na.strings=c("NA",""))

Then I got rid of the columns that are simply an index, timestamp, or username:

training<-training[,7:160]
test<-test[,7:160]

Next, remove the columns that are mostly NAs. I wanted this to run in a reasonable amount of time even on a single core.

mostly_data<-apply(!is.na(training),2,sum)>19621
training<-training[,mostly_data]
test<-test[,mostly_data]
dim(training)
## [1] 19622    54

I partitioned the training set into a smaller set called training1, again mostly to get the model to run in a reasonable amount of time:

InTrain<-createDataPartition(y=training$classe,p=0.3,list=FALSE)
training1<-training[InTrain,]
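As an aside on the worker-versus-core point above, here is a minimal sketch (not part of the original analysis) of how you can check how many cores the machine reports and confirm how many workers foreach will actually use. detectCores() comes from the parallel package and getDoParWorkers() from foreach, both of which doMC already loads.

# Sketch: relate reported cores to registered workers (illustrative only).
n_cores <- parallel::detectCores()   # cores the machine reports
print(n_cores)
registerDoMC(cores = n_cores + 1)    # workers, not cores: oversubscribing by one is allowed
getDoParWorkers()                    # how many workers foreach will hand jobs to

The point is simply that registerDoMC() sets a worker count, which is free to be larger or smaller than the number of cores.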
To establish a baseline, I ran the model on a single worker:

registerDoMC(cores=1)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time1<-proc.time()-ptm
print(time1)
##    user  system elapsed
## 737.271   5.038 742.307

Next, try it with three workers:

registerDoMC(cores=3)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time3<-proc.time()-ptm
print(time3)
##    user  system elapsed
##  323.22    2.58  209.38

Now four workers:

registerDoMC(cores=4)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time4<-proc.time()-ptm
print(time4)
##    user  system elapsed
## 556.600   4.688 178.345

And finally, five workers:

registerDoMC(cores=5)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=5),
                prox=TRUE,allowParallel=TRUE)
time5<-proc.time()-ptm
print(time5)
##    user  system elapsed
## 503.992   4.991 158.250

I also wanted to check four workers with four cross-validation folds instead of five:

registerDoMC(cores=4)
ptm<-proc.time()
rf_model<-train(classe~.,data=training1,method="rf",
                trControl=trainControl(method="cv",number=4),
                prox=TRUE,allowParallel=TRUE)
time6<-proc.time()-ptm
print(time6)
##   user  system elapsed
## 528.90    5.08  132.72

Putting all the timings side by side:

print(time1)
##    user  system elapsed
## 737.271   5.038 742.307
print(time3)
##    user  system elapsed
##  323.22    2.58  209.38
print(time4)
##    user  system elapsed
## 556.600   4.688 178.345
print(time5)
##    user  system elapsed
## 503.992   4.991 158.250
print(time6)
##   user  system elapsed
## 528.90    5.08  132.72

I actually ran this analysis a number of times, and consistently setting the number of workers to 5 on my 4-core machine yielded the best performance.
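If I were rerunning this benchmark, something like the sketch below would avoid copy-pasting the same block for every worker count. time_rf() and worker_counts are just illustrative names, and the train() call simply mirrors the one used above.

# Sketch: time the same caret random forest fit for several worker counts.
time_rf <- function(workers, folds = 5) {
  registerDoMC(cores = workers)
  ptm <- proc.time()
  train(classe~., data = training1, method = "rf",
        trControl = trainControl(method = "cv", number = folds),
        prox = TRUE, allowParallel = TRUE)
  (proc.time() - ptm)["elapsed"]
}

worker_counts <- c(1, 3, 4, 5)
elapsed <- sapply(worker_counts, time_rf)
data.frame(workers = worker_counts, elapsed = elapsed)

The returned data frame makes it easy to compare elapsed times across worker counts, or to plot them with ggplot2, which is already loaded.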
