Big Computing: Introduction to using Random Forests for the Kaggle Titanic Data Set

Monday, June 30, 2014

Introduction to using Random Forests for the Kaggle Titanic Data Set

During the summer several members of the Connecticut R User Group decided to work on a Kaggle competition data set to improve our R programming skills. The first data set we tried was the Titanic data set, a fairly simple one in which the goal is to predict which passengers survived and which did not. The team's first task was simply to load the data set and the randomForest package and run a basic model. We completed that task last week. The R code is below:

## Load the randomForest package (install.packages only needs to be run once)
install.packages("randomForest")
library(randomForest)
## Read in Titanic Data training and test set
train.data<-read.csv("~/titanic/train(2).csv")
test.data<-read.csv("~/titanic/test.csv")
## Convert data into a simpler dataframe
## (factor levels sort alphabetically, so sex becomes female = 1, male = 2)
train <- data.frame(survived=train.data$Survived,
                    age=train.data$Age,
                    fare=train.data$Fare,
                    pclass=train.data$Pclass,
                    sex=as.integer(factor(train.data$Sex))
                    )
test <- data.frame( age=test.data$Age,
                    fare=test.data$Fare,
                    pclass=test.data$Pclass,
                    sex=as.integer(factor(test.data$Sex))
                    )
## Now we need to get rid of the NAs: missing fares become 0 and missing ages become 30
train$fare[ is.na( train$fare) ]   <- 0
train$age[ is.na( train$age) ]     <- 30
test$fare[ is.na( test$fare) ]   <- 0
test$age[ is.na( test$age) ]     <- 30
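## Split off the response as a factor so randomForest does classification
labels <- as.factor(train[,1])
train <- train[,-1]
## Fit a random forest and predict on the test set in the same call (via xtest)
rf <- randomForest(train, labels, xtest=test, ntree=100, do.trace=TRUE)
## Map the predicted factor back to the original "0"/"1" labels
predictions <- levels(labels)[rf$test$predicted]
results <- predictions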
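
The code above stops once the predictions are in results. A natural next step, which we did not run as part of this task, is to write them to a file with one PassengerId column and one Survived column, the layout this Kaggle competition uses for submissions. Here is a minimal sketch of that step. The passenger_id and results values are made-up stand-ins so that the snippet runs on its own (in the real script they would be test.data$PassengerId and the results vector from the forest), and the output file name is just an assumption.

```r
## Hypothetical stand-ins so this sketch runs by itself;
## in the real script use test.data$PassengerId and results from above.
passenger_id <- c(892, 893, 894)
results <- c("0", "1", "0")

## Build the two-column submission table; results holds the labels as
## character strings, so convert them back to integers.
submission <- data.frame(PassengerId = passenger_id,
                         Survived = as.integer(results))

## Assumed output location; row.names = FALSE keeps R from adding an index column.
out_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_file, row.names = FALSE)
```

Before writing the real file it is worth confirming that results has the same number of entries as test.data has rows.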


As we do more work on the model and try other approaches, I will post updates on my blog.
