During the summer, a number of members of the Connecticut R User Group decided to work on a Kaggle competition data set to improve our R programming skills. The first data set we tried was the Titanic data set. This is a fairly simple data set from which we try to predict which passengers survived and which did not. The task for the team was simply to load the data set and the randomForest package and run a basic model. We completed that task last week. The R code to do that is below:
# Install the randomForest package if it is missing, then load it
if (!requireNamespace("randomForest", quietly = TRUE)) install.packages("randomForest")
library(randomForest)
## Read in Titanic Data training and test set
train.data<-read.csv("~/titanic/train(2).csv")
test.data<-read.csv("~/titanic/test.csv")
## Convert data into a simpler data frame
train <- data.frame(survived = train.data$Survived,
                    age = train.data$Age,
                    fare = train.data$Fare,
                    pclass = train.data$Pclass,
                    # encode sex as an integer code (female = 1, male = 2)
                    sex = as.integer(factor(train.data$Sex)))
test <- data.frame(age = test.data$Age,
                   fare = test.data$Fare,
                   pclass = test.data$Pclass,
                   sex = as.integer(factor(test.data$Sex)))
## Now we need to get rid of the NAs: missing fares become 0 and missing ages become 30
train$fare[is.na(train$fare)] <- 0
train$age[is.na(train$age)] <- 30
test$fare[is.na(test$fare)] <- 0
test$age[is.na(test$age)] <- 30
## Split off the survival labels as a factor and drop that column from the predictors
labels <- as.factor(train[, 1])
train <- train[, -1]
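# Fit a random forest on the training data and predict the test set in the same call
rf <- randomForest(train, labels, xtest = test, ntree = 100, do.trace = TRUE)
# Map the predicted factor back to its labels ("0" or "1")
predictions <- levels(labels)[rf$test$predicted]
results <- predictions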
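One possible refinement to the NA handling above: rather than hard-coding 0 and 30, the fill values could be computed from the training data. The sketch below is only an illustration and not what we ran. It uses small made-up data frames (train_toy and test_toy, both hypothetical names) in place of the Kaggle files, and the choice of the median as the fill value is an assumption.

```r
# Toy stand-ins for the train/test data frames (values are made up)
train_toy <- data.frame(age  = c(22, NA, 35, 54),
                        fare = c(7.25, 71.28, NA, 51.86))
test_toy  <- data.frame(age  = c(NA, 47),
                        fare = c(NA, 7.00))

# Compute the fill values from the training rows only,
# then reuse the same values for the test set
age_fill  <- median(train_toy$age,  na.rm = TRUE)
fare_fill <- median(train_toy$fare, na.rm = TRUE)

train_toy$age[is.na(train_toy$age)]   <- age_fill
train_toy$fare[is.na(train_toy$fare)] <- fare_fill
test_toy$age[is.na(test_toy$age)]     <- age_fill
test_toy$fare[is.na(test_toy$fare)]   <- fare_fill
```

Taking the fill values from the training set alone keeps the test set from influencing the preprocessing, which is the usual reason for doing it this way.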
As we do more work on the model and try other approaches, I will post updates on my blog.
I blog about the world of Data Science, including Visualization, Big Data, Analytics, Sabermetrics, Predictive HealthCare, Quant Finance, and Marketing Analytics, using the R language.