During the summer, a number of members of the Connecticut R User Group decided to work on a Kaggle competition data set to improve our R programming skills. The first data set we tried was the Titanic data set, a fairly simple one where the goal is to predict which passengers survived and which did not. The team's first task was simply to load the data set and the randomForest package and run a basic model. We completed that task last week. The R code to do that is below:
## load the randomForest package (the install step only needs to be run once)
install.packages("randomForest")
library(randomForest)
## Read in Titanic Data training and test set
train.data<-read.csv("~/titanic/train(2).csv")
test.data<-read.csv("~/titanic/test.csv")
## Convert data into a simpler dataframe
train <- data.frame(survived = train.data$Survived,
                    age      = train.data$Age,
                    fare     = train.data$Fare,
                    pclass   = train.data$Pclass,
                    sex      = as.integer(factor(train.data$Sex)))
test <- data.frame(age    = test.data$Age,
                   fare   = test.data$Fare,
                   pclass = test.data$Pclass,
                   sex    = as.integer(factor(test.data$Sex)))
## now we need to get rid of the NAs and fill in simple default values
train$fare[is.na(train$fare)] <- 0
train$age[is.na(train$age)] <- 30   # a rough default for a typical age
test$fare[is.na(test$fare)] <- 0
test$age[is.na(test$age)] <- 30
## separate the response (survived) from the predictors
labels <- as.factor(train[, 1])
train <- train[, -1]
# fit a random forest and make a prediction
rf <- randomForest(train, labels, xtest = test, ntree = 100, do.trace = TRUE)
results <- levels(labels)[rf$test$predicted]
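If we want to submit these predictions to Kaggle, they need to be paired with the PassengerId column from the test file. Here is a minimal sketch of that step, assuming the standard Kaggle submission format of a PassengerId column and a Survived column (the file name is our own choice):
## write the predictions in Kaggle's submission format
submission <- data.frame(PassengerId = test.data$PassengerId,
                         Survived = results)
write.csv(submission, "titanic_submission.csv", row.names = FALSE)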
As we do more work on the model and try other approaches, I will post updates here on the blog.
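One refinement we might try, sketched here purely as an illustration: instead of filling every missing age with a flat 30, impute the median age within each passenger class, since ages tend to differ by class.
## illustrative sketch: use the median age per passenger class as the imputed value
median.age <- tapply(train.data$Age, train.data$Pclass, median, na.rm = TRUE)
missing <- is.na(train.data$Age)
train.data$Age[missing] <- median.age[as.character(train.data$Pclass[missing])]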
Tuesday, June 10, 2014
Taco Bell's Waffle Taco: a novelty that needs to go the way of the pet rock.
This morning I decided to try the Waffle Taco at Taco Bell. What a mistake! It was terrible: a tasteless frozen waffle with rubbery pseudo-food inside. Do not even bother trying this, because you will just regret it like I do. No wonder they use old people in their commercials for this item. Their taste buds are already shot.
Sunday, June 8, 2014
Reading a large number of files into R
I know this is a fairly basic topic, but it is one that caused me problems lately. Normally I only have to read in one data file at a time or I read in a few tables separately.
If I am reading in a single file, I would do the following:
>read.table("file")
or, if it is online:
>read.table("url")
If it is a csv file:
>read.csv("file")
The problem arose because I needed to read in 400 files from a directory, and the file names were not numerically indexed. To solve this, I used the functions list.files and paste.
>files <- list.files("~/directory/")
>complete_names <- paste("~/directory/", files, sep = "")
>monitors <- NULL   # start empty so rbind works on the first file
>for (i in seq_along(complete_names)) {
+     monitors <- rbind(monitors, read.csv(complete_names[i]))
+ }
It was slow, but it got the job done.
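Most of the slowness comes from calling rbind inside the loop, which copies the growing data frame on every pass. A faster pattern, assuming all the files share the same columns, is to read everything into a list and bind once at the end. Passing full.names = TRUE to list.files also removes the need for paste:
## read all files into a list, then bind once at the end
files <- list.files("~/directory/", full.names = TRUE)
monitors <- do.call(rbind, lapply(files, read.csv))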