Big Computing: February 2016

The Twitter R package by Jeff Gentry is a great way to get started in Text Analytics. This tutorial is about how to get setup to pull twitter data into R so that you can do some text analytics with it. Basically it is only the first step of getting data in a text analytics Workflow. I should also give a shout out here to Ted Kawartler for getting me interested in learning more about text analytics.

The first thing you need to do is downloadd the twitteR package and make it available in your R session.

library(twitteR)

Now on the Twitter side you need to do a few things to get setup if you have not done them already. I will go through this step by step so no one gets lost. It drives me crazy when people go too fast on the setup so I can't use the tool.

You need to have a twitter account. if you do not have one go to http://twitter.com/signup and set one up. Also you need to have a modile number as part of this account.

Now that you have created a twitter account you need to go to https://apps.twitter.com and sign on with your twitter account.

Once you have signed in you should see the following screen, and simply click on the button that says "Create New App".

Once you click on the "Create New App" button you will go to the Create an Application screen. There are three fields, a click box and a button you need to click on this page. The three fields are Name, Description and Website. The name of the application must be unique so this may take a few tries. The description needs to be at least 10 character long, and put in a website. If you do not have one you can use https://bigcomputing.blogspot.com. Now click the "Yes, I agree" box for the license agreement and click the "Create your Twitter application".

Once you successfully create an appplication you will be taken to the application page. Once there click on the "key and access token" tab. From that page you are going to need four things.

1. Consumer Key (API Key)

2. Consumer Secret (API Secret)

click the "Create my access token" button.

3. Access Token

4. Access Token Secret

Now re-open your R session and enter the following code using those four pieces of information.

consumer_key <- "your_consumer_key"

consumer_secret <- "your_consumer_secret"

access_token <- "your_access_token"

access_secret <- "your_access_secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Now you are set up on the Twitter side and the R side you should be ready to go. I will cover doing some basic data pulls from Twitter in my next post

Cubist is a machine learning algorithm for continous outcomes. Cubist is a rule-based decision tree that automatically deals with missing values. This makes using Cubist ideal for baselining the perdictive value of your data set because if it is messy with a lot missing values, you do not have to deal with it. Cubist has become my first-try model for all continous outcome data sets.
Cubist was developed by Quinlan, and the R package for Cubist is maintained by Max Kuhn who also maintains the Caret package.
The code for calling a Cubist model is fairly standard for most predictive models in R.
cubist( x= trainingpredictors, y = trainingoutcomes)
There are some other elements that help improve the basic Cubist model’s performance, but let’s start with the simple model and go from there. For this example, we are going to use the BostonHousing data set the is contained in the mlbench package. The Data comes from a 1978 paper by Harrison and Rubinfeld (“Hedonic Prices and the Demand for Clean Air,” Journal of Environmental Economics and Management, vol. 5, 1978, pp. 81-102). It is a very well-know data set with 506 rows and 19 variables. Let’s look at that data set before we move on to creating and evaluating a predictive model in R.

require(mlbench)

## Loading required package: mlbench

require(caret)

## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2

require(Cubist)

## Loading required package: Cubist

data(BostonHousing)
dim(BostonHousing)

## [1] 506  14

str(BostonHousing)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : num  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ b      : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

As you can, see it is a data set with 506 rows and 19 columns of all numeric values. We are going to try to predict the value of the last column (medv) which is the median value of owner-occupied homes in USD 1000’s. Here is a description of the data in each of the 19 columns.
crim crime rate of town zn proportion of residential land zoned for lot over 25,000 sq.ft. indus proportion of non-retail business acres per town chas Charles River Dummey Variable ( = 1 if tract bounds Charles River, = 0 if not) nox nitrix oxides concentration in parts per 10 million rm average number of rooms per dwelling age proportion of owner occupied units built before 1940 dis weighted distances to five Boston Employment centers rad index of accessibility to radial highways tax full value property tax per USD 10,000 ptratio pupil to teacher ratio per town b 1000(B-0.63)^2 where B is the proportion of African Americans in the town lstat percentage of lower status of the population medv median value of owner-occupied homes in USD 1000’s
Normally when you build a predictive model, you break that data set into two or three data sets - training, test, and hold out data set. That may differ slightly if you are using cross-validation, but in general I make a training and a test set. Here I will use an 80/20 split . I am also going to do a little modification to the chas variable

BostonHousing$chas <- as.numeric(BostonHousing$chas) - 1
set.seed(1)
inTrain <- sample(1:nrow(BostonHousing), floor(.8*nrow(BostonHousing)))
trainingPredictors <- BostonHousing[ inTrain, -14]
testPredictors     <- BostonHousing[-inTrain, -14]
trainingOutcome <- BostonHousing$medv[ inTrain]
testOutcome     <- BostonHousing$medv[-inTrain]

Now all we have to do is fit the model, make a prediction and then evaluate the prediction. Since we are predicting a continous variable here, we will use Root Mean Squared Error (RSME).
So fit the model

trainingPredictors <- BostonHousing[ inTrain, -14]
testPredictors     <- BostonHousing[-inTrain, -14]
trainingOutcome <- BostonHousing$medv[ inTrain]
testOutcome     <- BostonHousing$medv[-inTrain]
modelTree <- cubist(x = trainingPredictors, y = trainingOutcome)
modelTree

## 
## Call:
## cubist.default(x = trainingPredictors, y = trainingOutcome)
## 
## Number of samples: 404 
## Number of predictors: 13 
## 
## Number of committees: 1 
## Number of rules: 4

Look at the model

summary(modelTree)

## 
## Call:
## cubist.default(x = trainingPredictors, y = trainingOutcome)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Wed Feb 17 21:19:55 2016
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 404 cases (14 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [88 cases, mean 13.81, range 5 to 27.5, est err 2.10]
## 
##     if
##  nox > 0.668
##     then
##  outcome = 2.07 + 3.14 dis - 0.35 lstat + 18.8 nox + 0.007 b
##            - 0.12 ptratio - 0.008 age - 0.02 crim
## 
##   Rule 2: [153 cases, mean 19.54, range 8.1 to 31, est err 2.16]
## 
##     if
##  nox <= 0.668
##  lstat > 9.59
##     then
##  outcome = 34.81 - 1 dis - 0.72 ptratio - 0.056 age - 0.19 lstat + 1.5 rm
##            - 0.11 indus + 0.004 b
## 
##   Rule 3: [39 cases, mean 24.10, range 11.9 to 50, est err 2.73]
## 
##     if
##  rm <= 6.23
##  lstat <= 9.59
##     then
##  outcome = 11.89 + 3.69 crim - 1.25 lstat + 3.9 rm - 0.0045 tax
##            - 0.16 ptratio
## 
##   Rule 4: [128 cases, mean 31.31, range 16.5 to 50, est err 2.95]
## 
##     if
##  rm > 6.23
##  lstat <= 9.59
##     then
##  outcome = -1.13 + 1.6 crim - 0.93 lstat + 8.6 rm - 0.0141 tax
##            - 0.83 ptratio - 0.47 dis - 0.019 age - 1.1 nox
## 
## 
## Evaluation on training data (404 cases):
## 
##     Average  |error|               2.27
##     Relative |error|               0.34
##     Correlation coefficient        0.94
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##     78%   100%    lstat
##     59%    53%    nox
##     41%    78%    rm
##           100%    ptratio
##            90%    age
##            90%    dis
##            62%    crim
##            59%    b
##            41%    tax
##            38%    indus
## 
## 
## Time: 0.0 secs

Make a prediction

mtPred <- predict(modelTree, testPredictors)

Get the RMSE

sqrt(mean((mtPred - testOutcome)^2))

## [1] 3.337924

That is not bad, but we can do better using Committees and Neighbors
Model Committes are created by generating a rule-based sequence of models similar to boosting. The number of committees can range from 1 to 100.
Let’s do a committee Cubist model with committees set to 100

set.seed(1)
committeeModel <- cubist(x = trainingPredictors, y = trainingOutcome, committees = 100)
## Get RMSE of COmmittee
comPred <- predict(committeeModel, testPredictors)
## RMSE
sqrt(mean((comPred - testOutcome)^2))

## [1] 2.779002

Now let’s add neighbors to the committees, which adjusts the model based adjacent solutions.

instancePred <- predict(committeeModel, testPredictors, neighbors = 4)
sqrt(mean((instancePred - testOutcome)^2))

## [1] 2.566348

So now the question is, what combination of committees and neighbors yields the best prediction? We can answer that by creating a vector of possible committees, and a vector of possible neighbors, then seeing where the RSME is best.

set.seed(1)
cTune <- train(x = trainingPredictors, y = trainingOutcome,"cubist",
               tuneGrid = expand.grid(.committees = c(1, 10, 50, 100),
                                      .neighbors = c(0, 1, 5, 9)),
                                      trControl = trainControl(method = "cv"))
cTune

## Cubist 
## 
## 404 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 363, 363, 363, 363, 362, 365, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   RMSE SD    Rsquared SD
##     1         0          4.081800  0.7916640  1.3007653  0.15005686 
##     1         1          4.111955  0.7950087  1.2540113  0.13995896 
##     1         5          3.943515  0.8054412  1.2727587  0.14070680 
##     1         9          3.959522  0.8022459  1.3305391  0.14884521 
##    10         0          3.371765  0.8597818  0.9354412  0.08111265 
##    10         1          3.370218  0.8681521  0.8462733  0.07253983 
##    10         5          3.168392  0.8767757  0.9409569  0.07777561 
##    10         9          3.207153  0.8725973  0.9499315  0.07980860 
##    50         0          3.238911  0.8704658  0.9819922  0.08369843 
##    50         1          3.257555  0.8741483  0.9284914  0.08006349 
##    50         5          3.035711  0.8845178  1.0167411  0.08284853 
##    50         9          3.071004  0.8810091  1.0233749  0.08444221 
##   100         0          3.211165  0.8713608  1.0185290  0.08500905 
##   100         1          3.254918  0.8739276  0.9853192  0.08458200 
##   100         5          3.005851  0.8855715  1.0492541  0.08529563 
##   100         9          3.044205  0.8820627  1.0572761  0.08671512 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were committees = 100 and neighbors
##  = 5.

As you can see, Cubist does really well as a predictive model.

Big Computing

Tuesday, February 23, 2016

Setting up the Twitter R package for text analytics

Wednesday, February 17, 2016

A Machine Learning Example in R for Continous Outcomes using Cubist

Subscribe To My Blog

Tuesday, February 23, 2016

Setting up the Twitter R package for text analytics

Wednesday, February 17, 2016

A Machine Learning Example in R for Continous Outcomes using Cubist