Big Computing: March 2015

Monday, March 9, 2015

A new version of R was released today. R 3.1.3, also called "Smooth Sidewalk", is expected to be the last release before R's annual release, version 3.2. Here is a list of the changes to R included in this most recent release.


  • The internal method of download.file() can now handle files larger than 2GB on 32-bit builds which support such files (tested on 32-bit R running on 64-bit Windows).
  • kruskal.test() warns on more types of suspicious input.
  • The as.dendrogram() method for "hclust" objects gains a check argument protecting against memory explosion for invalid inputs.
  • capabilities() has a new item long.double which indicates if the build uses a long double type which is longer than double.
  • nlm() no longer modifies the callback argument in place (a new vector is allocated for each invocation, which mimics the implicit duplication that occurred in R < 3.1.0); note that this is a change from the previously documented behavior. (PR#15958)
  • icuSetCollate() now accepts locale = "ASCII" which uses the basic C function strcmp and so collates strings byte-by-byte in numerical order.
  • sessionInfo() tries to report the OS version in use (not just that compiled under, and including details of Linux distributions).
  • model.frame() (used by lm() and many other modelling functions) now warns when it drops contrasts from factors. (Wish of PR#16119)
  • install.packages() and friends now accept the value type = "binary" as a synonym for the native binary type on the platform (if it has one).
  • Single source or binary files can be supplied for install.packages(type = "both") and the appropriate type and repos = NULL will be inferred.
  • New function pcre_config() to report on some of the configuration options of the version of PCRE in use. In particular, this reports if regular expressions using \p{xx} are supported.
  • (Windows.) download.file(cacheOK = FALSE) is now supported when ‘internet2.dll’ is used.
  • browseURL() has been updated to work with Firefox 36.0 which has dropped support for the -remote interface.
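
A few of these additions can be tried directly from the console. The following is a minimal sketch, assuming R 3.1.3 or later; the values depend on the build and platform, so no expected output is shown.

```r
# Does this build use a long double type that is longer than double?
has_long_double <- capabilities("long.double")
print(has_long_double)

# Report the PCRE configuration; the entry on Unicode properties
# indicates whether regular expressions using \p{xx} are supported.
pcre <- pcre_config()
print(pcre)

# sessionInfo() now tries to report the OS version actually in use.
print(sessionInfo())
```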

Thursday, March 5, 2015

Boston Data Science Conference in May

There is a Data Science Conference in Boston on May 30-31. This is going to be a great conference. It is unique in that it brings together some of the best minds not only of R but also of Python. When those two groups have gotten together in the past, the conversations have been powerful and productive. When Wes McKinney spoke at the NYC R meetup about his Python pandas project, it started one of the most thoughtful discussions of the elements of the two languages.

This meeting should have all of that and then some. I have heard most of the speakers before and each one of them is usually the highlight of the conference that I am at. Also I am a big fan of the Boston Data Science Community. It is vibrant and diverse. They use Python, R, Julia and other tools to get their work done in finance, pharma and other fields. The Meetup groups from John's Predictive Analytics Meetup to Josh's Greater Boston R Meetup are first class and massively supported with thousands of members.

I also like the fact that this conference comes a month after the NYC R Conference. I feel that in a way they will both give you pieces of a larger puzzle, and at a reasonable cost. You can go to both of these conferences for a third of what it would cost to go to just Strata or PAWS.

Here is a link to the Conference:

Wednesday, March 4, 2015

New York R Conference

There is going to be an R Conference in New York on April 24-25. This conference has been in the making for a long time and promises to be an outstanding gathering of R users and data scientists.

New York is one of the world centers of data science, and it is good to see that it is finally getting its due with a technical conference. Yes, there are Strata and PAWS conferences in New York, but those conferences tend to be higher level than I am interested in.

The cost of this conference is quite reasonable with many options under $550 for the two days. That leaves plenty of money left over for food and beer.

The early list of speakers includes such luminaries as Andy Gelman, Jared Lander, Bryan Lewis and Harlan Harris.

Here is a link to the conference website: NYC R Conference

Monday, March 2, 2015

Simple Problems in R cause big headaches....Always check your variable class

Recently a friend of mine who is a newer user of R was having some problems getting his data to graph properly. He went around and around with it, but simply could not figure out what was going on. This is one of the tough things about R. It will usually run your code even if there is a little problem with what you are doing. I have been stuck for hours because of something I forgot to do. The most common reason I end up having a problem is that the data is not of the class I thought it was. So the code was doing everything it was supposed to do correctly, but I had neither changed the class of the data nor verified that it was correct anywhere in my code. When you are developing, I highly recommend that you do that. It will save you a ton of time and stress.
I will walk through his issue because it does a great job of showing the problem.
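One way to follow this advice is to assert the class you expect right after loading the data. Here is a minimal sketch of that idea; the helper check_class() and the data frame incidents are hypothetical names made up for illustration, not part of the original code.

```r
# Hypothetical helper: stop with a clear message if an object is not of
# the expected class.
check_class <- function(x, expected) {
  if (!inherits(x, expected)) {
    stop("expected class '", expected, "' but got '",
         paste(class(x), collapse = "/"), "'")
  }
  invisible(TRUE)
}

# Hypothetical data: times stored as a factor, which is what read.csv()
# produced for text columns by default in R 3.x.
incidents <- data.frame(Time = factor(c("13:10", "12:00", "11:50")))

class(incidents$Time)                  # "factor"
check_class(incidents$Time, "factor")  # passes silently
# check_class(incidents$Time, "POSIXct") would stop with an error
```

A check like this fails loudly at the point where the assumption is wrong, instead of letting a plot quietly come out looking odd later.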
First, let me read in the data.
##    Time
## 1 13:10
## 2 12:00
## 3 11:50
## 4 10:30
## 5  8:40
## 6  9:00
Now he wanted to plot these times.
plot(table(data$Time),main="December Incident Reports by Time of Day", xlab="Time of Day in Hours",ylab="Number of Reports",lwd=4,col=8)
[Figure: output of the plot() call above]
It runs. However, you will notice it does not look right. The problem is that Time is not of the right class and needs to be converted. Checking its class shows that it is a factor:
## [1] "factor"
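# Convert the factor to date-times (the as.data.frame(strptime(...)) wrapper is an assumed reconstruction of a truncated line)
data2 <- as.data.frame(strptime(data[,1], format="%H:%M"))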
hist(as.numeric(data2[,1]),main="December Incident Reports by Time of Day", xlab="Time of Day in Hours",ylab="Number of Reports")
[Figure: output of the hist() call above]
Now that looks way better. Now all that has to be done is the analysis.
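
The whole fix can be sketched end to end. This is an illustration under assumptions, not the original code: the data frame is rebuilt from the six rows printed above, the names parsed and hours are introduced here, and the conversion to fractional hours is a swapped-in step so that the x-axis matches the "Time of Day in Hours" label (as.numeric() on a date-time gives seconds since 1970, not hours).

```r
# Rebuild the six rows shown above; the factor column mimics what
# read.csv() produced by default in R 3.x.
data <- data.frame(Time = factor(c("13:10", "12:00", "11:50",
                                   "10:30", "8:40", "9:00")))
class(data$Time)  # "factor"

# Parse the labels as times. Convert to character first:
# as.numeric() on a factor would return the internal level codes.
parsed <- strptime(as.character(data$Time), format = "%H:%M")

# Fractional hours since midnight (hypothetical name: hours).
hours <- parsed$hour + parsed$min / 60

hist(hours,
     main = "December Incident Reports by Time of Day",
     xlab = "Time of Day in Hours",
     ylab = "Number of Reports")
```

Working in hours keeps the histogram bins on a scale a reader can interpret directly, which seconds since the epoch would not.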