Big Computing: April 2011

Wednesday, April 27, 2011

Cubist package for R on CRAN

Today an R port of Cubist was released to CRAN. It is another powerful tool among the R packages. Below is an excerpt from the vignette:

Cubist is a rule-based model that is an extension of Quinlan's M5 model tree. A tree is grown where
the terminal leaves contain linear regression models. These models are based on the predictors used
in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction
is made using the linear regression model at the terminal node of the tree, but is "smoothed" by
taking into account the prediction from the linear model in the previous node of the tree (which
also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths
from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for
simplification.

This is explained better in Quinlan (1992). Wang and Witten (1997) attempted to recreate this
model using a "rational reconstruction" of Quinlan (1992) that is the basis for the M5P model in
Weka (and the R package RWeka).
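
The smoothing step described in the excerpt can be sketched in a few lines of Python. This is only an illustration of the M5-style rule from Quinlan (1992), where the prediction p coming up from a child node with n training cases is blended with the parent model's prediction q as (n*p + k*q) / (n + k). The function name, the list layout and the default k = 15 are assumptions made for this sketch, and the Cubist package itself may well smooth differently.

```python
def smooth_up_tree(path, k=15.0):
    """Blend a leaf prediction with the models above it (M5-style sketch).

    `path` lists (n_cases, model_prediction) pairs from the leaf up to the
    root. Both the layout and the constant k are assumptions for illustration.
    """
    smoothed = path[0][1]  # start from the leaf model's prediction
    for level in range(1, len(path)):
        n_child = path[level - 1][0]   # training cases in the node below
        parent_pred = path[level][1]   # linear model prediction one level up
        smoothed = (n_child * smoothed + k * parent_pred) / (n_child + k)
    return smoothed


# Hypothetical numbers: a leaf with 15 cases predicting 10.0, under a parent
# whose model predicts 20.0.
print(smooth_up_tree([(15, 10.0), (100, 20.0)]))
```

A leaf with few training cases gets pulled strongly toward its parent, while a leaf with many cases mostly keeps its own prediction.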

Here is a good example of Cubist being used on visual data. Cubist was created by RuleQuest, and their GPL version of Cubist is available for download there.

Which Parallel for R?

There are many parallel backends for R/foreach, including doRedis, doNWS, doMPI and doSNOW, and the list goes on and on. Yesterday I even ran across a commercial HPC backend for R from a company called Zircon Computing, along with the very solid ParallelR package from Revolution Analytics. I think it is great that there are so many options to improve R's performance. Options can be great, but they can also be daunting.

Amy Szczparnski did a nice presentation comparing some of the options for running parallel processing in R at the Greater Boston useR Meetup, but this is an area where more comparisons of the pros and cons of the different options, and more benchmarking of the various methods, need to be done.

Hadley Wickham was kind enough to send me a paper written in 2009 about the State of the Art in Parallel Computing in R (Markus Schmidberger, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, Ulrich Mansmann), which compares 16 packages that I had not seen before.

Monday, April 25, 2011

Newbie R

It is generally accepted that R has one of the steepest learning curves of all the statistical platforms (SAS, SPSS, etc.). For a person new to the R environment, or someone trying to get the lay of the land, there are many ways to start. There are online tutorials and books; I used R in a Nutshell myself. However, more needs to be done to develop the basic skills of new R users. I have seen comments that a good IDE or GUI like RStudio can flatten out the learning curve. Another option is a training class, which many different companies offer. There are also the many local R user groups.

The R User Meetups are a great place to get started with R. Sometimes the presentations at the R meetups are directed more at the accomplished user than the Newbie. The Greater Boston Area useRs group has come up with a great idea on how to help the newer members to the R community.

At every meeting Jeffrey Breen has been doing an opening talk directed at the new R user. So far he has done:

Reshaping Data in R
Grouping & Summarizing Data in R
R plus 15 minutes = Hadoop Cluster

These talks and others like them are a critical piece of, and an important contribution to, building a large, competent user base for any open source tool. I am hopeful that other groups will give more of these basic talks and post them so that the R community can build a large repository of presentations on how to do the basics in R.

Friday, April 22, 2011

I got a golden R/Finance 2011!

I found a cheap plane ticket and a discounted room, so I am on my way to Chicago next week for R/Finance 2011. Friday's lineup includes Michael Kane on FINRA regulation, a week before he presents Esperr at the Boston useR group. He is followed closely by the Ambassador of Cool at Large, Bryan Lewis, presenting the Betfair package. The Cerebral Mastication man himself, J. D. Long, winner of the best blog name in the R community, closes out the first day with a talk on Segue for R.

The next day may be an even stronger group. I am especially looking forward to the parallel benchmarking presentation from Niemenmaa. It should be a great weekend, topped off by Sunday, May 1, which is my birthday.

Thursday, April 21, 2011

1918 Cubs have the real Curse from 1918

So I guess it is not even a new story that the 1918 Cubs may have thrown the World Series. I had wondered in a previous article whether predictive analytics would have flagged a fixed game as an outlier. The great thing about baseball is that there is so much data and people have studied it all. The answer seems to be that there is nothing in the overall performance or the individual numbers of the Cubs players that indicates a problem. I guess there were a couple of plays that were questionable, but most believe they are not enough to prove a fixed Series. One of the players involved is pitcher Phil Douglas, who was later banned from baseball for life in 1922 for seeking to get paid to lose the pennant. I also love that Douglas was one of 17 pitchers allowed to continue throwing the spitball after it was banned in 1920. The other player is Max Flack, who is better known for being traded to the other team after the first game of a doubleheader.

I took a brief look at the box scores yesterday, and I agree there is nothing that stands out. It looks like a good, low-scoring series. I do not believe these Cubs would have gotten caught by the numbers even now if they did in fact throw the Series. The funny thing is that this is the last World Series the Red Sox won before "the Curse" that resulted from trading Babe Ruth to the Yankees. Given that the Cubs still have not won a World Series since they lost in 1918, maybe there is karma in baseball.

Wednesday, April 20, 2011

Packbots are the Brain Child of a great guy

When I was a sophomore in college I could not find an apartment, so a friend of mine let me stay in his apartment for most of the first term. The same guy helped me write a forecasting program that was my senior project two years later. He was a great guy who was always there. His name is Todd Pack. When I met him, the only thing he wanted to do was build robots. He spent much of his free time either writing code on his computer (named "Bear") or building robots in the lab. It is what he loved to do.

Todd got his PhD out of Vanderbilt with a thesis on IMA, and took a job at iRobot. Yup, they make the Roomba vacuum cleaner. It is my understanding the reason the Roomba was created was because in the early 1990s there were no government contracts for robots so they needed a consumer product to survive.

I haven't seen or heard from Todd in a number of years. Then about a month ago I went to a talk by Yann LeCun on vision and learning algorithms for robots. The videos of the robots reminded me of the work Todd Pack did in the labs back at Cornell all those years ago. Yesterday I read an article saying that the US was sending robots to Japan to help out at the Fukushima nuclear power plant (article). It turns out the robots come from Dr. Pack's company, iRobot, and I am sure he had a hand in their design. I am so happy that Todd has created a career doing what he loves, and his work absolutely saves lives. I have my guess why they call them Packbots, but it is only a guess.

Now if we could only make robots that do not look like Johnny number 5.

Here is a recent video on MSNBC featuring iRobot's Packbots.

Tuesday, April 19, 2011

Because it is Passover. Moses meets the internet

My daughter's teacher showed this video to me this morning during a parent teacher meeting. I couldn't stop laughing! Enjoy...

Google Exodus

Home Field Advantage

With the Red Sox having their only wins so far this year at home, along with their reputation for being tooled to perform better at Fenway, I decided to take a quick look at American League teams' home versus away wins in the Red Sox Bill James era (2003-2010). Listed below are the differences between home wins and away wins for each team per year in that period, along with the corresponding mean and standard deviation.

Team 2010 2009 2008 2007 2006 2005 2004 2003 Mean STD
Tampa Bay 2 20 17 8 21 13 12 19 14.00 6.59
Boston 3 17 17 6 10 13 12 11 11.13 4.88
NY Yankees 11 11 17 10 3 11 13 -1 9.38 5.71
Toronto 5 13 18 15 13 6 13 -4 9.88 7.10
Baltimore 8 14 7 1 10 -2 -2 9 5.63 5.93
Chi White Sox 2 7 19 4 8 -5 19 16 8.75 8.65
Minnesota 12 11 18 3 12 7 6 6 9.38 4.78
Cleveland 7 5 19 6 10 -7 8 8 7.00 7.13
Kansas City 9 16 1 1 6 7 8 -3 5.63 5.90
Detroit 23 16 6 2 -3 12 4 3 7.88 8.51
LA Angels 6 1 0 14 1 3 -2 13 4.50 6.02
Texas 12 9 1 19 -2 7 13 15 9.25 7.07
Oakland 6 11 11 4 5 -3 13 18 8.13 6.47
Seattle 9 5 9 10 10 9 13 7 9.00 2.33
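
For anyone who wants to check the last two columns, here is a minimal Python sketch that recomputes them for two rows copied from the table. It assumes the STD column is the sample standard deviation (dividing by n - 1), which is what appears to match the figures above; the dictionary and function names are just illustrative.

```python
from statistics import mean, stdev

# Home wins minus away wins, 2010 back to 2003, copied from the table.
diffs = {
    "Tampa Bay": [2, 20, 17, 8, 21, 13, 12, 19],
    "Boston": [3, 17, 17, 6, 10, 13, 12, 11],
}


def summarize(values):
    """Return (mean, sample standard deviation) for one team's row."""
    return mean(values), stdev(values)


for team, values in diffs.items():
    avg, sd = summarize(values)
    print(f"{team}: mean {avg:.3f}, std {sd:.3f}")
```

The other rows can be added to the dictionary in the same way.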

It is no surprise that the Red Sox have a strong Home/Away Wins record (2nd), but I was surprised that the Tampa Bay Devil Rays were even better in this statistic. Also note how dominant the AL East is in Home Wins versus Away Wins.

Still, for the Red Sox in 2011, having all of your wins at home is bad.

Sunday, April 17, 2011

What to do when another Blogger writes about a topic better than you

The answer is to re-post. I follow the Falkenblog because he is a fun read, and he has an interesting history. He did a post about the percentage of male children question I posted earlier. The comments he got were great. In the post he also includes a link to Steve Landsburg's blog, where Landsburg gives his approach to a solution along with the approaches and solutions of some other people. I first heard this problem in college from Joe Mitchell when I was at Cornell. Joe is at SUNY Stony Brook now. That was the same class where I first heard of the Monty Hall problem, Simpson's Paradox and the Inspection Paradox. I will always fondly remember the only class I actually attended.

It was funny to me that these conversations always go to abstract math with such ease and speed. There was post after post about what happens as the number of children born to a family approaches infinity, but there was no discussion of the fact that no family can have an infinite number of kids, and that the process should be truncated with a certain probability that the family just stops after each child. Our models will only be as good as the rules we create to accurately reflect the reality we are trying to model.
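
A truncated version of the problem is easy to try out. The Python sketch below simulates families that stop after their first boy, or that simply give up after any girl with some probability. The 30% quit probability, the function name and the seed are all assumptions picked for illustration, not numbers from any of the posts.

```python
import random


def boy_share(families=100_000, p_boy=0.5, p_quit=0.3, seed=1):
    """Simulate 'stop at the first boy' with families that may also give up.

    p_quit is a hypothetical chance that a family stops after any girl.
    Returns the fraction of all children born who are boys.
    """
    rng = random.Random(seed)
    boys = girls = 0
    for _ in range(families):
        while True:
            if rng.random() < p_boy:
                boys += 1
                break  # the decree: no more babies after a boy
            girls += 1
            if rng.random() < p_quit:
                break  # the family decides it is done
    return boys / (boys + girls)


print(boy_share())
```

Under these assumptions the share of boys should still land close to one half, because every birth is an independent coin flip no matter when a family stops.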

Saturday, April 16, 2011

doRedis: A parallel back end for R/foreach using Redis.

Bryan Lewis recently wrote a vignette using RStudio to run some financial data through R/foreach and doRedis. Parallelizing R continues to become more important as users want to run computations on larger and larger quantities of data in reasonable amounts of time, particularly on clusters or in the cloud. doRedis includes the following features:

  • Support for dynamic pools of parallel workers during running computations.
  • Simple cross-platform parallel computing, including at least Windows, GNU/Linux and OS X.
  • Fault-tolerant 
doRedis Vignette


Friday, April 15, 2011

Would we have found out about San Diego Basketball through Analytics?

This week the University of San Diego basketball program had a number of its players arrested for shaving points in a game in February of 2010 and trying to do the same thing in a game against UCR in February of 2011 (News Story). It appears that these guys were caught by human intelligence. This type of cheating is bad for both the sports organizations (NCAA, NFL, NBA, MLB) and the major betting community. It is bad for the sports organizations because who, except WWE fans, is going to watch a fixed match? It is bad for the betting community because they really make their money on the juice they charge gamblers. The betting community wants fair games that split the money evenly on either side of the spread. Anything that shifts that is a problem for their business model. People believing that games are fixed could also reduce the amount of money bet on games, which is bad for the bookies too.

In the book Freakonomics, Levitt and Dubner expose match fixing in sumo using statistical analysis. It was a fun read and showed that analytics has the ability to expose cheating in sports without human intelligence. It was also a safe sport to look at, because Americans do not really care about sumo, nor do they bet on it.

It is interesting to me that so much data and analysis exists with the goal of prediction in sports, but I have found nothing on using predictive analytics to discover point shaving or game fixing. I realize that doing this kind of work in team sports would be more complicated than in something like sumo. I just feel that it might be another tool to add to the effort to deter this kind of problem. At first pass, it seems the most likely sign of potential cheating is when a player's statistics in a game are an outlier and the team did not beat the spread. The problem that I see is the sparsity of known point shaving in games. For example, I only know of one allegedly fixed game in the 2010 NCAA basketball season out of something like 5,000 games. I do not think that is enough to be useful. Sad to say, if there were more fixed games we might be able to build a better model. Someone suggested to me that I would get better data on game fixing if I looked at Italian football. However, I call it soccer, and I care about Italian football about as much as I do about sumo.

So I have no data to present or model to put forth. I just hate cheaters, and this story has bothered me since it came out on April 11th.

Tuesday, April 12, 2011

Analytics, Sabermetrics, Data Mining...Why can't we all just get along?

Sabermetrics is a term coined by Bill James to describe the analysis of baseball through objective evidence. Saber, or more accurately SABR, stands for the Society for American Baseball Research. With Bill James as its advocate, Sabermetrics has changed the way baseball is played, no easy task in a sport so encumbered by tradition. Baseball probably collects more data during a game than any other sport, and each team plays at least 162 games a year. That is rich data territory compared to the 16 regular season games played in the NFL. Sabermetrics has taken a hard look at the core beliefs about which statistics make a good baseball player or team and run them against the cold judgement of analytics. The results showed that some previously treasured statistics like batting average were not as important as once thought, while others like on-base percentage were better indicators. This is predictive analytics at its best. So it is time to call Sabermetrics what it is: analytics.

It is funny that, for all the impact Sabermetrics has had on baseball, I believe it is still limited by the traditions of baseball. Let me give you some examples.

The blog Sabermetric Research talks about Buck Showalter changing the way his base runners play to gain 5 runs per year, which he claims is worth $10 million. That makes sense if the data he is using is good, but the key here is that the decision is claimed to be made solely on the numbers.

Pitching is another story. In baseball a starting pitcher must pitch five full innings in order to earn a win. Many talk about the difference between the ERAs of starting and relief pitchers. The data clearly shows that relief pitchers, even when they are the same person, have an overall ERA that is lower than that of starting pitchers by .50 or more. Tango on Baseball touches on the subject in this article. My question is this: if relief pitchers have a better ERA than starting pitchers, and starters are generally accepted to be better pitchers than relievers, why aren't starters being used like relievers? The impact would be huge! A quick pass says this .50 ERA reduction for starting pitchers would result in 40 fewer runs allowed by a team over the course of a season! Using Showalter math, that is $80 million. I believe the reason this is not looked at as a solution is tradition. If starting pitchers were used like relievers, they would never pitch 5 innings and therefore would never get a win. This would be a fundamental change in the way baseball is played.
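
The back-of-the-envelope math can be written out as a small Python sketch. ERA is earned runs per nine innings, so a drop in ERA converts to runs once you pick a number of innings. The 720 starter innings below is a hypothetical figure chosen only because it reproduces the 40-run estimate; the dollars-per-run value comes from the Showalter claim of 5 runs being worth $10 million.

```python
def runs_saved(era_drop, innings):
    """Earned runs saved when ERA falls by era_drop over the given innings."""
    return era_drop * innings / 9.0


def dollar_value(runs, dollars_per_run):
    """Convert runs to dollars at a fixed price per run."""
    return runs * dollars_per_run


# From the Showalter claim: 5 runs are said to be worth $10 million.
DOLLARS_PER_RUN = 10_000_000 / 5

# Assumption: 720 starter innings per season (hypothetical, picked so the
# result lines up with the 40-run estimate).
STARTER_INNINGS = 720

saved = runs_saved(0.50, STARTER_INNINGS)
print(saved, dollar_value(saved, DOLLARS_PER_RUN))
```

Plugging in a different innings total changes the answer in direct proportion, so the estimate is only as good as that assumption.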

In defense of Sabermetricians, there has been some discussion that ERA, like BA, is not a very useful statistic. This would mean that conclusions drawn from those statistics may not be as useful as they appear. I have not seen anything on starters versus relievers in terms of CERA, dERA, DICE or DIPS.

Monday, April 11, 2011

The Most Boring day in the Last hundred Years

I was driving in to work this morning and listening to the radio because I feel that distractions make me a better driver. A news story came on telling me that a Cambridge researcher had found that April 11, 1954 is the most boring day in the last 100+ years. I had a good laugh thinking that it isn't only the US government that gives out silly grants for pointless research (remember the "which came first, the chicken or the egg" paper).

So I dug a little deeper. It turns out this is a software guy who just launched his "smart" search engine. I could not find any specifics on how the engine works, which would have been cool, but then I thought this is maybe even cooler than another algorithm. Here is a guy who got his new web site rolled out as a news story in three major newspapers and on NPR just by asking an interesting question about the most boring day. It would be really amazing if the search engine gave him the question after he queried "the most likely answer to a question to be picked up by major news organizations".

The Most Boring Day Article
True Knowledge
Tunstall-Pedoe and his search engine

Saturday, April 9, 2011

A Fun Problem to play with....

I am always on the lookout for interesting brain teasers, and this is a fun one. It comes from a blog by Phil Birnbaum which I found a few days ago and I just love. His Sabermetrics stuff is awesome!

A king decides that his country has too many men and not enough women. So he issues a decree: once a couple has a boy, they're not allowed to have any more babies.

The king reasons as follows: no family will have more than one boy. But some families will have two girls, or three girls, or even six girls before they have a boy. So there will wind up being a lot more girls than boys.

Is the king's reasoning correct?
The answer: the king's reasoning is not correct. The population will still approach 50 percent boys and 50 percent girls. There are many ways to figure this out.

I believe this assumes that the chances of having a boy or a girl are equally likely and independent.
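
One of those ways is to add up the expected number of boys and girls per family directly. The Python sketch below does that sum for the "stop at the first boy" rule, under the same assumption of independent births. The function name and the cutoff of 200 girls are only there for illustration (the remaining tail is vanishingly small).

```python
def expected_children(p_boy=0.5, max_girls=200):
    """Expected boys and girls per family under 'stop at the first boy'.

    Sums over families that have 0, 1, 2, ... girls before their boy.
    max_girls is an arbitrary cutoff for the infinite series.
    """
    exp_boys = 0.0
    exp_girls = 0.0
    for girls in range(max_girls + 1):
        prob = (1 - p_boy) ** girls * p_boy  # that many girls, then a boy
        exp_boys += prob          # exactly one boy in this outcome
        exp_girls += prob * girls
    return exp_boys, exp_girls


boys, girls = expected_children()
print(boys, girls, boys / (boys + girls))
```

With equally likely births the two expectations come out the same, which is why the decree does not shift the ratio.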


Thursday, April 7, 2011

What Operating Systems do R users use?

I went to the Greater Boston R meetup last night, where the RStudio guys did a presentation on their IDE. It was well attended, and Josh and JJ did a great job.

At the end of the talk JJ said that the breakdown of operating systems for the RStudio downloads was roughly 60% Windows, 30% Mac and 10% Linux. Frankly, I was surprised by those numbers. I always felt that the majority of people who use open source R would run it on an open source operating system. RStudio's results would seem to strongly counter that thought. After the talk, I remembered a survey done by KDnuggets in September of 2007 on which operating systems people used for their analytics work. It also showed about 60% Windows users, but 30% Linux and 10% Mac. I believe this suggests an increase in the number of people using Macs to do analytics in the last four years.

I am not sure that data from downloads of an IDE, or a KDnuggets survey of operating systems, is a good indicator of the relative popularity of the various operating systems used by the R community. I do believe it shows the importance of supporting Windows, which consistently shows up as the dominant platform both among the people surveyed and among downloads of at least RStudio. It also shows that if you make an IDE that is not supported on the Mac, you could be missing 30% of the potential users.

The recent KDnuggets poll on which R interfaces people use was also interesting. For example, I would not have expected so many people to respond that they use Tinn-R or R Commander, but they do. I have always enjoyed the KDnuggets surveys.

KDnuggets Operating System survey
KDnuggets R interface poll

Monday, April 4, 2011

Why are the Red Sox better today? Sabermetrics or Construction?

I saw an article this morning from an MIT professor that predicted the Red Sox would win 100 games this year. That is a pretty bold statement, since the Red Sox have only won 100 games in a season three times (1912, 1915 and 1946). However, it got me wondering how the Red Sox have become so good in recent history. I often hear comments claiming that it is the payroll or the genius of Theo Epstein. Whenever I am with statistics guys, it is the hiring of Bill James and the use of Sabermetrics that made the difference. I have a third theory to put forth as the major reason for the improvement of the Red Sox in recent history: construction at Fenway. Oddly, this started the same year that Bill James was hired by the Red Sox, 2003.

From 1995 to 2002 the Red Sox had a combined record of 695-582 winning 54.42% of their games. From 2003 to 2010 the Red Sox had a combined record of 749-547 winning 57.79% of their games.

Year W L Winning % Year W L Winning %
2010 89 73 54.94% 2002 93 69 57.41%
2009 95 67 58.64% 2001 82 79 50.93%
2008 95 67 58.64% 2000 85 77 52.47%
2007 96 66 59.26% 1999 94 68 58.02%
2006 86 76 53.09% 1998 92 70 56.79%
2005 95 67 58.64% 1997 78 84 48.15%
2004 98 64 60.49% 1996 85 77 52.47%
2003 95 67 58.64% 1995 86 58 59.72%
Total 749 547 57.79% Total 695 582 54.42%

So they got better after 2003, Theo is a genius, and Sabermetrics rules baseball. I am not so sure, and I think we reach those numbers through something like a Simpson's paradox. Let me explain. If Sabermetrics had been the driving reason for the improvement of the Red Sox, they would have gotten better not only at home but away as well. They did not. In fact, the Red Sox improved massively at home but got worse on the road. So what is the factor that explains this? In 2003, the same year Bill James was hired by the Red Sox, additional seating was added to Fenway Park for the first time since 1946. It was always known that Fenway was helpful to certain types of hitters and pitchers, and Red Sox teams have always emphasized those players. I believe that the construction made the park even more biased than it was before.

During the period 1995 to 2002 the Red Sox had a better away record than they did from 2003-2010.

Away Record
W L % W L %
2010 40 41 49.38% 2002 51 30 62.96%
2009 39 42 48.15% 2001 41 39 51.25%
2008 39 42 48.15% 2000 43 38 53.09%
2007 40 41 49.38% 1999 45 36 55.56%
2006 35 46 43.21% 1998 41 40 50.62%
2005 41 40 50.62% 1997 39 42 48.15%
2004 43 38 53.09% 1996 38 43 46.91%
2003 42 39 51.85% 1995 43 28 60.56%
total 319 329 49.23% 341 296 53.53%

For home games it is a very different story:

Home Record
W L % W L %
2010 49 32 60.49% 2002 42 39 51.85%
2009 56 25 69.14% 2001 41 40 50.62%
2008 56 25 69.14% 2000 42 39 51.85%
2007 56 25 69.14% 1999 49 32 60.49%
2006 51 30 62.96% 1998 51 30 62.96%
2005 54 27 66.67% 1997 39 42 48.15%
2004 55 26 67.90% 1996 47 34 58.02%
2003 53 28 65.43% 1995 43 30 58.90%
total 430 218 66.36% 354 286 55.31%
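
The home and away totals can be cross-checked against the overall record with a few lines of Python. This sketch just copies the totals from the tables above and recomputes the percentages; the dictionary layout and function names are only for illustration.

```python
# (wins, losses) totals copied from the home and away tables.
records = {
    "1995-2002": {"home": (354, 286), "away": (341, 296)},
    "2003-2010": {"home": (430, 218), "away": (319, 329)},
}


def win_pct(wins, losses):
    """Winning percentage on a 0-100 scale."""
    return 100.0 * wins / (wins + losses)


def overall(split):
    """Add the home and away records into one (wins, losses) pair."""
    wins = split["home"][0] + split["away"][0]
    losses = split["home"][1] + split["away"][1]
    return wins, losses


for era, split in records.items():
    total = overall(split)
    print(
        era,
        f"home {win_pct(*split['home']):.2f}%",
        f"away {win_pct(*split['away']):.2f}%",
        f"overall {win_pct(*total):.2f}%",
    )
```

The home and away records add up to the overall totals in both periods, so the gain after 2003 comes entirely from the home column.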

MIT economist says Red Sox will win 100 games in 2011