Big Computing: May 2011

Tuesday, May 31, 2011

Bryan Lewis's Vignette on IRLBA for SVD in R

The Implicitly Restarted Lanczos Bidiagonalization Algorithm (IRLBA) of Jim Baglama and Lothar Reichel is a state-of-the-art method for computing a few singular vectors and corresponding singular values of huge matrices.

The IRLBA package is the R language implementation of the method. With it, you can compute partial SVDs and principal component analyses of very large scale data. The package works well with sparse matrices and with other matrix classes like those provided by the Bigmemory package.
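To make "a few singular vectors and corresponding singular values" concrete, here is a minimal Python sketch. It is not IRLBA and it is not the API of the R package; it is plain power iteration on A^T A, which only estimates the single largest singular triplet. The function name `top_singular_triplet` and the small example matrix are made up for illustration.

```python
import math


def matvec(a, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def rmatvec(a, y):
    """Compute A^T y without building the transpose."""
    out = [0.0] * len(a[0])
    for row, yi in zip(a, y):
        for j, aij in enumerate(row):
            out[j] += aij * yi
    return out


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def top_singular_triplet(a, iters=200):
    """Estimate (u, sigma, v) for the largest singular value of a.

    Hypothetical helper: power iteration on A^T A. It assumes a is nonzero
    and that the starting vector is not orthogonal to the leading right
    singular vector.
    """
    n = len(a[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = rmatvec(a, matvec(a, v))
        nw = norm(w)
        v = [x / nw for x in w]
    av = matvec(a, v)
    sigma = norm(av)
    u = [x / sigma for x in av]
    return u, sigma, v


# Illustrative 3 x 2 matrix (made-up numbers, not the Netflix data).
u, sigma, v = top_singular_triplet([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(round(sigma, 4))
```

The sketch only touches the matrix through the products A x and A^T y, which is why methods in this family suit sparse and very large matrices. A restarted Lanczos method like IRLBA goes further by building a small basis so that several singular triplets converge together.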

In the video vignette linked below, Bryan, with a new microphone, goes through an example using this package on the Netflix Prize data set (480K rows by 18K columns). Competitions like the Netflix Prize and the Kaggle.com competitions have really brought powerful tools like SVD into greater use.

Video vignette on IRLBA using the Netflix Prize data set.

Monday, May 30, 2011

Memorial Day is to remember the fallen

America has lost around 1,500,000 soldiers in its wars over the last 250 years. Over time technology has replaced lives in battles, but people still die and suffer to forward the policies and ideals of our country. I took a quick look at the deaths of soldiers relative to the overall US population to try to quantify the impact of war on the general population. I think it is interesting that the wars we most widely discuss are those with the highest percentage of deaths relative to overall population. Hopefully the trend of fewer paying the price will continue.

War	Deaths	US Population	% of population
Revolution	25000	3929214	0.636%
1812	20000	7329881	0.273%
Mexican	13283	12866020	0.103%
Civil	625000	31443321	1.988%
WWI	116516	91972266	0.127%
WWII	405349	131669275	0.308%
Korean	53686	150697361	0.036%
Vietnam	58206	203302031	0.029%
Terror	5716	281421906	0.002%
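The last column of the table is simply deaths divided by population, expressed as a percentage. A minimal Python sketch of that arithmetic, using the figures exactly as listed above (the function name `death_share` is just an illustrative choice):

```python
# (war, deaths, US population) copied from the table above.
WARS = [
    ("Revolution", 25000, 3929214),
    ("1812", 20000, 7329881),
    ("Mexican", 13283, 12866020),
    ("Civil", 625000, 31443321),
    ("WWI", 116516, 91972266),
    ("WWII", 405349, 131669275),
    ("Korean", 53686, 150697361),
    ("Vietnam", 58206, 203302031),
    ("Terror", 5716, 281421906),
]


def death_share(deaths, population):
    """Deaths as a percentage of the population figure used for that war."""
    return 100.0 * deaths / population


for name, deaths, population in WARS:
    print(f"{name}\t{death_share(deaths, population):.3f}%")

# The war with the largest share of the population lost.
worst = max(WARS, key=lambda w: death_share(w[1], w[2]))
print("Highest share:", worst[0])
```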

Saturday, May 28, 2011

R/Finance 2011 Presentations now Online

I was at the R/Finance 2011 in Chicago at the end of April. It was a great conference and very reasonable at only $200. I would strongly encourage anyone in the field of finance or analytics in general to attend. I know I will be back next year!

The PDFs of the 2011 presentations are here.

Friday, May 27, 2011

UseR T-shirt design contest

It seems only right that a software platform with thousands of packages should have a substantial wardrobe to go along with it. In a continuing effort to increase the number of clothing items related to the R language, the 2011 UseR organizing committee along with Mango Solutions have released a T-shirt design competition.

I am not sure when the R themed clothing started, but we did a couple of T-shirts back at Revolution Computing. They mostly featured fractals and the R logo, but they were black and therefore cool.

The first really great R shirt I ever saw was from Drew Conway and the NYC Data Mafia. I think it still may be the best out there.

REvolution Analytics recently came up with the I Love R T-shirt (I cannot find an image). I first saw them at the R/Finance meeting in Chicago. Their 70s retro look seemed to be very popular. In fact, a recent Twitter contest attracted over a thousand crazed statisticians in pursuit of five T-shirts.

Always trying to copy a good idea, I came up with a design of my own and a Twitter contest as well. Sadly, while Revolution added followers, my social media effort actually resulted in a reduction in followers. See my failed design below.

Never one to give up after crashing into the ground in a ball of flames, I have decided to enter the fray once again with my new R package and T-shirt design. I call it R-apture. R-apture is a Twitter data mining package that scans for posts that would eliminate the sender from being called to Heaven and then sends a comment back to that tweeter that they are screwed. The T-shirt design is simple, with the words "I am getting R-aptured on ~~May 21~~ October 21" (the R is an R-project R and the T is in the shape of a cross). I am sure this one will work better than the last.

Thursday, May 26, 2011

Using Social Media to predict Macroeconomic trends

While I am very leery of using Twitter to predict economic trends, particularly in time to anticipate trading trends in the financial markets, I do believe it may be useful in terms of predicting macroeconomic trends.

Google introduced this idea in 2009 with a paper by Hal Varian. This work was based on the mining of Google searches, and it proved to be pretty effective. I believe similar or maybe more accurate results can be obtained by mining social media. This is because many tweets and posts are richer than simple Google searches. If you are looking for a good R-based Twitter client, try twitteR. The same author of that package wrote an R-based Stack Exchange client called RStackExchange. RStackExchange may provide some foreshadowing of technology and software trends.

Wednesday, May 25, 2011

Cleveland Indians are better than Sabermetricians Predicted

When I was at the Sabermetric Seminar in Boston, the Indians' success in the first quarter of the year was a topic of discussion. The explanation given by Tom Tippett was that the Indians were overperforming against the model and would, over the course of the season, return to their expectation. In support of that, an expected run chart was put up showing the Indians with the greatest positive actual run differential versus expected run differential. The Red Sox were underperforming with respect to this measure.

While I understand there will always be statistical anomalies and periodic straying from the mean, I am not so sure that this is the case here. Modelers have a tendency to explain away differences between reality and their models as variation. While that may and will be the case sometimes, for a three standard deviation outlier we are talking about a 3 in 1,000 chance. Rather than take that bet I would check to see if my model failed to take something into account. In the case of the Indians' improvement, I would be more likely to look for shortcomings in my model because the Indians are a Sabermetric driven team and the guy who runs their analytics is a very talented guy. Teams do not share their models so there is no way of knowing if the various models are similar or even what input data they use. A general impression from the Sabermetric conference is that Sabermetricians do a lot of regression to the league mean which will smooth out the data, but may also underemphasize relevant data.
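The "3 in 1,000" figure can be checked with a few lines of Python, under the assumption (mine, for illustration) that the model errors are normally distributed and that a miss in either direction counts. The function name `two_sided_tail` is hypothetical.

```python
import math


def two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z, via the complementary error function."""
    return math.erfc(z / math.sqrt(2.0))


p = two_sided_tail(3.0)
print(f"Chance of a 3 standard deviation outlier: {p:.4f}")
print(f"Roughly {p * 1000:.1f} in 1,000")
```

If only outcomes on the high side count, the chance is half of this, which makes "it is just variation" an even less comfortable bet.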

I believe even a quick look at high level data for the Indians suggests their performance is not a wandering away from the mean but a shift in the mean. Most of the difference in 2011 can be attributed to the 233 runs scored in 46 games, or about 5 runs per game compared to 4 runs per game in 2010. This can be explained because most sabermetric models fail to incorporate injuries, which were a factor in 2010 for the Indians and would negatively affect their run prediction in 2011. A lack of injury prediction and weighting due to past injuries in Sabermetric models is a major disconnect in Sabermetrics and needs to be addressed. The healthcare industry has made great strides in this area in recent history with the use of ensemble methods.

Tuesday, May 24, 2011

Vincent Carey on Tap for the Next Boston R Users Group

The next Greater Boston useR Group is on June 1. The Meetup will feature two speakers. The introductory speaker is our own John Muller of State Street Capital who will give a short talk on Time Series Analysis in R. This will be followed by Vincent Carey.

Dr. Carey is an Associate Professor of Medicine (Biostatistics) at Harvard Medical School and is a co-founder and core member of bioconductor.org. Bioconductor defines software architecture for various high-throughput experimental platforms, with particular attention to quality control, annotation, and flexible interface to statistical inference tools. The portal now contains hundreds of contributions from research centers all over the world. Bioconductor is one of the most widely used tools in Pharma, where at least half of the consulting work I have done has included Bioconductor. Robert Gentleman once estimated that 5% of all R users are regular users of Bioconductor. Considering that there are over 3000 different R packages, that is an astounding percentage. Dr. Carey is also an associate editor of the Journal of Statistical Software and the R-news and the Senior Statistician with the Pediatric AIDS Clinical Trials Group.

His talk is titled "Exploring genetics of gene expression with R/Bioconductor". If you are near Boston and have an interest in R or Biostatistics you do not want to miss this talk by one of the thought leaders in these areas.

Monday, May 23, 2011

Sabermetrics Seminar

I went to the Sabermetrics Seminar at Harvard this weekend. It was a charity event, and all the speakers came and talked on their own dime. I just want to thank those speakers for giving up their time for such a great cause.

The Seminar itself was an eye opening experience for me. The last seminar I went to was the R/Finance in Chicago. That Seminar, like most that I go to, is for hard core statisticians and computer scientists. I believe of the hundreds of attendees at R/Finance I am one of the few without a PhD. The presentations, with the possible exception of JD Long's honoring of Dr. Seuss, were of a highly technical level. The Sabermetrics Seminar was totally different. The audience varied from the Head of the Harvard Statistics Department and an eminent physicist to people with very limited mathematical education. The presentations also ran the gamut from something that would be taught in a high school physics class to some fairly high level stuff. The great unifier in the room was that these people loved baseball and were using mathematics to expand their understanding of the game and increase their enjoyment. One speaker, Dan Duquette, former GM of the Boston Red Sox, reminded us of the words of Felipe Alou to "remember to enjoy the game". Tom Tippett, Director of Baseball Information Systems, gave a great Q&A on the state of Sabermetrics in MLB today. I have included a link to a summary of the seminar here.

Sabermetrics is different than the other fields I work in. In Pharma, the models are widely shared, but the data is highly confidential. In Finance the models are confidential, but the data is basically public. MLB analysts seem to strongly guard both their models and their data, viewing both as proprietary. While I think this makes it a great opportunity for consulting, I believe it may hinder the rate of refinement. Kaggle has shown in a very public way that open collaboration on data and models yields astounding improvements in prediction.

Saturday, May 21, 2011

Post Rapture Pet Care

When does this Rapture thing run its course? If the end of the World is coming and all good Christians get saved, I am not sure the earth or what is left of it is that important. Then in the car on the way home from Boston I heard an interview with a guy who runs a Post Rapture pet care company which will take care of your pet after the Rapture for $135. I thought it was a joke. Silly me, it turns out it is real and there are lots of companies that offer this. If I am going to heaven and the world is ending my biggest concern is not Fido. I love animals, but come on.

So looking at all these websites that offer this service, I found one with a video. I have no option but to post it. So here is Post Rapture Pet Care.

Friday, May 20, 2011

It is a Sabermetrics Weekend so today's post is Sabermetrics

This weekend I am going to the Sabermetrics Seminar in Boston. Some might think that it is strange that I am excited about this given that I never played baseball, and I do not watch many games. However, the analytics being done in baseball is developing and expanding at such a rapid pace there is no way you can enjoy analytics and not be interested. Recently a friend of mine ran into Prof. Bertsimas and asked if he could have a copy of the now famous paper that he wrote predicting the Red Sox would win 100 games this year. Prof. Bertsimas asked "are you a fan of baseball?" to which he responded "No, I am a fan of statistics".

The development of Sabermetrics in the last 30 years has been to look at existing data and try to build predictive models out of that data. It was a good first step and produced some good results. This work revealed that some of the historical statistics, like ERA, were not good predictors of anything so Sabermetricians created statistics that were better predictors. This is all great, and it has taken Sabermetrics to where it is today.

The problem with the data that has been used to date in baseball is that it is all result based data. The pitcher threw a strike or a ball, the batter got on base, etc. That is all changing. Welcome to the world of physical data in baseball. This post on Beyond the Boxscore is a good example. It traces the improvement in a player's performance back to the physical location of his pitches: not just that more of his pitches resulted in ground balls, but an attempt to answer why, based on data, not opinion. The technology exists not only to track data of a baseball as it crosses the plate but within the entire ballpark. First, this is going to create an unbelievable amount of data that needs to be studied in ways not currently used in baseball because of sheer volume. Second, this data is collected in real time, which means the models could be updated in real time. Billy Beane may have had his 3x5 note card in front of him, but the manager of the future may be holding his iPad with feedback on up to the last pitch and the suggested options with predicted results of those options.

One of the companies doing this physical data collection in baseball is Trackman. They also recently posted for an R developer. I can not wait to see what is coming!

Wednesday, May 18, 2011

Trading based on Twitter sentiment. Pour me a double!

Alright, it happened yesterday: someone finally got drunk enough to start a hedge fund based on Twitter sentiment. I love the article I linked because it says there is a connection between the Twitter sentiment and the stock market, but no one understands what that relationship is, and in some sense it does not really matter. Really!?! Understanding relationships is the key, and is particularly important to preventing disaster if those two things ever become unlinked. For example, two cars follow each other on a road for many miles. We might then build a model that shows one car always follows the other. Later those cars get to an intersection and one turns left and one turns right. Model disaster! The same thing is the problem with social sentiment models. At R/Finance 2011 Rothermich presented a paper where he built a trading model on sentiment from the danceability of the most popular songs in various cities. It worked great. Maybe he should set up a fund as well. Hey, you may lose money, but at least you can dance to it.
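The two-car story can be put into numbers. The Python sketch below uses made-up positions (not market or Twitter data): for the first ten time steps the second car trails the first by one unit, then the two diverge at the intersection. The names `pearson`, `lead`, `follow` and `gap_model_error` are hypothetical, chosen only for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Made-up positions: the cars travel together for t < 10, then diverge.
lead = [float(t) for t in range(20)]
follow = [t - 1.0 if t < 10 else 17.0 - t for t in range(20)]


def gap_model_error(t):
    """Error of the naive model 'the follower is always one unit behind'."""
    return abs((lead[t] - 1.0) - follow[t])


print("Correlation on the shared road:", pearson(lead[:10], follow[:10]))
print("Correlation after the intersection:", pearson(lead[10:], follow[10:]))
print("Model error at the last step:", gap_model_error(19))
```

A model fitted only on the shared stretch of road looks perfect in-sample and tells you nothing about what happens once the mechanism linking the two series is gone.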

Then I came across this great article that points out the potential for fraud in social networks and the effect on trading. Awesome article! We often forget how many people on twitter are not people at all, but companies, PR firms, criminals and bots. Sentiment on Twitter can be and is managed in some cases.

This is a bad trading idea that is a great marketing idea. People will invest their money, and they will well compensate the administrators of the hedge funds as they lose money. I am just a small voice in the social media world drowned out by other sentiment. Besides there will never be an internet bust, derivatives are good for pension funds, housing will never collapse and yes Virginia there is a Santa Claus.

Tuesday, May 17, 2011

All great ideas will be copied

I am not sure when they actually started doing predictive analytics competitions, but in the last year I do not think a day has gone by without me hearing something about Kaggle. While I do not always agree with the structure of some of the contests, particularly with the recent Heritage Health Prize contest license, there is no doubt of the impact Kaggle's contests have had on improving models and interest in those models. I cannot count the number of Meetups that I have gone to where the presentation was the result of the work the presenter had done on a Kaggle competition. I have also been to a number of meetings where the Kaggle guys themselves have joined in the presentation and subsequent conversations.

In fact, Anthony Goldbloom is presenting at the DC R user group tonight, May 17. Anthony will also present at the Philadelphia UseR Group on May 26.

Now comes the rush of the me too contests. On Friday I got an email about an Overstock.com contest for the reclab prize for $1,000,000. I like this contest less than the Heritage Health Prize because in addition to the restrictive software license there is a peer review section rather than a scoring system. I really view this as a weak copy of what the Kaggle guys have already done rather than a step forward. So rather than waste time talking about why I think these competitions need to be open in order to achieve good results, I want to look at ways I think they can be better in general.

My last two companies have spent countless hours working on not only how to get a good answer, but also an answer in a reasonable amount of time. We do that by doing a lot of code optimization and parallelization. We have had a lot of success. However, there usually comes a time where we need to give up a little predictive accuracy to reduce processing time. Given the size of some of these potential data sets and their expected growth it seems logical that some contests should have a computation time element to their scoring system. I have also heard that some contestants have improved their results by tuning or incorporating outside information into their models. While I think this is unfair if the competition specifically prohibits it, I firmly believe these contests also have their value. We have worked on many a model that became a powerful predictor after the addition of outside data or incorporation of expert opinion.
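One way a computation time element could enter a scoring system is sketched below in Python. This is purely hypothetical: no contest mentioned here scores this way, and the function name `time_adjusted_score`, the time budget, and the penalty per doubling of runtime are all assumptions made for illustration.

```python
import math


def time_adjusted_score(accuracy, runtime_s, budget_s, penalty_per_doubling=0.01):
    """Hypothetical contest score: accuracy minus a penalty for exceeding a time budget.

    Entries within budget keep their raw accuracy. Each doubling of runtime
    beyond the budget costs penalty_per_doubling points of accuracy.
    """
    if runtime_s <= budget_s:
        return accuracy
    return accuracy - penalty_per_doubling * math.log2(runtime_s / budget_s)


# Made-up entries: (name, accuracy, runtime in seconds), with a 100 second budget.
entries = [
    ("fast model", 0.90, 100.0),
    ("slow model", 0.91, 800.0),
]
ranked = sorted(entries, key=lambda e: time_adjusted_score(e[1], e[2], 100.0), reverse=True)
for name, accuracy, runtime_s in ranked:
    print(name, round(time_adjusted_score(accuracy, runtime_s, 100.0), 3))
```

Under these assumed settings the slightly less accurate but much faster entry ranks first, which is the trade-off described above. The size of the penalty is a design choice the contest sponsor would have to make.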

These are just two simple ideas, but I think they and others like them have the potential to improve and expand the reach of these contests. The addition of other elements will attract other types of talent to these competitions (HPC, Factor researchers, forensics, etc.) producing even better results.

Finally, at the other end of the spectrum, I always thought a Kaggle Contest Newbie Kit would be a great thing. This could be as basic as pre-loaded R packages like Max Kuhn's Caret with some additions to simplify use. This would lower the barrier to entry and bring the next generation of teams into the game faster so they can contribute real improvements faster. Besides, since most people baseline the data before they move on to more complex models, this would relieve some of that work and give more time to perfect the final submission.

Monday, May 16, 2011

R Websockets

When I was out at R/Finance 2011 in Chicago, I spent some time with Bryan Lewis. He was excited about a project he was working on called R Websockets. Apparently, it uses JavaScript and works like TCP sockets but on the Web. I will post more on this as soon as I get it.

Bryan has just posted the video vignette at Bigcomputing.com. Bryan has now posted all three parts to his video on the website. For the record I also believe his video skills are good.

Sunday, May 15, 2011

In defense of the abuse of Statistics and their charts

I took a Statistics class in 1990 at Cornell from a guy named Lionel Weiss. He started the first class by telling us he was going to teach us to "make numbers lie". His point was that we always need to be very careful in how we look at data, pick our models and verify our results. It sounded so simple and basic that it was hard for me to believe it even needed to be said in an undergraduate class. Boy, was I a naive optimist.

The thing I failed to recognize at the time was that most analysis is not done to figure out what is going on, but to support a position already taken by the analyst or the person the analyst works for. The results of this would be funny if they weren't so scary.

Andrew Gelman wrote a post on his blog recently about research papers out of China which he got from the Statistics blog forum. While no conclusions are drawn from the study, it is suggested that the Chinese researchers knew the conclusion they were supposed to get and therefore got an even better result than was had before. While it would be comforting for us to say that the problem exists over there and not here, I have heard enough stories to say that is not true. I recently talked to an individual who was working on a research paper with another person. The second person developed analysis that supported one of his own beliefs. The first researcher strongly disagreed and pointed to some major problems in the other researcher's analysis. The result was the first researcher had his name removed from the paper, but the second researcher published the paper anyway.

Junk Charts not only goes after visualizations that are misleading because of poor representation choices, but also those that may have been chosen to be misleading. In this second group I would put this graphic posted on Junk Charts. There are so many oversights in this chart comparison that it is hard to argue this was oversight rather than an overzealous analyst trying to support a predetermined conclusion.

On the blog Numbers rule your world I saw this post comparing the life expectancy versus the number of retirement years. This has the problem of comparing averages of two disconnected things, but it is another example of sloppy work or a graph to put forth a desired result.

So more than 20 years later, Lionel was right. We can make "numbers lie". However, if we want to produce useful results we must fight that urge to produce a result that supports our own bias. We should make that extra effort to refute those findings that support our own beliefs before we publish them to the world. Let the data speak to us instead of us telling the data what to say.

Friday, May 13, 2011

NYC R user/Predictive Analytics Meetup on May 12

Last night at the AOL Headquarters in NYC Max Kuhn of Pfizer presented his Caret Package. There were over 100 people in attendance. It is a great package that is often used in Predictive Analytics contests like Kaggle. A copy of his slides is on the Meetup site here. In fact the last four talks have been great. If you are newer to these Predictive Analytics Contests start with Puniyani's slides, followed by Max's, and then on to Alex Lin and John Myles White. This is a superb group that attracts great presenters on some really cool topics. Max will be doing a class on predictive analytics in R on October 16th and 17th at the New York City Predictive Analytics World.

Monday, May 9, 2011

Stonebraker Comments on Caffeine and Linking SciDB to R

Dr. Michael Stonebraker of MIT had a short presentation on SciDB on Friday. SciDB is an interesting project because of its stated goal of addressing the needs of the research community. Dr. Stonebraker is one of the thought leaders when it comes to databases whether you agree with him or not. In 2008 he wrote a paper saying that Hadoop and Map Reduce were a step backwards in terms of technology development. In 2010 Google dropped Map Reduce in favor of Caffeine. I really hope Caffeine uses Java Beans!

The meeting on Friday was to announce the release of SciDB V1.0, which is supposed to be much more feature rich than the current V0.75. It is also interesting to note that the only analytics environment they plan to integrate with is R because that is the only one SciDB users use. While I am not sure doing an open source project that is for research only is the best idea because it may cut down on some useful contributions from the non-academic world, I do think SciDB is an interesting project.

Saturday, May 7, 2011

Gelman writes about Bill James

I always knew that Professor Andrew Gelman of Columbia was a well known Statistician and Social Scientist, but when he writes an article in the Baseball Prospectus now he is famous. It is always good to see a statistician write about a sabermetrician. Although these two fields are really the same it seems they try to separate themselves from each other.

It was interesting to me that Gelman wrote about James as a baseball outsider, not too different from what James was to baseball in 1984 when he wrote the "Inside-out Perspective" article. Baseball may always need the outsider perspective to push it along because its traditions and beliefs are so deep.

I thought one interesting issue that Gelman touched upon was how little of the real work in sabermetrics gets published. When Gelman works on a topic he publishes a paper that discusses his approach, provides an example and the code to run the example yourself. Not so in Sabermetrics. I find little detail in the published articles and very little code. This results in people like James moving away from positions and theories without explanation. I think this hurts the development of Sabermetrics in some ways. My view on how science is developed follows the path of how gravity was discovered, through a series of theories that we accepted and then rejected. First there was nature abhors a vacuum, then there was nature abhors a vacuum up to 32 feet and then finally there was gravity at 32 ft/sec².

It was a fun article to read in preparation for my attendance at the Sabermetrics Seminar at Harvard May 21-22.

Friday, May 6, 2011

Heritage Health Prize goes against Open Source

Today on KDnuggets I read the Heritage Health Prize recently modified the License agreement to make the work product the sole property of Heritage Health. I think this is wrong. If you want to develop a proprietary algorithm go hire someone to do it, but to claim all the work product submitted in the competition even the ones that do not win and therefore are not paid for is just wrong.

Heritage Health cannot have their cake and eat it too. Kaggle has been very clear that the purpose of their site has been to develop faster analytic tools for its customers (the contest sponsors) at a lower cost than they could do otherwise. That is fine and the contest sponsors should use and implement the models submitted to the contest. However, what we have seen is that a collaborative approach wins these competitions, and a sharing of how they did win with the larger community, sometimes on the Kaggle site itself, makes future models even better. If predictive analytics is going to make the leaps forward that it really needs to, it can only happen in an open collaborative environment which not only encourages but demands the sharing of information, algorithms and approaches. If we do not, analytics will cease to progress at the rate that it has been in recent history, and we will return to the bad old days of investment companies jealously guarding their superior infinite random walks from the other investment houses.

It is no coincidence that predictive analytics took off with the advent of open source software. The R environment is a shining example of that which also wins most of the Kaggle contests. It is better than what came before and will continue to improve because of the collaborative contributions of its dedicated users.

Tristan has called for a boycott of this contest. The thread brings out some other outlandish and real issues of concern.

Thursday, May 5, 2011

Things are heating up in Boston

I just came back from the Greater Boston useRs Group. In the past few months this group has really taken off. Last night's opening talk was from frequent presenter Jeffrey Breen on using R with Databases. His topics are so relevant to such a large portion of the community that it has already been posted by R blogs like David Smith's Revolutions Blog.

After that Mike Kane of Yale University presented the EsperR package that was written by Bryan Lewis. While the talk really dealt with working on financial data streams, the application is usable in any field that does analysis of streaming data.

Another attendee of the Meeting was John Verostek. He runs the Predictive Analytics Meetup in Boston. There is a great deal of synergy between the two groups. They have an upcoming event on Text Mining Utilizing the Twitter API with R. The Greater Boston useRs Group's next Meetup is June 1 with Vincent Carey on BioConductor.

Tuesday, May 3, 2011

In Baseball too much data is never enough

A couple of weeks ago I ran across a post for an intern position at TrackMan, which uses information on ball flight to improve performance. They have been very successful in golf. In fact I tried one of their units out over the winter. This job post was more interesting to me because it was looking for an analytic intern for baseball. My first reaction was: just what baseball needs, more data points in a hulking cloud of data. Bill James and the Sabermetrics guys have already culled and studied the baseball stats to death, even throwing out some stats as irrelevant and creating some others that are better predictors of results.

Then I realized the error of my ways. TrackMan is looking to enrich the result data with physical data. So not just if the ball was a strike or hit or even if it was a fast ball or a curve ball, but what was its speed, location and spin at points along its trajectory from mound to plate. This is very cool. In her talk at the NYC Rusers group Amanda Cox presents a heat map of Rivera's pitches crossing the plate versus other pitchers which was a simple piece of the total pitch but explains why Rivera was better in a very clear way (22:00). I believe this has the potential to change the way pitchers pitch and batters hit.

Monday, May 2, 2011

What did I learn from R/Finance 2011

The R/Finance 2011 meeting was a huge success! All the talks were just great. I do not have the time to go through each talk one by one but I do feel there were a couple of themes that ran through the entire conference. The opening speaker, Mebane Faber, and the keynote speaker, John Bollinger, touched on two topics near to my heart. The first is that in many cases the simplified model does nearly as well as the more complex one and in some cases with fewer pitfalls. The second is that models are our attempt to describe reality, but they are not reality. Therefore there is always the possibility that the model is a bad fit for the reality that it is trying to model or there exists a deviation from the model to the reality it is describing. Both phenomena can be exploited for advantage. Never get blindly enamored with a model, and approach things with an open mind. These ideas carried pretty consistently throughout the conference.

Parallel or High Performance Computing for R is becoming a more and more important factor in analytic computing. I am not sure if it is because of the continued growth of data in general, the entrance of HPC into general awareness through the "cloud", or because the really cool problems seem to exist on the edge of our current capability. I believe with the exposure of more users to HPC tools for R it is time to update the various pros and cons of each approach and to benchmark them against each other with a set of typical data sets and models. I do wonder if the recent problems on Amazon's EC2 will slow down the growth of cloud computing. Lost time is one issue here, but the users that lost their data could be much more reluctant to take that risk in the future.

I was also amazed at the traction that Rstudio had among this group of experienced R users. I have always held the belief that experienced users of any software package shy away from IDEs and GUIs and prefer the simple interaction of command line coding. I felt IDEs were the tool for new or mid-level users. In this case, I was wrong. Rstudio appears to provide benefit to the very experienced R user to the point they are willing to change away from what they are currently doing and learn this new tool.

I thought JD Long's Dr. Seuss inspired talk was the most entertaining of the conference. It takes some talent to do that and even more to do it well. His Segue for R package is pretty cool too. Flash talks are a great format, and I wish they were used more often.
