Big Computing: All great ideas will be copied

Tuesday, May 17, 2011

All great ideas will be copied

I am not sure when Kaggle actually started running predictive analytics competitions, but in the last year I do not think a day has gone by without me hearing something about them. While I do not always agree with the structure of some of the contests, particularly the license on the recent Heritage Health Prize, there is no doubt about the impact Kaggle's contests have had on improving models and raising interest in those models. I cannot count the number of Meetups I have attended where the presentation was the result of work the presenter had done on a Kaggle competition. I have also been to a number of meetings where the Kaggle guys themselves joined in the presentation and the subsequent conversations.

In fact, Anthony Goldbloom is presenting at the DC R User Group tonight, May 17. He will also present at the Philadelphia UseR Group on May 26.

Now comes the rush of the me-too contests. On Friday I got an email about a contest, the RecLab Prize, worth $1,000,000. I like this contest less than the Heritage Health Prize because, in addition to the restrictive software license, it uses a peer review stage rather than a scoring system. I really view it as a weak copy of what the Kaggle guys have already done rather than a step forward. So rather than waste time talking about why I think these competitions need to be open in order to achieve good results, I want to look at ways I think they can be made better in general.

My last two companies have spent countless hours working not only on how to get a good answer, but also on getting an answer in a reasonable amount of time. We do that through a lot of code optimization and parallelization, and we have had a lot of success. However, there usually comes a time where we need to give up a little predictive accuracy to reduce processing time. Given the size of some of these potential data sets and their expected growth, it seems logical that some contests should have a computation-time element in their scoring system. I have also heard that some contestants have improved their results by tuning or incorporating outside information into their models. While I think this is unfair if the competition specifically prohibits it, I firmly believe these approaches also have value: we have worked on many a model that became a powerful predictor only after the addition of outside data or the incorporation of expert opinion.
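To make the idea concrete, here is a minimal sketch of what a scoring rule with a computation-time element might look like. The function name, the logarithmic penalty, and the weight are all my own illustrative assumptions, not the rules of any actual contest:

```python
import math

def combined_score(accuracy, runtime_seconds, time_weight=0.1):
    """Hypothetical contest score: predictive accuracy penalized by runtime.

    accuracy: fraction of correct predictions (0.0 to 1.0)
    runtime_seconds: wall-clock time the model took to score the test set
    time_weight: penalty per order of magnitude of runtime (arbitrary choice)
    """
    return accuracy - time_weight * math.log10(1 + runtime_seconds)

# A slightly less accurate but much faster model can outscore a slower one:
fast = combined_score(0.90, 60)      # 90% accurate, one minute
slow = combined_score(0.92, 36000)   # 92% accurate, ten hours
```

Under this rule the one-minute model wins, which is exactly the trade-off between accuracy and processing time that a production system has to make.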

These are just two simple ideas, but I think they and others like them have the potential to improve and expand the reach of these contests. The addition of other elements will attract other types of talent to these competitions (HPC, Factor researchers, forensics, etc.) producing even better results.

Finally, at the other end of the spectrum, I have always thought a Kaggle Contest Newbie Kit would be a great thing. It could be as basic as pre-loaded R packages like Max Kuhn's caret, with some additions to simplify use. This would lower the barrier to entry and bring the next generation of teams into the game faster, so they can contribute real improvements sooner. Besides, since most people baseline the data before they move on to more complex models, a kit would relieve some of that work and leave more time to perfect the final submission.
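The baselining step such a kit would automate is routine enough to sketch. A real kit would wrap something like caret in R; the sketch below is my own stand-in in Python using only the standard library, fitting the simplest possible model (always predict the most common training label) so an entrant has a reference score to beat before trying anything clever:

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a predictor that always guesses the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common

def accuracy(predict, features, labels):
    """Fraction of test rows the predictor gets right."""
    hits = sum(predict(f) == y for f, y in zip(features, labels))
    return hits / len(labels)

# Toy data: any real model submitted later now has a concrete score to beat.
train_y = ["no", "no", "yes", "no"]
test_X = [[1], [2], [3]]
test_y = ["no", "yes", "no"]
baseline = majority_baseline(train_y)
score_to_beat = accuracy(baseline, test_X, test_y)  # 2 of 3 correct
```

Shipping even this much pre-wired would let a new team spend its first evening on feature ideas instead of boilerplate.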
