Big Computing: October 2011

Monday, October 31, 2011

Halloween is for Brain Teasers and Paradoxes

In the last few weeks quite a few people have done postings on brain teasers so I thought I would post a bunch of them here.


IQ Test: "Solve it" Pure Logic test from Harvard University. If 1 = 5, 2 = 25, 3 = 325, 4 = 4325, then 5 = ?


(Thanks Vijay Kumar)


If you choose an answer to this question at random, what is the chance you will be correct?
A) 25%
B) 50%
C) 60%
D) 25%


(Thanks Nathan of Flowing Data)


If you have three light switches in one room that each turn on one lamp in another room, and you can only enter the room with the lamps once how do you determine which switch goes to which lamp?


Good video on the Monty Hall Problem (although I would stay with the first door because I already have a car, but I could use a goat)




Finally here is a cool website of Riddles and Brain Teasers 

If you have any other ones that you like send them to me and I will add them.

Thursday, October 27, 2011

Is there really a Data Scientist shortage or are we victims of our own Predictive Analytics?

Recently I have gone to a number of conventions like Strata NYC and Predictive Analytics World NYC. I heard the same call over and over. There is a shortage of Data Scientists! It is going to get worse! We need another 190,000 Data Scientists just to fill the need! For those of you who do not know what a Data Scientist is, Mike Driscoll describes it on Quora as a blend of Red-Bull-fueled hacking and espresso-inspired statistics. Awesome!

I started to wonder where this number came from, and how it was developed. Why? Well, I am a Data Scientist of sorts, and I am not confident there is a real shortage of people who do this work or who can do this work. It also raises my alarm bells when I see different people give the same presentation with the same numbers. The chance of so many people coming to exactly the same numbers independently is about as likely as five people in the US dying from drinking tap water (the same chance as winning Powerball). In 2006 I did a project to estimate the number of R users on a napkin at a Subway, and that estimate was re-used by countless people over the next couple of years. Thank god others have taken a more detailed look at that issue since, and people now use their numbers.

Turns out the 190,000 number comes from the McKinsey Global Institute, which projects the shortfall by 2018. When I found that out, I really began to question the number, which had already been misquoted in most of the presentations I had seen. Some presentations had even presented the 190,000 person shortfall as a current condition rather than a projection for 2018. The term Data Scientist was first coined by Jeff Hammerbacher at Facebook in 2007. I am leery of a projection seven years out for a position that was not even named until four years ago. Reminds me of Morris's paper to predict batting averages for the season for MLB batters using their first 40 at bats. Not a very useful training set.

While I was writing this I was sent a post from Andrew Gelman's blog. I am a firm believer that no statistics blog post is complete without an Andrew Gelman quote or post, so here it is: The #1 way to lie with statistics is...to just lie. Do not read anything into the coincidence of the quote with this post, but the timing is surprising. Besides, it is a good warning to us all to let the data speak for itself, and not try to support our own opinions through use of statistics or lack thereof.

Now to the McKinsey Report. If you are dying to read all 156 pages of the report here is the link: McKinsey Big Data Report. You will need the Red Bulls and espressos that Mike Driscoll mentioned earlier! I will save you the time. McKinsey talks about how they came to that number on page 134 in the appendix. I see a lot of problems. First, there is no data or sample data, and there is no description of the predictive model used. Without the means to attempt to validate, I have to question if the conclusion is valid. Even in their brief description of what they did to come up with these numbers I already see problems. McKinsey says their raw data is based on SOC code numbers from 2008. That is one year after the term Data Scientist was coined, and what is required to be one has changed quite a bit since then. A static description of a moving target may be highly inaccurate. Second, they list the SOC codes they used to determine their population. I see a number of SOC codes that Data Scientists come from that are missing from the start. The most glaring one is physicist. Some of the best Data Scientists in the field are physicists and there are a lot of them in the field.

Looks like we need to get a Data Scientist to look at how many Data Scientists we are going to need in the future.

Wednesday, October 26, 2011

Vegas Odds for World Series Champions 2011 at start of MLB playoffs

It is always fun to look back with knowledge and laugh at how off the oddsmakers were because we always knew better. The 2011 MLB playoffs have been a shining example of this, with the teams given an 8-to-1 and a 12-to-1 chance vying for the title of World Series Champions.

Yes, few believed the Texas Rangers could become American League Champions having to compete against foes the likes of the New York Yankees (4-to-1), Tampa Bay Rays (9-to-1) and Detroit Tigers (7-to-1). Yet they did.

Fewer still believed the St. Louis Cardinals had any hope against powerhouse teams like the Phillies (9-to-5), Brewers (15-to-2) and Diamondbacks (14-to-1). That is why they play the games.

With all teams being equal, the chance of any two particular teams being in the World Series is only about 6% (one in four from each league), which is a rare event to begin with. However, weighting your brackets with Vegas odds usually improves on that. Not this year. Payoff numbers mess it up a little, but the oddsmakers thought the chance of a St. Louis versus Texas World Series was about 2%. So enjoy the rarity.
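As a rough sketch of the arithmetic above: fractional odds of a-to-b against imply a win probability of b/(a+b), and with four playoff teams per league the equal-chance figure is 1/4 times 1/4. The normalization step below (rescaling each league's implied title probabilities so they sum to 1, and treating each share as the chance of reaching the World Series) is an assumption made for illustration. It is not necessarily how the roughly 2% figure was derived.

```python
from fractions import Fraction

def implied_probability(against, in_favor=1):
    """Convert fractional odds (e.g. 9-to-5 against) into an implied probability."""
    return Fraction(in_favor, against + in_favor)

# Pre-playoff World Series odds quoted in the post, as (against, in_favor).
AL_ODDS = {"Yankees": (4, 1), "Rays": (9, 1), "Tigers": (7, 1), "Rangers": (8, 1)}
NL_ODDS = {"Phillies": (9, 5), "Brewers": (15, 2), "Diamondbacks": (14, 1), "Cardinals": (12, 1)}

def pennant_shares(odds):
    """Assumption: a team's share of its league's implied title probability
    stands in for its chance of reaching the World Series."""
    implied = {team: implied_probability(*o) for team, o in odds.items()}
    total = sum(implied.values())
    return {team: p / total for team, p in implied.items()}

# Baseline: four teams per league, all equally likely to advance.
equal_chance = Fraction(1, 4) * Fraction(1, 4)

# Odds-weighted estimate for a Texas versus St. Louis World Series.
weighted_chance = pennant_shares(AL_ODDS)["Rangers"] * pennant_shares(NL_ODDS)["Cardinals"]

print(f"Equal teams:   {float(equal_chance):.2%}")
print(f"Odds-weighted: {float(weighted_chance):.2%}")
```

Under these assumptions the equal-teams figure is 6.25% and the odds-weighted figure lands between 2% and 3%, in the same neighborhood as the number quoted above.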

Tuesday, October 25, 2011

Some Great R Users Meetups are coming up

If only I had a ton of frequent flier miles I would go to them all. However, I am running a little low, so I will have to pick and choose. Here are the ones that I think will be awesome:


Nov 3 (NYC Predictive Analytics) Hidden Markov Models in a Nutshell


Description: Hidden Markov Models (HMMs) have emerged as a powerful paradigm for modeling stochastic processes and pattern sequences. Originally, HMMs have been applied to the domain of speech recognition, and became the dominating technology. In recent years, they have attracted growing interest in automatic target detection and classification, computational molecular biology, bioinformatics, computational finance, mine detection, handwritten character/word recognition, and other computer vision applications. The purpose of this talk is to define HMM and its categories, present the corresponding underlying problems, and explain the step-by-step working of the most popular procedure for HMM parameter estimation: Baum-Welch algorithm.

Bio: Oualid Missaoui is a researcher with Pipeline Financial Group, Inc. where he is in charge of developing a data mining and pattern recognition based algorithmic trading framework. He received his Ph.D. in Computer Engineering & Science for his research in the fields of machine learning, landmine detection, and image processing, from University of Louisville (2010). He earned his engineering degree in Signal and Systems and M.Sc. in Applied Mathematics from Ecole Polytechnique de Tunisie (2003, 2005).

This group is one of my favorite groups to go to, and any time there is a talk on Markov I am there.

Nov 8 (NYC R Meetup) Parallel R with Hadoop


R is free, open-source, and in many ways a data scientist's dream ... but it strains under new-age Big Data problems. One solution is to use Hadoop's scalable, parallel computing framework to drive R. In this talk, consultant and author of the forthcoming book Parallel R, Q Ethan McCallum will walk through the what, how, and why of getting R to dance with the elephant.
We will also have a lightning talk from JunHo Cho, who will introduce his tool RHive, which integrates R with Hive. 
Q recently co-wrote a book with Steve Weston of R foreach fame. I am excited to read it when it comes out.


Nov 10 (Greater Boston R users) Teaching Statistics with Open Source Tools



Nicholas Horton, Associate Professor of Mathematics and Statistics at Smith College, will be presenting on how to ease the use of R in an academic environment. This talk is hosted by Gordon College and we know it is a bit out of town (and early in the day for some), but we hope you can attend. It will be a great talk for beginner R users or those who haven't made the switch to R, but want to!
Summary: Professor Horton will demonstrate the use of the mosaic package, which was created with instructors and students in mind, and to help facilitate the use of modeling in introductory statistics, science and calculus courses. He'll give an overview of these systems for use in introductory statistics courses and undergraduate research projects. No prior experience with R or the mosaic package necessary. Minor refreshments will be provided.
I am always interested in how people approach things with steep learning curves.


Nov 14 (DC R User Group) Moneyball Meets R: Sabermetrics with the MLB Pitch Data Set by Mike Driscoll


For our next meetup we'll have some fun with Mike Driscoll (fellow Data and R Geek, organizer of the Bay Area R meetup group, CTO of MetaMarkets, O'Reilly Strata/OSCON speaker, and author of the "The Three Sexy Skills of Data Geeks" blog post) while he talks about the validation of Bill James’ sabermetrics approach to batting performance using 30 years of Major League Baseball statistics, and a derived predictor for batters’ salaries using R.
He will highlight R’s functional programming features, its compact syntax for statistical modeling, and its ease of connectivity with persistent data stores. This talk will emphasize techniques and approach over detail. 
I am a huge sabermetrics and Mike Driscoll fan. I saw Mike speak most recently at Strata NYC where he probably skipped the sabermetrics stuff because both the Mets and Yankees were already out of it.

Nov 15 (Boston Predictive Analytics) Big Data and Hadoop: Applications from Enterprises and Individuals


6:30 - 6:50:  Overview of Big Data and Hadoop:  Jeffrey Kelly, who is an industry analyst covering Big Data, will be presenting the state of the industry.  In addition to early adopting web-based companies, he will be covering a variety of "use cases" that are now occurring across more industry verticals.

6:50 - 7:00:  Web/Mobile and Big Data:  Sanjay Vakil, who is a technology manager at Trip Advisor, will be presenting past and current Big Data projects that their team have been developing.

7:00 - 7:30:  Enterprise Case Studies:  Rob Lancaster and Patrick Angeles of Cloudera, a company which provides enterprise solutions that extends upon Hadoop functionality, will be presenting a high-level overview of big data and associated applications.  Secondly, they will be presenting a variety of "use cases" including diving into technical details of Hadoop and related software.

7:30 - 8:00:  "Open Data" Project:  Satish Gopalakrishnan and Vineet Manohar will be presenting their Wikipedia / Hadoop project which they created as part of the Hack/Reduce event this past summer at Microsoft NERD.  Their computer program was voted the coolest hack using Hadoop with open data.

I love this short talk format and Hadoop is the hot buzz word of the year.





Monday, October 24, 2011

St. Louis Cardinals have the formula to beat the Texas Rangers

In most cases pitching wins a series. That is especially true for the World Series. This year the St. Louis Cardinals are in the World Series against the Texas Rangers not because of their pitchers, but because of how they have managed their pitchers. However, in order to beat the bats of the Texas Rangers, the St. Louis Cardinals are going to have to take their approach to pitching to the next logical level. The problem is that tradition will be standing in the way.

Baseball has a long tradition of honoring the starting pitcher. In fact, the idea of relief pitching is a relatively new concept. These traditions run so deep in baseball that a starting pitcher only earns the win if he pitches five innings, while that rule does not always hold true for a reliever. In some cases a starting pitcher's compensation is even tied to the number of wins he has. However, maximizing starting pitchers' wins may reduce the overall wins for the team, particularly in the St. Louis Cardinals' case.

It has long been known that relief pitchers have an average ERA about 0.5 better than starting pitchers. Often the reason given for this is that starting pitchers must pace themselves while relievers do not. If pacing themselves results in the better pitcher having worse numbers than the weaker pitcher, why is this a good way to play the game? Over the course of a 162 game season this difference would account for roughly 80 additional runs being allowed or, according to Bill James's Pythagorean expectation, about 9 fewer wins. That is a significant impact.
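A minimal sketch of that back-of-the-envelope calculation follows. It assumes the 0.5 run gap applies to every nine innings of a 162 game season, and it uses a hypothetical league-average team that scores and allows 700 runs as the baseline. Neither assumption comes from the post, and the classic exponent of 2 is used for the Pythagorean expectation.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Bill James's Pythagorean expectation: RS^x / (RS^x + RA^x)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

GAMES = 162
ERA_GAP = 0.5         # starter versus reliever ERA gap cited in the post
BASELINE_RUNS = 700   # hypothetical: a .500 team scoring and allowing 700 runs

# Assumption: the 0.5-run gap applies to every nine innings of the season.
extra_runs = ERA_GAP * GAMES

baseline_wins = GAMES * pythagorean_win_pct(BASELINE_RUNS, BASELINE_RUNS)
adjusted_wins = GAMES * pythagorean_win_pct(BASELINE_RUNS, BASELINE_RUNS + extra_runs)
wins_lost = baseline_wins - adjusted_wins

print(f"Extra runs allowed: {extra_runs:.0f}")
print(f"Expected wins lost: {wins_lost:.1f}")
```

With these inputs the extra runs come to 81 and the expected loss is a bit under 9 wins, consistent with the rough figures above. A different baseline run total would shift the result somewhat.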

Throughout this year's MLB playoff season the St. Louis Cardinals have been most successful when they have used a lot of pitchers for a few innings each in a game. Yes, there is the notable exception of Carpenter's innings, which were impressive but not the most likely path to victory.

Below is a quick breakout of the Cardinals' ERA when their pitchers pitch 3 innings or less versus their ERA when pitchers go into the fourth inning or beyond.


Series       ERA (<=3 innings)   ERA (>3 innings)
STL v PHI    3.64                6.5
STL v MIL    3.66                10.5
STL v TEX    2.16                6.3

To me this analysis would indicate the best chance the Cardinals have of winning the World Series is to never pitch a pitcher more than 3 innings against the Texas Rangers. 

Thursday, October 20, 2011

The best talk at Predictive Analytics World NYC 2011 happened at night

I have enjoyed my week at Predictive Analytics World in New York. There have been some good talks, but I was missing the samples of code and concrete examples that I am used to from the simply outstanding geek gatherings (called Meetups) that I go to in New York City all the time.

Professor Jay Emerson, Dr Awesomeness, satisfied that need with a simply outstanding talk last night to around 300 hardy PAWs members who finished off a 14 hour day with Jay's talk. It was worth it.

Jay's talk had everything from humor, to examples, to sample code, to usable advice. I just want to thank Jay again for making the effort to entertain and educate us all.

Link to Dr Jay Emerson's slides


Tuesday, October 18, 2011

Floop, the iPhone polling App, appears on ABC news in advance of Presidential Debates

Today on the ABC News show Topline they did an interview with Richard Schultz on the iPhone polling App, Floop. Why is a Politics show covering Floop? The answer is simple. There is a Presidential Debate in Las Vegas tonight, and what better way to participate in it than with Floop?



When I have watched debates in the past, I looked on in disbelief afterward as the talking heads on television totally missed the boat on what just happened and what was important. The next day is rarely better, when pollsters like Quinnipiac University release canned polling results that cover topics that are either too broad or uninteresting. Now I do not need them. With Floop I can do my own polls, and see the results from real people in real time. Their opinions are given in both quantitative and textual feedback.

Now is the time for the people to take back our government. We will do this by communicating directly with each other on social networks and cutting out the talking heads and spin doctors who tried to shape our opinions in the past. Thomas Jefferson said a little revolution every now and then is good for a democracy. Well, a revolution has arrived: an election process truly driven by the people with social networks. Tools like Floop lead the way.

MLB playoffs format gives the worst teams best shot at a World Series title?

I was going through the chat sections of my LinkedIn groups, and there was a thread there that talked about making the MLB playoff games a best of seven series and the World Series a best of nine series. The argument is that longer series will yield more wins for the superior teams.

I have to admit I never understood why baseball structures the playoffs and World Series the way that it does. First they play a 162 game season to determine the eight best teams, and given the number of games there is a high probability they really are the best eight teams. Then the first round is a best of five series, which gives the weaker team a better shot at winning than a seven game series would. Then MLB finishes off the playoffs with two rounds of best of seven. It is an illogical progression.

If the goal is to give the best team the best shot without increasing the number of games in the playoffs the best way to do that would be to make the first two rounds a best of seven series and the World Series a best of five.

To give you an idea of the difference between a best of five series and a best of seven series let's run some numbers. In a typical MLB season the best teams win about 60% of the time while the worst teams win about 40% of the time. Using that as their expected chance of winning a game, the chance of the weakest team in baseball beating the best team in baseball in a best of five series is roughly 32%, while in a best of seven series that team has roughly a 29% chance of winning. Yup, that is a staggering three percent improvement in the chances the better team wins the series. Interesting to the statistician but irrelevant to the typical baseball fan.
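Those two figures can be checked with a short calculation. The sketch below assumes every game is independent and that the underdog's 40% single-game chance holds for every game (no home-field or pitching-rotation effects); the function name is just illustrative.

```python
from math import comb

def series_win_probability(p, best_of):
    """Probability that a team winning each game with probability p
    takes a best-of-N series (games assumed independent)."""
    needed = best_of // 2 + 1
    # The series winner clinches with its final win after losing k games,
    # where k can be anything from 0 up to needed - 1.
    return sum(
        comb(needed - 1 + k, k) * p ** needed * (1 - p) ** k
        for k in range(needed)
    )

underdog = 0.40  # single-game win chance used in the post
for n in (5, 7):
    print(f"Best of {n}: {series_win_probability(underdog, n):.1%}")
```

This prints 31.7% for the best of five and 29.0% for the best of seven, matching the rounded numbers above.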

Personally I think the playoffs should be switched to a best of seven series all the way through, more for uniformity than anything else.

Wednesday, October 12, 2011

Theo Epstein to leave Red Sox for Cubs. Welcome to the new age of Baseball

The rumor is that Theo Epstein, current general manager of the Red Sox, has agreed to a $15M deal to become President and General Manager of the Cubs. This is another step by the Chicago Cubs in building a world class sabermetric based front office in hopes of building a contender on the field. The Chicago Cubs already added a major piece to this puzzle last year when they added Statistician Ari Kaplan to the front office. Ari's stuff is brilliant. The first Sabermetric book I ever read was one he co-authored called Baseball Hacks. I like the way the Chicago Cubs are heading here and wish the best of luck to them in the future. Of course, statistically speaking, is there really such a thing as luck?

Tuesday, October 11, 2011

Going to Predictive Analytics World NYC for Big Data and Rstats

For the week of October 16th I am going down to NYC to immerse myself in more cool classes, speakers and forums at Predictive Analytics World (PAWs). This conference brings together some of the people whose work I just love, some of whom I know very well and others not at all. Below are some of the highlights of the conference in my eyes, which track pretty closely with the talks and classes I will be going to.

Max Kuhn, a heavy R user, will be giving an R bootcamp and a predictive analytics in R class. I will be attending at least one of these. Max has created some heavily used R packages including Caret and ODFweave. Every presentation I have seen covering the use of R in analytics competitions like Kaggle.com starts with using Max's Caret Package.

Matthew Flynn who is the Director of Claim Research at Travelers in Hartford is giving a talk on creating more analytical bandwidth with R. I believe Travelers is a big SAS user which seems to match up with Matthew's bio so it will be interesting to hear his views on R particularly if it has to do with using the R connector to SAS. However, I am always excited to find another R user in Connecticut! In that light here is an open offer to Matthew or any other R user in Connecticut to join the Connecticut R users meetup, and I promise the first round will be on me.

Anthony Goldbloom, CEO of Kaggle, is giving a talk on Predictive Modeling Competitions. I have been to a number of Anthony's talks, and they are great fun. I have also been to a number of talks by people who have competed in Kaggle competitions, and the approaches and results are simply mind blowing. Kaggle has helped take Predictive Modeling to the next level. It provides a fun environment where Data Scientists can push their art and skill against other talented people. Truly great stuff.

Usama Fayyad who is the CTO of ChoozOn is giving a talk on Predictive Analytics and Big Data. Given that Usama was the Chief Data Scientist at Yahoo, and I am sure that he still plays in the REALLY big data pool, I am dying to hear his approaches and views on attacking the various problems.

John Elder of Elder Research is giving a talk on the best and the worst of Predictive Analytics. I love these types of talks. When the mistakes are presented to us they are amusing, but I promise you I will not make those mistakes in the future because I do not want to be in one of John's talks.

Robert Broughton will be giving a talk on Predictive Analytics and law enforcement. I have seen more and more articles and blog posts on this area in the last year. I believe it has huge potential to reduce crime and make the world a safer place. I am eager to hear this talk.

There are more great speakers who may interest you more than the ones that interest me, but my dance card is full and I am tired of typing so I will just give you the link to the list of speakers.

Wednesday, October 5, 2011

Echoing the call to help Data Without Borders

While the job of the data scientist can include increasing web traffic and clicks and maximizing revenue, I have always found this area of application to be unsatisfying. When we first formed Revolution Computing our slogan was "we cure cancer". We wanted to run a profitable business, but it was also critical to us that the business result in positive impacts on people's lives and on society in general. I like to think that the Open Source community and particularly the R community keeps that idea as one of its driving motivations.

At Strata NYC 2011 Robert Kirkpatrick of UN Global Pulse asked for help from the Data Scientist community to make the world a better place, and offered examples of how our community could help the world community.

I enjoyed this talk, but it was a request for help.

On the final day of Strata NYC Drew Conway and Jake Porway launched Data Without Borders and went into action. They are going to run DataDives with some of the best Data Scientists around as volunteers to provide value and insight to NGOs and nonprofits to improve all aspects of their operations. Here is a video of that presentation:
The result so far is an oversubscribed DataDive in NYC and another DataDive in San Francisco filling up fast. That is what I love about this community. They give freely of their time and knowledge to better the world around them. The best people are people of action, not words.

Another article on Data Without Borders. It is also good to see that David Smith and Revolution Analytics continue with the idea of supporting efforts for the betterment of society.