Big Computing: Is there really a Data Scientist shortage or are we victims of our own Predictive Analytics?

Thursday, October 27, 2011

Is there really a Data Scientist shortage or are we victims of our own Predictive Analytics?

Recently I have gone to a number of conventions like Strata NYC and Predictive Analytics World NYC, and I heard the same call over and over. There is a shortage of Data Scientists! It is going to get worse! We need another 190,000 Data Scientists just to fill the need! For those of you who do not know what a Data Scientist is, Mike Driscoll describes it on Quora as a blend of Red-Bull-fueled hacking and espresso-inspired statistics. Awesome!

I started to wonder where this number came from and how it was developed. Why? Well, I am a Data Scientist of sorts, and I am not confident there is a real shortage of people who do this work or who can do this work. It also sets off my alarm bells when I see different people give presentations that all show the same numbers. The chance of so many people arriving at exactly the same numbers independently is about the same as the chance of five people in the US dying from drinking tap water (the same chance as winning Powerball). In 2006 I did a project to estimate the number of R users, on a napkin at a Subway, and that estimate was reused by countless people over the next couple of years. Thank god others have taken a more detailed look at that issue since, and people now use their numbers.

It turns out the 190,000 number comes from the McKinsey Global Institute, which projects that shortfall for 2018. When I found that out, I really began to question the number, which had already been misquoted in most of the presentations I had seen. Some presentations had even presented the 190,000 person shortfall as a current condition rather than a projection for 2018. The term Data Scientist was first coined by Jeff Hammerbacher at Facebook in 2007. I am leery of a projection seven years out for a position that was not even named until four years ago. It reminds me of Morris's paper on predicting MLB batters' season batting averages from their first 40 at bats. Not a very useful training set.
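To put a rough number on why 40 at bats is such a thin training set, here is a minimal sketch using the standard error of a binomial proportion. The .260 "true" average is an assumed, illustrative value and not a figure from the paper, the function name is my own, and treating at bats as independent trials is a simplification.

```python
import math


def batting_average_standard_error(true_average: float, at_bats: int) -> float:
    """Standard error of an observed batting average, treating each at bat
    as an independent Bernoulli trial (a simplifying assumption)."""
    return math.sqrt(true_average * (1.0 - true_average) / at_bats)


# Assumed, illustrative value: a hitter whose true average is .260.
ASSUMED_TRUE_AVERAGE = 0.260

# Compare an early-season sample with a roughly full-season sample.
for at_bats in (40, 500):
    se = batting_average_standard_error(ASSUMED_TRUE_AVERAGE, at_bats)
    print(f"{at_bats:>3} at bats: standard error ~ {se:.3f}")
```

With these assumed inputs the standard error after 40 at bats is about 0.069, versus about 0.020 after 500. An observed average anywhere from roughly .190 to .330 would be within one standard error of a .260 hitter after 40 at bats, which is why such a small sample says so little about the rest of the season.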

While I was writing this I was sent a post from Andrew Gelman's blog. I am a firm believer that no statistics blog post is complete without an Andrew Gelman quote or post, so here it is: "The #1 way to lie with statistics is to just lie." Do not read anything into the coincidence of the quote with this post, but the timing is surprising. Besides, it is a good warning to us all to let the data speak for itself, and not to try to support our own opinions through the use of statistics or the lack thereof.

Now to the McKinsey report. If you are dying to read all 156 pages of the report, here is the link: McKinsey Big Data Report. You will need the Red Bulls and espressos that Mike Driscoll mentioned earlier! I will save you the time. McKinsey talks about how they came to that number on page 134, in the appendix. I see a lot of problems. First, there is no data or sample data, and there is no description of the predictive model used. Without the means to attempt to validate it, I have to question whether the conclusion is valid. Even in their brief description of what they did to come up with these numbers I already see problems. McKinsey says their raw data is based on SOC code numbers from 2008. That is one year after the term Data Scientist was coined, and what is required to be one has changed quite a bit since then. A static description of a moving target may be highly inaccurate. Second, they list the SOC codes they used to determine their population. I see a number of SOC codes that Data Scientists come from that are missing from the start. The most glaring one is physicist. Some of the best Data Scientists in the field are physicists, and there are a lot of them in the field.

Looks like we need to get a Data Scientist to look at how many Data Scientists we are going to need in the future.


  1. I thought I should add links here on what Data Science is and what a Data Scientist is. One is from Harlan Harris and the other from Daniel Tunkelang.

  2. The Appendix gives a description of the "predictive" model McKinsey uses to estimate both the supply of and the demand for analytics talent.

    Here is their demand model:

    "Demand for deep analytical talent in 2018 is driven by the growth of industries that employ these people and the share of this talent employed by these industries, estimated by the percentage of each occupation within a sector that are serving in a deep analytical capacity."

    If one does not question the 'growth of industries' assumption, the rest of the model appears logical.

    The hype is magnified, IMO, because these so-called data scientists are fragmented across several verticals. When they are bunched together, the demand seems mind-boggling.

  3. I think there needs to be greater description than simply a "predictive model". I believe when a report is issued there must be enough description of the models used and the data so that one can validate the conclusions. There is simply not enough in the McKinsey report to do that.

    However, I do agree that the position is fragmented across several verticals, which makes defining the position and predicting its future demand even more challenging.

    I am a big believer that the system finds a way, and I believe it is already doing that. Companies and universities are working to create more people with the ability to do these jobs as they are currently defined. Countless software companies are working on tools that let employees with less technical capability take over some of the workload in this area, or that push functions completely to automated processing with no human oversight.
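The demand model quoted in the second comment can be read as a sum over industries: projected employment in each industry times the share of that workforce serving in a deep analytical capacity. Here is a minimal sketch under that reading. Every industry name, headcount, growth rate, and share below is made up for illustration; none of it comes from the McKinsey report, and the compound-growth form is my own assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Industry:
    name: str
    base_employment: float   # hypothetical headcount in the base year
    annual_growth: float     # hypothetical yearly growth rate
    analytical_share: float  # hypothetical share in deep analytical roles


def projected_demand(industries, years):
    """Sum of (projected employment x deep analytical share) over industries."""
    total = 0.0
    for industry in industries:
        future_employment = (
            industry.base_employment * (1.0 + industry.annual_growth) ** years
        )
        total += future_employment * industry.analytical_share
    return total


# Hypothetical inputs, for illustration only.
industries = [
    Industry("finance", 1_000_000, 0.02, 0.010),
    Industry("retail", 2_000_000, 0.01, 0.002),
]

# Same model, but with one extra percentage point of growth per industry.
faster_growth = [
    replace(i, annual_growth=i.annual_growth + 0.01) for i in industries
]

baseline = projected_demand(industries, years=10)
optimistic = projected_demand(faster_growth, years=10)
print(f"baseline demand after 10 years:   {baseline:,.0f}")
print(f"with +1 point of annual growth:   {optimistic:,.0f}")
```

The point of the second run is the one raised in the comment: the arithmetic is simple and logical, so the answer depends almost entirely on the growth and share assumptions that go into it.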