Big Computing: statistics
Showing posts with label statistics. Show all posts

Sunday, May 15, 2011

In defense of the abuse of Statistics and their charts

I took a Statistics class in 1990 at Cornell from a guy named Lionel Weiss. He started the first class by telling us he was going to teach us to "make numbers lie". His point was that we always need to be very careful in how we look at data, pick our models and verify our results. It sounded so simple and basic that it was hard for me to believe it even needed to be said in an undergraduate class. Boy, was I a naive optimist.

The thing I failed to recognize at the time was that most analysis is not done to figure out what is going on, but to support a position already taken by the analyst or the person the analyst works for. The results of this would be funny if they weren't so scary.

Andrew Gelman wrote a post on his blog recently about research papers out of China, which he got from the Statistics blog forum. While no conclusions are drawn from the study, it is suggested that the Chinese researchers knew the conclusion they were supposed to get and therefore got an even better result than was had before. While it would be comforting for us to say that the problem exists over there and not here, I have heard enough stories to say that is not true. I recently talked to an individual who was working on a research paper with another person. The second person developed an analysis that supported one of his own beliefs. The first researcher strongly disagreed and pointed to some major problems in the other researcher's analysis. The result was that the first researcher had his name removed from the paper, but the second researcher published the paper anyway.

Junk Charts goes after not only visualizations that are misleading because of poor representation choices, but also those that may have been chosen to be misleading. In this second group I would put this graphic posted on Junk Charts. There are so many problems in this chart comparison that it is hard to argue they were oversights rather than the work of an overzealous analyst trying to support a predetermined conclusion.

On the blog Numbers Rule Your World I saw this post comparing life expectancy with the number of retirement years. The graph has the problem of comparing averages of two disconnected things, and it is another example of either sloppy work or a graph built to put forth a desired result.

So more than 20 years later, Lionel was right. We can make "numbers lie". However, if we want to produce useful results, we must fight the urge to produce a result that supports our own bias. We should make that extra effort to refute those findings that support our own beliefs before we publish them to the world. Let the data speak to us instead of us telling the data what to say.

Saturday, April 16, 2011

doRedis: A parallel back end for R/foreach using Redis.

Bryan Lewis recently wrote a vignette that uses RStudio to analyze some financial data with R/foreach and doRedis. Parallelizing R continues to become more important as users want to run computations on larger and larger quantities of data in reasonable amounts of time, particularly on clusters or in the cloud. doRedis includes the following features:

  • Support for dynamic pools of parallel workers during running computations.
  • Simple cross-platform parallel computing, including at least Windows, GNU/Linux and OS X.
  • Fault-tolerant 
doRedis Vignette

Redis

Monday, April 4, 2011

Why are the Red Sox better today? Sabermetrics or Construction?

I saw an article this morning from an MIT professor that predicted the Red Sox would win 100 games this year. That is a pretty bold statement, since the Red Sox have only won 100 games in a season three times (1912, 1915 and 1946). However, it got me wondering how the Red Sox have become so good in recent history. I often hear comments claiming that it is the payroll or the genius of Theo Epstein. Whenever I am with statistics guys, it is the hiring of Bill James and the use of sabermetrics that made the difference. I have a third theory to put forth as the major reason for the improvement of the Red Sox in recent history: construction at Fenway. Oddly, this started the same year that Bill James was hired by the Red Sox, 2003.

From 1995 to 2002 the Red Sox had a combined record of 695-582, winning 54.42% of their games. From 2003 to 2010 the Red Sox had a combined record of 749-547, winning 57.79% of their games.


Year     W    L  Winning %     Year     W    L  Winning %
2010    89   73   54.94%       2002    93   69   57.41%
2009    95   67   58.64%       2001    82   79   50.93%
2008    95   67   58.64%       2000    85   77   52.47%
2007    96   66   59.26%       1999    94   68   58.02%
2006    86   76   53.09%       1998    92   70   56.79%
2005    95   67   58.64%       1997    78   84   48.15%
2004    98   64   60.49%       1996    85   77   52.47%
2003    95   67   58.64%       1995    86   58   59.72%
total  749  547   57.79%       total  695  582   54.42%
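
The combined figures can be checked by pooling the yearly records rather than averaging the yearly percentages. The following is a minimal Python sketch of that arithmetic, using only the wins and losses from the table above; the names `records_2003_2010`, `records_1995_2002` and `combined_record` are made up for this illustration.

```python
# Yearly (wins, losses) copied from the table above.
records_2003_2010 = {
    2010: (89, 73), 2009: (95, 67), 2008: (95, 67), 2007: (96, 66),
    2006: (86, 76), 2005: (95, 67), 2004: (98, 64), 2003: (95, 67),
}
records_1995_2002 = {
    2002: (93, 69), 2001: (82, 79), 2000: (85, 77), 1999: (94, 68),
    1998: (92, 70), 1997: (78, 84), 1996: (85, 77), 1995: (86, 58),
}


def combined_record(records):
    """Pool all seasons and return (wins, losses, winning percentage)."""
    wins = sum(w for w, _ in records.values())
    losses = sum(l for _, l in records.values())
    return wins, losses, 100.0 * wins / (wins + losses)


for label, records in (("1995-2002", records_1995_2002),
                       ("2003-2010", records_2003_2010)):
    wins, losses, pct = combined_record(records)
    print(f"{label}: {wins}-{losses}, {pct:.2f}%")
```

Pooling matters a little here because the seasons are not all the same length (1995 has 144 decisions and 2001 has 161), so the pooled percentage is weighted by games played.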


So they got better after 2003, Theo is a genius, and sabermetrics rules baseball. I am not so sure, and I think we reach those numbers based on a Simpson's paradox. Let me explain. If sabermetrics had been the driving reason for the improvement of the Red Sox, they would have gotten better not only at home but away as well. They did not. In fact, the Red Sox improved massively at home but got worse on the road. So what is the factor that explains this? In 2003, the same year Bill James was hired by the Red Sox, additional seating was added to Fenway Park for the first time since 1946. It was always known that Fenway was helpful to certain types of hitters and pitchers, and the Red Sox teams have always emphasized those players. I believe that the construction made the park even more biased than it was before.



During the period 1995 to 2002 the Red Sox had a better away record than they did from 2003-2010.



Away Record
Year     W    L     %          Year     W    L     %
2010    40   41   49.38%       2002    51   30   62.96%
2009    39   42   48.15%       2001    41   39   51.25%
2008    39   42   48.15%       2000    43   38   53.09%
2007    40   41   49.38%       1999    45   36   55.56%
2006    35   46   43.21%       1998    41   40   50.62%
2005    41   40   50.62%       1997    39   42   48.15%
2004    43   38   53.09%       1996    38   43   46.91%
2003    42   39   51.85%       1995    43   28   60.56%
total  319  329   49.23%       total  341  296   53.53%

For home games it is a very different story:



Home Record
Year     W    L     %          Year     W    L     %
2010    49   32   60.49%       2002    42   39   51.85%
2009    56   25   69.14%       2001    41   40   50.62%
2008    56   25   69.14%       2000    42   39   51.85%
2007    56   25   69.14%       1999    49   32   60.49%
2006    51   30   62.96%       1998    51   30   62.96%
2005    54   27   66.67%       1997    39   42   48.15%
2004    55   26   67.90%       1996    47   34   58.02%
2003    53   28   65.43%       1995    43   30   58.90%
total  430  218   66.36%       total  354  286   55.31%
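
To put the home and away splits side by side, here is a short Python sketch that computes the change in winning percentage by venue between the two eight-season periods. It uses only the totals from the away and home tables; the names `totals`, `win_pct` and `venue_changes` are hypothetical helpers for this illustration, and the sketch does nothing to test why the split changed.

```python
# Eight-season totals (wins, losses) copied from the away and home tables.
totals = {
    "1995-2002": {"home": (354, 286), "away": (341, 296)},
    "2003-2010": {"home": (430, 218), "away": (319, 329)},
}


def win_pct(wins, losses):
    """Winning percentage on a 0-100 scale."""
    return 100.0 * wins / (wins + losses)


def venue_changes(totals, before="1995-2002", after="2003-2010"):
    """Change in winning percentage, in percentage points, by venue and overall."""
    changes = {}
    for venue in ("home", "away"):
        changes[venue] = (win_pct(*totals[after][venue])
                          - win_pct(*totals[before][venue]))
    overall = {}
    for era in (before, after):
        # Overall record is home plus away for that period.
        wins = sum(totals[era][v][0] for v in ("home", "away"))
        losses = sum(totals[era][v][1] for v in ("home", "away"))
        overall[era] = win_pct(wins, losses)
    changes["overall"] = overall[after] - overall[before]
    return changes


for venue, delta in venue_changes(totals).items():
    print(f"{venue}: {delta:+.2f} percentage points")
```

On these totals the home winning percentage rises by about 11 points while the away percentage falls by about 4, which together produce the overall gain of roughly 3.4 points.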

MIT economist says Red Sox will win 100 games in 2011