Big Computing: baseball
Showing posts with label baseball. Show all posts

Tuesday, July 5, 2011

Does Attendance Affect Winning in MLB?

I started messing with this idea over the weekend because I like to watch TV and play with the computer at the same time. I was looking at various MLB stats on the ESPN website, and I ended up looking at each team's home attendance numbers. There really are few surprises in the top ten: half are American League teams and half are National League teams, and all except the Twins, Cubs and Dodgers are having good seasons.

The bottom ten was more of a surprise to me: 70% of the teams with poor attendance are from the American League, and the division-leading Cleveland Indians are right there in 26th place. The Indians draw poorly both at home and away.

This got me thinking about whether there is an impact from the mythical 11th man (or 10th man in the National League). So I thought I would look at whether variations in attendance affect the Indians' performance on the field. The spreadsheet below shows the results for the first half of the 2011 season. It is interesting to note that the Indians have a much better record when attendance is less than 10,000 (6-1) or more than 30,000 (7-1) than they do when attendance is between 10,000 and 20,000 (10-8) or between 20,000 and 30,000 (2-4). Correlation is not causality, but at least for this year the Indians do better when the 11th man shows up or stays home, and worse when he kinda shows up.


Attendance Runs scored Runs Allowed Run Diff
8726 7 1 6
9025 3 1 2
9076 8 2 6
9523 8 4 4
9650 9 4 5
9722 7 2 5
9853 3 8 -5
10594 1 0 1
10714 8 3 5
13017 4 2 2
13551 5 4 1
14164 5 4 1
15224 7 8 -1
15278 4 6 -2
15336 2 11 -9
15397 4 7 -3
15498 1 0 1
15568 9 5 4
15849 2 3 -1
15877 3 4 -1
16336 2 8 -6
16346 8 2 6
17568 4 3 1
18107 4 7 -3
19225 3 2 1
20261 0 2 -2
23752 2 4 -2
26408 2 14 -12
26433 3 2 1
26833 12 4 8
27458 0 4 -4
30023 5 2 3
31622 5 4 1
31865 5 1 4
33774 5 4 1
38549 5 1 4
40631 2 1 1
40676 6 3 3
41721 10 15 -5 
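
The won-lost splits can be tallied directly from the table. The snippet below is only a sketch: the band edges follow the ones used in the text, the names `GAMES`, `BANDS` and `record_by_band` are mine, and a game counts as a win when runs scored exceed runs allowed.

```python
# (attendance, runs_scored, runs_allowed), copied from the table above
GAMES = [
    (8726, 7, 1), (9025, 3, 1), (9076, 8, 2), (9523, 8, 4), (9650, 9, 4),
    (9722, 7, 2), (9853, 3, 8), (10594, 1, 0), (10714, 8, 3), (13017, 4, 2),
    (13551, 5, 4), (14164, 5, 4), (15224, 7, 8), (15278, 4, 6), (15336, 2, 11),
    (15397, 4, 7), (15498, 1, 0), (15568, 9, 5), (15849, 2, 3), (15877, 3, 4),
    (16336, 2, 8), (16346, 8, 2), (17568, 4, 3), (18107, 4, 7), (19225, 3, 2),
    (20261, 0, 2), (23752, 2, 4), (26408, 2, 14), (26433, 3, 2), (26833, 12, 4),
    (27458, 0, 4), (30023, 5, 2), (31622, 5, 4), (31865, 5, 1), (33774, 5, 4),
    (38549, 5, 1), (40631, 2, 1), (40676, 6, 3), (41721, 10, 15),
]

# Attendance bands as (low inclusive, high exclusive, label)
BANDS = [
    (0, 10_000, "under 10,000"),
    (10_000, 20_000, "10,000-20,000"),
    (20_000, 30_000, "20,000-30,000"),
    (30_000, float("inf"), "30,000 and up"),
]


def band_for(attendance):
    """Return the label of the band that contains this attendance figure."""
    for low, high, label in BANDS:
        if low <= attendance < high:
            return label
    raise ValueError(f"no band for attendance {attendance}")


def record_by_band(games):
    """Count (wins, losses) per attendance band."""
    records = {label: [0, 0] for _, _, label in BANDS}
    for attendance, scored, allowed in games:
        # Index 0 holds wins, index 1 holds losses (baseball has no ties)
        records[band_for(attendance)][0 if scored > allowed else 1] += 1
    return {label: tuple(wl) for label, wl in records.items()}


for label, (wins, losses) in record_by_band(GAMES).items():
    print(f"{label}: {wins}-{losses}")
```

Running it prints one won-lost line per band, which makes it easy to try different band edges and see how sensitive the pattern is to where the lines are drawn.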

Monday, May 23, 2011

Sabermetrics Seminar

I went to the Sabermetrics Seminar at Harvard this weekend. It was a charity event, and all the speakers came and talked on their own dime. I just want to thank those speakers for giving up their time for such a great cause.

The seminar itself was an eye-opening experience for me. The last seminar I went to was R/Finance in Chicago. That seminar, like most that I go to, is for hard-core statisticians and computer scientists. I believe that of the hundreds of attendees at R/Finance I am one of the few without a PhD. The presentations, with the possible exception of JD Long's honoring of Dr. Seuss, were at a highly technical level. The Sabermetrics Seminar was totally different. The audience varied from the Head of the Harvard Statistics Department and an eminent physicist to people with very limited mathematical education. The presentations also ran the gamut from something that would be taught in a high school physics class to some fairly high-level stuff. The great unifier in the room was that these people loved baseball and were using mathematics to expand their understanding of the game and increase their enjoyment. One speaker, Dan Duquette, former GM of the Boston Red Sox, reminded us of the words of Felipe Alou to "remember to enjoy the game". Tom Tippett, Director of Baseball Information Systems, gave a great Q&A on the state of Sabermetrics in MLB today. I have included a link to a summary of the seminar here.

Sabermetrics is different from the other fields I work in. In Pharma, the models are widely shared, but the data is highly confidential. In Finance, the models are confidential, but the data is basically public. MLB analysts seem to strongly guard both their models and their data, viewing both as proprietary. While I think this makes it a great opportunity for consulting, I believe it may hinder the rate of refinement. Kaggle has shown in a very public way that open collaboration on data and models yields astounding improvements in prediction.

Friday, May 20, 2011

It is a Sabermetrics Weekend, so today's post is Sabermetrics

This weekend I am going to the Sabermetrics Seminar in Boston. Some might think it strange that I am excited about this, given that I never played baseball and I do not watch many games. However, the analytics being done in baseball is developing and expanding at such a rapid pace that there is no way you can enjoy analytics and not be interested. Recently a friend of mine ran into Prof. Bertsimas and asked if he could have a copy of the now famous paper that he wrote predicting the Red Sox would win 100 games this year. Prof. Bertsimas asked "Are you a fan of baseball?" to which he responded "No, I am a fan of statistics".

The development of Sabermetrics in the last 30 years has been about looking at existing data and trying to build predictive models out of that data. It was a good first step and produced some good results. This work revealed that some of the historical statistics, like ERA, were not good predictors of anything, so Sabermetricians created statistics that were better predictors. This is all great, and it has taken Sabermetrics to where it is today.

The problem with the data that has been used to date in baseball is that it is all result-based data. The pitcher threw a strike or a ball, the batter got on base, etc. That is all changing. Welcome to the world of physical data in baseball. This post on Beyond the Boxscore is a good example. It traces the improvement in a player's performance back to the physical location of his pitches: not just that more of his pitches resulted in ground balls, but an attempt to answer why, based on data and not opinion. The technology exists to track the baseball not only as it crosses the plate but throughout the entire ballpark. First, this is going to create an unbelievable amount of data that needs to be studied in ways not currently used in baseball because of its sheer volume. Second, this data is collected in real time, which means the models could be updated in real time. Billy Beane may have had his 3x5 note card in front of him, but the manager of the future may be holding his iPad with feedback on everything up to the last pitch and the suggested options with predicted results of those options.

One of the companies doing this physical data collection in baseball is Trackman. They also recently posted an opening for an R developer. I cannot wait to see what is coming!

Tuesday, April 12, 2011

Analytics, Sabermetrics, Data Mining...Why can't we all just get along?

Sabermetrics is a term coined by Bill James to describe the analysis of baseball through objective evidence. Saber, or more accurately SABR, stands for the Society for American Baseball Research. With Bill James as its advocate, Sabermetrics has changed the way baseball is played, no easy task in a sport so encumbered by tradition. Baseball probably collects more data during a game than any other sport, and each team plays at least 162 games a year. That is rich data territory compared to the 16 regular season games played in the NFL. Sabermetrics has taken a hard look at the core beliefs about which statistics make a good baseball player or team and run them against the cold judgement of analytics. The results showed that some previously treasured statistics like batting average were not as important as once thought, while others like on-base percentage were better indicators. This is predictive analytics at its best. So it is time to call Sabermetrics what it is: analytics.
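
To make the batting average versus on-base percentage point concrete, here is a small sketch using the standard definitions of the two statistics. The two stat lines are hypothetical, made up purely for illustration, and the function and variable names are mine.

```python
def batting_average(hits, at_bats):
    """BA = hits / at-bats."""
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)


# Hypothetical season lines, not real players
contact_hitter = dict(hits=165, at_bats=550, walks=20, hit_by_pitch=2, sac_flies=4)
patient_hitter = dict(hits=140, at_bats=500, walks=90, hit_by_pitch=5, sac_flies=5)

for name, line in (("contact", contact_hitter), ("patient", patient_hitter)):
    ba = batting_average(line["hits"], line["at_bats"])
    obp = on_base_percentage(**line)
    print(f"{name}: BA {ba:.3f}, OBP {obp:.3f}")
```

The hitter with the lower batting average reaches base far more often once walks are counted, which is the kind of gap batting average alone hides.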

It is funny: for all the impact Sabermetrics has had on baseball, I believe it is still limited by the traditions of baseball. Let me give you some examples.

The blog Sabermetric Research talks about Buck Showalter changing the way his base runners play to gain 5 runs per year, which he claims is worth $10 million. That makes sense if the data he is using is good, but the key here is that the decision is claimed to be made solely on the numbers.

Pitching is another story. In baseball a starting pitcher must pitch five full innings in order to earn a win. Many talk about the difference between the ERAs of starting and relief pitchers. The data clearly shows that pitchers working in relief, even when they are the same person, have an overall ERA at least .50 lower than they do as starters. Tango on Baseball touches on the subject in this article. My question is this: if relief pitchers have a better ERA than starting pitchers, and starters are generally accepted to be better pitchers than relievers, why aren't starters being used like relievers? The impact would be huge! A quick pass says this .50 ERA reduction for starting pitchers would result in 40 fewer runs allowed by a team over the course of a season! Using Showalter math that is $80 million. I believe the reason this is not looked at as a solution is tradition. If starting pitchers were used like relievers they would never pitch 5 innings, and therefore would never get a win. This would be a fundamental change in the way baseball is played.
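
The quick pass can be written out. ERA is earned runs per nine innings, so a drop in ERA turns into runs once you pick an innings total. The post does not say how many starter innings it assumed, so the 720 below is my own assumption, back-solved so that the answer matches the 40 runs quoted; the 972-inning line (162 starts of six innings) is a second assumption included only to show how much the answer moves.

```python
def runs_saved(era_drop, innings):
    """Earned runs removed when ERA falls by era_drop over the given innings."""
    return era_drop * innings / 9


# "Showalter math" from the paragraph above: 5 runs are worth $10 million
DOLLARS_PER_RUN = 10_000_000 / 5

# Assumptions (not from the post): season innings thrown by starters
ASSUMED_INNINGS = {
    "back-solved from the 40-run figure": 720,
    "162 starts x 6 innings": 972,
}

for label, innings in ASSUMED_INNINGS.items():
    saved = runs_saved(0.50, innings)
    print(f"{label}: {saved:.0f} runs, ${saved * DOLLARS_PER_RUN:,.0f}")
```

Under the first assumption the arithmetic reproduces the 40 runs and $80 million in the text; a heavier starter workload would make the number larger.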

In defense of Sabermetricians, there has been some discussion that ERA, like BA, is not a very useful statistic. This would mean that conclusions drawn from those statistics may not be as useful as they appear. I have not seen anything on starters versus relievers in terms of CERA, dERA, DICE or DIPS.

Monday, April 4, 2011

Why are the Red Sox better today? Sabermetrics or Construction?

I saw an article this morning from an MIT professor who predicted the Red Sox would win 100 games this year. That is a pretty bold statement, since the Red Sox have only won 100 games in a season three times (1912, 1915 and 1946). However, it got me wondering how the Red Sox have become so good in recent history. I often hear the claim that it is the payroll or the genius of Theo Epstein. Whenever I am with statistics guys, it is the hiring of Bill James and the use of Sabermetrics that made the difference. I have a third theory to put forth as the major reason for the improvement of the Red Sox in recent history: construction at Fenway. Oddly, this started the same year that Bill James was hired by the Red Sox, 2003.

From 1995 to 2002 the Red Sox had a combined record of 695-582, winning 54.42% of their games. From 2003 to 2010 the Red Sox had a combined record of 749-547, winning 57.79% of their games.


Year     W    L   Winning %      Year     W    L   Winning %
2010    89   73    54.94%        2002    93   69    57.41%
2009    95   67    58.64%        2001    82   79    50.93%
2008    95   67    58.64%        2000    85   77    52.47%
2007    96   66    59.26%        1999    94   68    58.02%
2006    86   76    53.09%        1998    92   70    56.79%
2005    95   67    58.64%        1997    78   84    48.15%
2004    98   64    60.49%        1996    85   77    52.47%
2003    95   67    58.64%        1995    86   58    59.72%
Total  749  547    57.79%        Total  695  582    54.42%


So they got better after 2003, Theo is a genius, and Sabermetrics rules baseball. I am not so sure, and I think we reach those numbers based on a Simpson's paradox. Let me explain. If Sabermetrics had been the driving reason for the Red Sox's improvement, they would have gotten better not only at home but away as well. They did not. In fact the Red Sox improved massively at home but got worse on the road. So what is the factor that explains this? In 2003, the same year Bill James was hired by the Red Sox, additional seating was added to Fenway Park for the first time since 1946. It was always known that Fenway was helpful to certain types of hitters and pitchers, and Red Sox teams have always emphasized those players. I believe that the construction made the park even more biased than it was before.



During the period 1995 to 2002 the Red Sox had a better away record than they did from 2003-2010.



Away Record
Year     W    L      %           Year     W    L      %
2010    40   41    49.38%        2002    51   30    62.96%
2009    39   42    48.15%        2001    41   39    51.25%
2008    39   42    48.15%        2000    43   38    53.09%
2007    40   41    49.38%        1999    45   36    55.56%
2006    35   46    43.21%        1998    41   40    50.62%
2005    41   40    50.62%        1997    39   42    48.15%
2004    43   38    53.09%        1996    38   43    46.91%
2003    42   39    51.85%        1995    43   28    60.56%
Total  319  329    49.23%        Total  341  296    53.53%

For home games it is a very different story:



Home Record
Year     W    L      %           Year     W    L      %
2010    49   32    60.49%        2002    42   39    51.85%
2009    56   25    69.14%        2001    41   40    50.62%
2008    56   25    69.14%        2000    42   39    51.85%
2007    56   25    69.14%        1999    49   32    60.49%
2006    51   30    62.96%        1998    51   30    62.96%
2005    54   27    66.67%        1997    39   42    48.15%
2004    55   26    67.90%        1996    47   34    58.02%
2003    53   28    65.43%        1995    43   30    58.90%
Total  430  218    66.36%        Total  354  286    55.31%
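
The home and away comparison can be reproduced from the yearly records in the two tables above. This is only a sketch of the pooled winning percentage for each eight-year window; the names `HOME`, `AWAY` and `pooled_pct` are mine.

```python
# (wins, losses) per season, copied from the home and away tables above
HOME = {
    2010: (49, 32), 2009: (56, 25), 2008: (56, 25), 2007: (56, 25),
    2006: (51, 30), 2005: (54, 27), 2004: (55, 26), 2003: (53, 28),
    2002: (42, 39), 2001: (41, 40), 2000: (42, 39), 1999: (49, 32),
    1998: (51, 30), 1997: (39, 42), 1996: (47, 34), 1995: (43, 30),
}
AWAY = {
    2010: (40, 41), 2009: (39, 42), 2008: (39, 42), 2007: (40, 41),
    2006: (35, 46), 2005: (41, 40), 2004: (43, 38), 2003: (42, 39),
    2002: (51, 30), 2001: (41, 39), 2000: (43, 38), 1999: (45, 36),
    1998: (41, 40), 1997: (39, 42), 1996: (38, 43), 1995: (43, 28),
}


def pooled_pct(table, first_year, last_year):
    """Winning percentage over all games from first_year through last_year."""
    seasons = [wl for year, wl in table.items() if first_year <= year <= last_year]
    wins = sum(w for w, _ in seasons)
    losses = sum(l for _, l in seasons)
    return wins / (wins + losses)


for label, table in (("home", HOME), ("away", AWAY)):
    before = pooled_pct(table, 1995, 2002)
    after = pooled_pct(table, 2003, 2010)
    change = (after - before) * 100
    print(f"{label}: {before:.2%} -> {after:.2%} ({change:+.2f} points)")
```

Pooling games rather than averaging the yearly percentages keeps the shortened 1995 season from being over-weighted, and it matches the totals rows in the tables.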

MIT economist says Red Sox will win 100 games in 2011