Big Computing: analytics
Showing posts with label analytics. Show all posts

Friday, May 20, 2011

It is a Sabermetrics Weekend so today's post is Sabermetrics

This weekend I am going to the Sabermetrics Seminar in Boston. Some might think it strange that I am excited about this, given that I never played baseball and do not watch many games. However, the analytics being done in baseball is developing and expanding at such a rapid pace that there is no way you can enjoy analytics and not be interested. Recently a friend of mine ran into Prof. Bertsimas and asked if he could have a copy of the now famous paper he wrote predicting that the Red Sox would win 100 games this year. Prof. Bertsimas asked, "Are you a fan of baseball?" to which my friend responded, "No, I am a fan of statistics."

For the last 30 years, the development of Sabermetrics has meant looking at existing data and trying to build predictive models out of that data. It was a good first step and produced some good results. This work revealed that some of the historical statistics, like ERA, were not good predictors of anything, so Sabermetricians created statistics that were better predictors. This is all great, and it has taken Sabermetrics to where it is today.

The problem with the data that has been used in baseball until today is that it is all result-based data: the pitcher threw a strike or a ball, the batter got on base, etc. That is all changing. Welcome to the world of physical data in baseball. This post on Beyond the Boxscore is a good example. It traces the improvement in a player's performance back to the physical location of his pitches, not just noting that more of his pitches resulted in ground balls, but attempting to answer why based on data, not opinion. The technology exists to track a baseball not only as it crosses the plate but within the entire ballpark. First, this is going to create an unbelievable amount of data that needs to be studied in ways not currently used in baseball because of its sheer volume. Second, this data is collected in real time, which means the models could be updated in real time. Billy Beane may have had his 3x5 note card in front of him, but the manager of the future may be holding his iPad with feedback on everything up to the last pitch, along with suggested options and the predicted results of those options.

One of the companies doing this physical data collection in baseball is Trackman. They also recently posted an opening for an R developer. I cannot wait to see what is coming!

Wednesday, April 20, 2011

Packbots are the Brainchild of a Great Guy

When I was a sophomore in college I could not find an apartment, so a friend of mine let me stay in his apartment for most of the first term. The same guy helped me write a forecasting program that was my senior project two years later. He was a great guy who was always there. His name is Todd Pack. When I met him, the only thing he wanted to do was build robots. He spent much of his free time either writing code on his computer (named "Bear") or building robots in the lab. It is what he loved to do.

Todd got his PhD out of Vanderbilt with a thesis on IMA, and took a job at iRobot. Yup, they make the Roomba vacuum cleaner. It is my understanding that the Roomba was created because in the early 1990s there were no government contracts for robots, so they needed a consumer product to survive.

I haven't seen or heard from Todd in a number of years. Then about a month ago I went to a talk by Yann LeCun on vision and learning algorithms for robots. The videos of the robots reminded me of the work Todd Pack did in the labs back at Cornell all those years ago. Yesterday I read an article saying that the US was sending robots to Japan to help out at the Fukushima Nuclear Power Plant (article). It turns out the robots come from Dr. Pack's company, iRobot, and I am sure he had a hand in their design. I am so happy that Todd has created a career doing what he loves, and his work absolutely saves lives. I have my guess why they call them Packbots, but it is only a guess.

Now if we could only make robots that do not look like Johnny number 5.

Here is a recent video on MSNBC featuring iRobot's Packbots.

Tuesday, April 19, 2011

Home Field Advantage

With the Red Sox having their only wins so far this year at home, along with their reputation for being tooled to perform better at Fenway, I decided to take a quick look at American League teams' home versus away wins in the Red Sox Bill James era (2003-2010). Listed below are the differences between home wins and away wins for each team per year in that period, along with the corresponding mean and standard deviation.

Team            2010  2009  2008  2007  2006  2005  2004  2003   Mean   STD
Tampa Bay          2    20    17     8    21    13    12    19  14.00  6.59
Boston             3    17    17     6    10    13    12    11  11.13  4.88
NY Yankees        11    11    17    10     3    11    13    -1   9.38  5.71
Toronto            5    13    18    15    13     6    13    -4   9.88  7.10
Baltimore          8    14     7     1    10    -2    -2     9   5.63  5.93
Chi White Sox      2     7    19     4     8    -5    19    16   8.75  8.65
Minnesota         12    11    18     3    12     7     6     6   9.38  4.78
Cleveland          7     5    19     6    10    -7     8     8   7.00  7.13
Kansas City        9    16     1     1     6     7     8    -3   5.63  5.90
Detroit           23    16     6     2    -3    12     4     3   7.88  8.51
LA Angels          6     1     0    14     1     3    -2    13   4.50  6.02
Texas             12     9     1    19    -2     7    13    15   9.25  7.07
Oakland            6    11    11     4     5    -3    13    18   8.13  6.47
Seattle            9     5     9    10    10     9    13     7   9.00  2.33
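
For anyone who wants to check the Mean and STD columns, here is a minimal Python sketch. It assumes the STD column is the sample standard deviation (dividing by n - 1), which is my reading of the numbers rather than something stated above. Only three of the rows are copied in, and the names `home_minus_away` and `summarize` are made up for this illustration.

```python
from statistics import mean, stdev

# Home wins minus away wins, 2010 back to 2003, copied from the table above.
home_minus_away = {
    "Tampa Bay": [2, 20, 17, 8, 21, 13, 12, 19],
    "Boston": [3, 17, 17, 6, 10, 13, 12, 11],
    "Seattle": [9, 5, 9, 10, 10, 9, 13, 7],
}


def summarize(diffs):
    """Return (mean, sample standard deviation) for one team's differences."""
    # stdev() divides by n - 1; pstdev() would divide by n instead.
    return mean(diffs), stdev(diffs)


summary = {team: summarize(diffs) for team, diffs in home_minus_away.items()}

for team, (avg, sd) in summary.items():
    print(f"{team:<10} mean={avg:.3f} std={sd:.3f}")
```

Under that assumption the three rows agree with the table to two decimals, which suggests the STD column was computed with the n - 1 form.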

It is no surprise that the Red Sox have a strong Home/Away Wins record (2nd), but I was surprised that the Tampa Bay Devil Rays were even better in this statistic. Also note how dominant the AL East is in Home Wins versus Away Wins. Still, for the Red Sox in 2011, having all of your wins come at home is bad.

Friday, April 15, 2011

Would We Have Found Out About San Diego Basketball Through Analytics?

This week the University of San Diego basketball program had a number of its players arrested for shaving points in a game in February of 2010 and trying to do the same thing in a game against UCR in February of 2011 (News Story). It appears that these guys were caught by human intelligence. This type of cheating is bad for both the sports organizations (NCAA, NFL, NBA, MLB) and the major betting community. It is bad for the sports organizations because who, except WWE fans, is going to watch a fixed match? It is bad for the betting community because they really make their money on the juice they charge gamblers. The betting community wants fair games that split the money evenly on either side of the spread. Anything that shifts that balance is a problem for their business model. People believing that games are fixed could also reduce the amount of money bet on games, which is bad for the bookies too.

In the book Freakonomics, Levitt and Dubner expose match fixing in sumo using statistical analysis. It was a fun read and showed that analytics have the ability to expose cheating in sports without human intelligence. It was also a safe sport to look at because Americans do not really care about sumo, nor do they bet on it.

It is interesting to me that there exists so much data and analysis aimed at prediction in sports, but I have found nothing on using predictive analytics to discover point shaving or game fixing. I realize that doing this kind of work in team sports would be more complicated than in something like sumo. I just feel that it might be another tool to add to the effort to deter this kind of problem. At first pass, it seems the most likely sign of potential cheating is when a player's statistics in a game are an outlier and the team did not beat the spread. The problem that I see is the sparsity of known point shaving in games. For example, I only know of one allegedly fixed game in the 2010 NCAA basketball season, out of something like 5,000 games. I do not think that is enough to be useful. Sad to say, if there were more fixed games we might be able to build a better model. Someone suggested to me that I would get better data on game fixing if I looked at Italian football. However, I call it soccer, and I care about Italian football about as much as I do about sumo.
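
To make the first-pass idea concrete, here is a rough Python sketch of the screen described above: flag a game when a player's scoring is an outlier for that player and his team also failed to cover the spread. Everything in it is an assumption for illustration. The game log is invented, the z-score test and the cutoff of 2.0 are just one simple way to define "outlier", and a flag is only a reason to look closer, not evidence of anything.

```python
from statistics import mean, stdev

# Invented game log for one player: (game id, player points,
# team margin of victory, points the team was favored by).
GAMES = [
    ("g1", 15, 10, 6), ("g2", 17, 12, 9), ("g3", 14, 7, 4),
    ("g4", 16, 1, 5), ("g5", 18, 15, 11), ("g6", 15, 9, 7),
    ("g7", 13, 6, 3), ("g8", 16, 8, 6), ("g9", 2, 3, 8),
    ("g10", 17, 11, 10),
]


def flag_suspicious_games(games, z_threshold=2.0):
    """Return ids of games where the player's points are an outlier
    (|z-score| >= z_threshold) AND the team failed to cover the spread."""
    points = [g[1] for g in games]
    mu, sigma = mean(points), stdev(points)
    flagged = []
    for game_id, pts, margin, spread in games:
        z = (pts - mu) / sigma if sigma > 0 else 0.0
        failed_to_cover = margin < spread
        if abs(z) >= z_threshold and failed_to_cover:
            flagged.append(game_id)
    return flagged


flagged = flag_suspicious_games(GAMES)
print(flagged)
```

In this made-up log, game g4 also fails to cover but the player's scoring is ordinary, so only the game that meets both conditions is flagged. The sparsity problem still applies: with almost no confirmed fixed games, there is no good way to tune the cutoff or to measure how many false alarms a screen like this raises.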

So I have no data to present or model to put forth. I just hate cheaters, and this story has bothered me since it came out on April 11th.

Tuesday, April 12, 2011

Analytics, Sabermetrics, Data Mining...Why can't we all just get along?

Sabermetrics is a term coined by Bill James to describe the analysis of baseball through objective evidence. Saber, or more accurately SABR, stands for the Society for American Baseball Research. With Bill James as its advocate, Sabermetrics has changed the way baseball is played, which is no easy task in a sport so encumbered by tradition. Baseball probably collects more data during a game than any other sport, and each team plays at least 162 games a year. That is rich data territory compared to the 16 regular season games played in the NFL. Sabermetrics has taken a hard look at the core beliefs about which statistics make a good baseball player or team and run them against the cold judgement of analytics. The results showed that some previously treasured statistics, like batting average, were not as important as once thought, while others, like on-base percentage, were better indicators. This is predictive analytics at its best. So it is time to call Sabermetrics what it is: analytics.

It is funny that, for all the impact Sabermetrics has had on baseball, I believe it is still limited by the traditions of baseball. Let me give you some examples.

The blog Sabermetric Research talks about Buck Showalter changing the way his base runners play to gain 5 runs per year, which he claims is worth $10 million. That makes sense if the data he is using is good, but the key here is that the decision is claimed to be made solely on the numbers.

Pitching is another story. In baseball a starting pitcher must pitch five full innings in order to earn a win. Many talk about the difference between the ERAs of starting versus relief pitchers. The data clearly shows that pitchers working in relief, even when they are the same person, have an overall ERA at least .50 lower than when starting. Tango on Baseball touches on the subject in this article. My question is this: if relief pitchers have a better ERA than starting pitchers, and starters are generally accepted to be better pitchers than relievers, why aren't starters being used like relievers? The impact would be huge! A quick pass says this .50 ERA reduction for starting pitchers would result in 40 fewer runs allowed by a team over the course of a season! Using Showalter math, that is $80 million. I believe the reason that this is not looked at as a solution is tradition. If starting pitchers were used like relievers they would never pitch 5 innings, and therefore would never earn a win. This would be a fundamental change in the way baseball is played.
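
The quick pass above can be written out as arithmetic. This is a small Python sketch, not a model: ERA is earned runs per nine innings, so the runs saved equal the ERA drop times innings divided by nine. The 720 starter innings is my assumption, picked because it reproduces the 40-run figure (roughly four and a half innings per start over 162 games). The dollar value per run is simply Showalter's $10 million divided by his 5 runs.

```python
def runs_saved(era_drop, innings):
    """Earned runs saved over `innings` when ERA falls by `era_drop`.

    ERA is earned runs allowed per nine innings pitched.
    """
    return era_drop * innings / 9.0


# ASSUMPTION: innings thrown by starters in a season. 720 is chosen
# because it reproduces the post's 40-run figure; it is not measured data.
STARTER_INNINGS = 720

# "Showalter math": 5 runs claimed to be worth $10 million.
DOLLARS_PER_RUN = 10_000_000 / 5

saved = runs_saved(0.50, STARTER_INNINGS)
value = saved * DOLLARS_PER_RUN

print(f"runs saved: {saved:.0f}, value: ${value:,.0f}")
```

The result scales linearly with the innings assumption, so a staff whose starters carry more of the load would save proportionally more runs under the same .50 drop.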

In defense of Sabermetricians, there has been some discussion that ERA, like BA, is not a very useful statistic. This would mean that conclusions drawn from those statistics may not be as useful as they appear. I have not seen anything on starters versus relievers in terms of CERA, dERA, DICE or DIPS.

Monday, April 11, 2011

The Most Boring Day in the Last Hundred Years

I was driving in to work this morning and listening to the radio because I feel that distractions make me a better driver. A news story came on telling me that a Cambridge researcher has found out that April 11, 1954 is the most boring day in the last 100+ years. I had a good laugh thinking that it isn't only the US government that gives out silly grants for pointless research (remember the "which came first, the chicken or the egg" paper?).

So I dug a little deeper. It turns out this is a software guy who just launched his "smart" search engine. I could not find any specifics on how the engine works, which would have been cool, but then I thought this is maybe even cooler than another algorithm. Here is a guy who got his new web site rolled out as a news story in three major newspapers and on NPR just by asking an interesting question about the most boring day. It would be really amazing if the search engine gave him the question after he queried "the most likely answer to a question to be picked up by major news organizations".

The Most Boring Day Article
True Knowledge
Tunstall-Pedoe and his search engine

Monday, April 4, 2011

Why are the Red Sox better today? Sabermetrics or Construction?

I saw an article this morning from an MIT professor who predicted the Red Sox would win 100 games this year. That is a pretty bold statement, since the Red Sox have only won 100 games in a season three times (1912, 1915 and 1946). However, it got me wondering how the Red Sox have become so good in recent history. I often hear comments that credit the payroll or the genius of Theo Epstein. Whenever I am with statistics guys, it is the hiring of Bill James and the use of Sabermetrics that made the difference. I have a third theory to put forth as the major reason for the improvement of the Red Sox in recent history: construction at Fenway. Oddly, this started the same year that Bill James was hired by the Red Sox, 2003.

From 1995 to 2002 the Red Sox had a combined record of 695-582 winning 54.42% of their games. From 2003 to 2010 the Red Sox had a combined record of 749-547 winning 57.79% of their games.


Year      W     L  Winning %    Year      W     L  Winning %
2010     89    73     54.94%    2002     93    69     57.41%
2009     95    67     58.64%    2001     82    79     50.93%
2008     95    67     58.64%    2000     85    77     52.47%
2007     96    66     59.26%    1999     94    68     58.02%
2006     86    76     53.09%    1998     92    70     56.79%
2005     95    67     58.64%    1997     78    84     48.15%
2004     98    64     60.49%    1996     85    77     52.47%
2003     95    67     58.64%    1995     86    58     59.72%
Total   749   547     57.79%    Total   695   582     54.42%


So they got better after 2003, Theo is a genius, and Sabermetrics rules baseball. I am not so sure, and I think we reach those numbers based on a Simpson's paradox. Let me explain. If Sabermetrics had been the driving reason for the improvement of the Red Sox, they would have gotten better not only at home but away as well. They did not. In fact, the Red Sox improved massively at home but got worse on the road. So what is the factor that explains this? In 2003, the same year Bill James was hired by the Red Sox, additional seating was added to Fenway Park for the first time since 1946. It was always known that Fenway was helpful to certain types of hitters and pitchers, and Red Sox teams have always emphasized those players, but I believe that the construction made the park even more biased than it was before.



During the period 1995 to 2002 the Red Sox had a better away record than they did from 2003-2010.

Away Record
Year      W     L  Winning %    Year      W     L  Winning %
2010     40    41     49.38%    2002     51    30     62.96%
2009     39    42     48.15%    2001     41    39     51.25%
2008     39    42     48.15%    2000     43    38     53.09%
2007     40    41     49.38%    1999     45    36     55.56%
2006     35    46     43.21%    1998     41    40     50.62%
2005     41    40     50.62%    1997     39    42     48.15%
2004     43    38     53.09%    1996     38    43     46.91%
2003     42    39     51.85%    1995     43    28     60.56%
Total   319   329     49.23%    Total   341   296     53.53%

For home games it is a very different story:

Home Record
Year      W     L  Winning %    Year      W     L  Winning %
2010     49    32     60.49%    2002     42    39     51.85%
2009     56    25     69.14%    2001     41    40     50.62%
2008     56    25     69.14%    2000     42    39     51.85%
2007     56    25     69.14%    1999     49    32     60.49%
2006     51    30     62.96%    1998     51    30     62.96%
2005     54    27     66.67%    1997     39    42     48.15%
2004     55    26     67.90%    1996     47    34     58.02%
2003     53    28     65.43%    1995     43    30     58.90%
Total   430   218     66.36%    Total   354   286     55.31%
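
The home and away totals can be put side by side with a few lines of Python. This is only a sketch that recomputes the percentages from the totals as given in the tables in this post; the names `RECORDS`, `win_pct` and `split_summary` are made up for the illustration.

```python
def win_pct(wins, losses):
    """Winning percentage on a 0-100 scale."""
    return 100.0 * wins / (wins + losses)


# (wins, losses) totals copied from the home and away tables in this post.
RECORDS = {
    "1995-2002": {"home": (354, 286), "away": (341, 296)},
    "2003-2010": {"home": (430, 218), "away": (319, 329)},
}


def split_summary(record):
    """Home, away and combined winning percentage for one era."""
    home_w, home_l = record["home"]
    away_w, away_l = record["away"]
    return {
        "home": win_pct(home_w, home_l),
        "away": win_pct(away_w, away_l),
        "overall": win_pct(home_w + away_w, home_l + away_l),
    }


before = split_summary(RECORDS["1995-2002"])
after = split_summary(RECORDS["2003-2010"])

for key in ("home", "away", "overall"):
    change = after[key] - before[key]
    print(f"{key:<8} {before[key]:6.2f}% -> {after[key]:6.2f}% ({change:+.2f} points)")
```

On these totals the overall gain of a little over three points is made up of a gain of about eleven points at home and a loss of about four points on the road, which is the split the construction theory rests on.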

MIT economist says Red Sox will win 100 games in 2011