Big Computing: March 2011

Thursday, March 31, 2011

So how busted is your NCAA Basketball Bracket?

I did a great job this year and selected none of the Final Four teams in the NCAA tournament. Since a rough estimate (assuming each game is basically a coin flip) of the chance of getting all four teams right is roughly 1 in 10^21, I did not feel too bad. I wondered how the general population of bracketologists did. Out of the roughly 6 million players in ESPN's Tournament Challenge, only 2 got it right. I shouldn't say only two, because they did something like a quadrillion times better than the rough estimate. This suggests there are much better ways to model the NCAAs than as a series of coin flips, because the coin-flip model assumes each team has an equal chance of winning each game, and that is just not realistic. If it were, there would be a lot of broke bookies, and there are not.

So I looked at some betting lines at the beginning of the tournament, and they gave me odds for this Final Four of about 1 in 10^8. I later found a betting site that came up with roughly the same number. However, the gap between the results of the ESPN group and the betting lines still seems too great. It makes me think there is an additional piece of information needed to account for the fact that this is a tournament and not just a series of games. Looking at the various season schedules and results, I think an argument can be made that teams with winning streaks of 5 or more games have an advantage in the tournament, but how to account for multiple streaks, or in KU's case one really long winning streak, is bogging me down. Next year I will try a model that uses betting odds weighted by in-season streaks. Hey, I cannot do worse than 0 for 4.

http://games.espn.go.com/tcmen/en/frontpage?addata=2011_fantasy_mega_tcmen
http://www.predictionmachine.com/Predictalator/predictions.aspx
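To make the idea concrete, here is a minimal sketch in R of what a streak-weighted betting-line model might look like. Every number, team probability, and streak length below is made up for illustration; the real inputs would come from the betting lines and season results mentioned above.

# Toy sketch of a streak-weighted bracket model (illustrative only).
teams <- data.frame(
  team       = c("Kansas", "Ohio State", "Duke", "UConn"),
  win_prob   = c(0.30, 0.28, 0.25, 0.10),  # placeholder probabilities implied by betting lines
  max_streak = c(12, 8, 6, 9)              # placeholder longest in-season winning streaks
)

# Give teams with streaks of 5 or more games a modest bonus,
# then renormalize so the adjusted probabilities sum to one.
streak_bonus   <- ifelse(teams$max_streak >= 5, 1 + 0.02 * teams$max_streak, 1)
adjusted       <- teams$win_prob * streak_bonus
teams$adj_prob <- adjusted / sum(adjusted)
teams

The 0.02-per-game bonus is arbitrary; choosing a sensible weight, and deciding how to handle multiple streaks, is exactly the part that is bogging me down.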

Tuesday, March 29, 2011

What happens when you know you are being watched?

A couple of weeks ago a friend convinced me to add foursquare to my phone. I had been reluctant to add the app because I felt it was simply providing the company with free data to analyze and market. I gave in and added it anyway. However, I wasn't about to enter only my real check-ins. I mean, why give foursquare clean data to play with? So I checked into all sorts of odd places. Within a day my friend emailed me to be careful about where I checked in. It got me wondering why. I mean, who cares? About a week later I came across a tweet by SEOGoblin about being a surveillance society. It followed along the same lines as every predictive analytics talk I have been to in the last two years: these massive collections of data are enabling companies to better predict our actions.

I began to wonder what the impact is of so much personal data being collected when most people know they are being watched or analyzed. I know I have changed my behavior because I know that data is being collected, and I am willing to bet most other people have too. That led me to the idea that all the data being collected on the internet, or in social media sites like foursquare, is feeling the influence of the Hawthorne effect. The Hawthorne effect says that people adjust their behavior simply because they know they are being studied. Usually you account for this by having a control group, but where is the control group when the population is internet traffic and everyone knows they are being monitored?

The impact of this is clear: if people are changing their behavior because they are being studied, or are providing information that they believe will produce positive results for those studying them, then the predictive analytics built on that data could be distorted or misleading.




http://www.goodgearguide.com.au/article/380932/big_data_drive_surveillance_society/
http://en.wikipedia.org/wiki/Hawthorne_effect

Thursday, March 24, 2011

Growth of R users

In 2005 I was asked to determine the number of R users. It was like chasing ghosts. I could find lots of articles and posts about the explosive growth of R, but no numbers and no hard data. Finally I ran across a thread from some statisticians trying to determine the same thing in 2002. Their solution was to build a predictor model based on the number of unique user posts on R-help. I liked it, and it was the only rational idea I had seen on how to arrive at a total number of R users. I adopted their model and extended it to 2006, ending up with an estimate of roughly one million users, and never really revisited the subject. Lately I have seen more and more discussion about the number of R users. I loved the paper by Robert Muenchen recently posted on r4stats, in which he tries to determine the popularity of various analytics packages. His paper made me think that the most important measure of the various packages may not be the number of users or even the popularity of the package, but rather the amount of analytics done in that package. Anyway, in my current job no one asks me about this topic, but I still find it interesting.

http://sites.google.com/site/r4statistics/popularity
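For what it is worth, the flavor of that R-help-based estimate can be sketched in a few lines of R. The poster counts and the posters-to-users ratio below are placeholders, not the numbers from the original thread or from my extension of it.

# Illustrative sketch: estimate total R users from unique R-help posters.
years   <- 2002:2006
posters <- c(1200, 1800, 2600, 3700, 5200)   # placeholder counts of unique posters per year

# Fit a simple exponential growth model on the log scale.
fit         <- lm(log(posters) ~ years)
growth_rate <- exp(coef(fit)[["years"]]) - 1

# Assume only a small fraction of users ever post to R-help;
# this ratio is the big unknown in any estimate of this kind.
users_per_poster <- 200                       # assumed ratio, purely illustrative
estimated_users  <- tail(posters, 1) * users_per_poster

growth_rate
estimated_users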

Friday, March 18, 2011

How does beer relate to Scraping Data?

Sometimes I find myself involved in projects as a result of having just one too many rounds of beer after a meetup. Such is the case with the beer predictor. In an overt attempt to ingratiate ourselves with the local craft brewers, we thought it would be a good idea to write some analytics on craft beers to get noticed by those brewers so that they would in turn supply us with free beer. At least we had a reasonable goal in mind. I will update that project as we go along. The first part of the project was to scrape the data from Beer Advocate. I had not started doing this because it was St. Paddy's Day, and I felt the proper way to study beer on that day is consumption. Others wrote code and finished that portion of the project. I did come across this blog post on scraping data in R with the XML package versus in Python with Beautiful Soup, which I thought was interesting.

http://thelogcabin.wordpress.com/2010/08/31/using-xml-package-vs-beautifulsoup/
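For reference, the R side of that comparison looks roughly like the sketch below. The URL and the table index are placeholders; the real Beer Advocate pages would need their own parsing logic.

# Minimal sketch of scraping an HTML table with the XML package.
library(XML)

url  <- "http://beeradvocate.com/lists/top"   # placeholder URL; actual page structure may differ
page <- htmlParse(url)

# readHTMLTable() pulls every <table> on the page into a list of data frames;
# pick out the one that actually holds the beer rankings.
tables <- readHTMLTable(page, stringsAsFactors = FALSE)
beers  <- tables[[1]]                         # assumed: first table is the one we want
head(beers)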

Tuesday, March 15, 2011

Simpson's paradox - My first post will always be a paradox

Simpson's Paradox - When Big Data Sets Go Bad
It's a well-accepted rule of thumb that the larger the data set, the more reliable the conclusions drawn. Simpson's paradox, however, slams a hammer down on that rule, and the result is a good deal worse than a sore thumb. It demonstrates that a great deal of care has to be taken when combining small data sets into a large one. Sometimes the conclusions from the large data set are exactly the opposite of the conclusions from the smaller sets, and unfortunately the conclusions from the large set are also usually wrong.
To understand this effect we'll use a set of simulated data. Table 1 shows the average physics grades for students in an engineering program. This is a difficult class used to weed out weaker students. Most of these students prepared for college by taking high school (HS) physics, and the data show a ten point advantage for those who did. Table 2 shows the average physics grades for students in a liberal arts program. This class is designed as an elective for the enrichment of students who would otherwise avoid physics. Few of these students prepared by taking HS physics, but those who did have the same ten point grade advantage. In both classes, taking HS physics clearly produced an advantage.
We now combine the data sets. The combined results for students who took HS physics are shown in Table 3. The average college physics grade has been determined by adding all the grade points (4475) and then dividing by the total number of students (55). Table 4 shows the same results for the students without HS physics. Tables 3 and 4 indicate that students who took HS physics perform worse than those who didn't by 2.3 points. This is the opposite of the conclusion drawn from Tables 1 and 2.
Obviously, combining the data sets gives a misleading picture, but why? The answer lies in two parts. First, the data sets for the two major groups (engineering and liberal arts students) were influenced by a lurking variable: course difficulty. The engineering students received a rigorous course; the liberal arts students received a less demanding enrichment course. Second, the groups in the data sets were not the same size. This caused the combined average for students who had taken HS physics to be weighted toward the engineering students' grades, and since the engineering course was more rigorous, it lowered the average. The opposite was true for the combined results of those who didn't take HS physics.
            HS Physics   None   Improvement
Students    50           5      ---
Ave Grade   80           70     10
Table 1. Average college physics grades for students in an engineering program.

            HS Physics   None   Improvement
Students    5            50     ---
Ave Grade   95           85     10
Table 2. Average college physics grades for students in a liberal arts program.


              # Students   Grades   Grade Pts
Engineering   50           80       4000
Lib Arts      5            95       475
Total         55           ---      4475
Average       ---          81.4     ---
Table 3. Average college physics grades for students who took high school physics.

              # Students   Grades   Grade Pts
Engineering   5            70       350
Lib Arts      50           85       4250
Total         55           ---      4600
Average       ---          83.6     ---
Table 4. Average college physics grades for students who didn't take high school physics.
There were four separate groups in the study as follows:
  1. Engineering students with HS physics
  2. Engineering students without HS physics
  3. Liberal arts students with HS physics
  4. Liberal arts students without HS physics
If all four groups had been the same size, the results would have indicated that students with HS physics had a 10 point advantage in their college physics grades regardless of which college physics course they took. Likewise, if an average had been calculated that was not weighted by group size, the results would also have indicated the same 10 point advantage.
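The whole reversal is easy to reproduce from the tables above; a few lines of R make both the pooled and the unweighted averages explicit:

# Reproduce the Simpson's paradox example from the tables above.
grades <- data.frame(
  program    = c("Engineering", "Engineering", "LiberalArts", "LiberalArts"),
  hs_physics = c("Yes", "No", "Yes", "No"),
  n          = c(50, 5, 5, 50),
  avg_grade  = c(80, 70, 95, 85)
)

# Within each program, HS physics is worth 10 points.
with(grades, tapply(avg_grade, list(program, hs_physics), mean))

# Pooling with size-weighted averages reverses the conclusion
# (No HS physics: 4600/55 = 83.6, HS physics: 4475/55 = 81.4).
with(grades, tapply(n * avg_grade, hs_physics, sum) / tapply(n, hs_physics, sum))

# An unweighted average of the four group means keeps the 10 point advantage.
with(grades, tapply(avg_grade, hs_physics, mean))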
Conclusions
Simpson's Paradox is caused by a combination of a lurking variable and data from unequal-sized groups being combined into a single data set. The unequal group sizes, in the presence of a lurking variable, can weight the results incorrectly and lead to seriously flawed conclusions. The obvious way to prevent it is not to combine data sets of different sizes from diverse sources.
Simpson's Paradox will generally not be a problem in a well-designed experiment or survey if possible lurking variables are identified ahead of time and properly controlled. This includes eliminating them, holding them constant for all groups, or making them part of the study.