Big Computing: July 2011

Monday, July 25, 2011

Finding an R consultant website

I work for Big Computing, an analytics and HPC consulting company. We have a website, and so do many of the other consulting firms that service analytics platforms like R (although I am tempted to call it Rstats to make it easier to Google). The funny thing is that no one visits these consulting companies' websites. Truth be told, we, like many of our competitors, get our customers by word of mouth. I never really thought about it until recently.

On Friday I had a conversation with some people from Hubspot up in Boston. They provide free, fun analytics tools like twittergrader and webgrader. They also sell a commercial software package to help people optimize their websites. I had never really thought about doing that for Big Computing because our customers always find us, so we do not do much with the website. It got me thinking that I should spend a little time building up the website to see whether, in the very niche world of R (#rstats) consulting, it actually yields not only traffic but customers. I will update on this periodically.

Since this is the first post on the subject, I will give you an idea of where I am starting with www.bigcomputing.com. We have a website, and it looks cool. It has been up for a while, but that is about it. It is ranked about 6,000,000th on the web. To put that in perspective, my fun blog ranks about 500,000th on the web. Basically the only people who currently go to the company website are the people who access it directly or who made a typo. Wish me luck.

Friday, July 22, 2011

Friday Musing - Hot temperatures result in ejections in MLB, but what happens to the game?

In a previous post I wrote about a research paper that found a relationship between hot temperatures and batters getting hit by pitches. Yesterday I came across an old article finding that baseballs fly farther in hot temperatures (about 2%). Finally, this morning I saw a post saying that managers are more likely to get ejected when the temperature rises. All this is very interesting to read when the East Coast is suffering from an oppressive heat wave. In fact, I am trying to think of a way to get myself ejected to somewhere cooler than here right now. However, it got me wondering whether the games themselves were different from games played in less draining conditions.

So I looked at total runs scored in the American League East during July, the hottest month of the year, in 2011 compared to the rest of the year. Total runs scored per game in July was 9.7 versus 8.9 the rest of the year. This heat wave may be a factor. It is absolutely a factor in why I have been watching the games on TV with the AC blasting instead of going to the ballpark.

Monday, July 18, 2011

What software is the best at Predictive Analytics Competitions? R, Salford Systems, SAS, SPSS or other?

I was reading my copy of Amstatnews this morning, and I came across an Ad for Salford Systems' SPM. In the Ad they make the claim that "Salford System' tools have dominated the fiercely contested field of data mining competitions for nearly a decade. Since 2000, no other vendor has come close to our record of consistent out-performance." I thought that was a fairly bold claim and did not mesh with my take on what platforms were being used and experiencing success in these competitions. So I decided to dig a little.

Kaggle.com, which is probably the most dominant site for these competitions, posted data on what software its contestants use. The most used platform is R, and Salford Systems is not even on the list. This is hardly surprising given the number of users of open source R versus the number of users of commercial Salford Systems. My understanding, based on conversations with the people at Kaggle.com, is that the majority of their contests are won with R. In fact, according to Revolution Analytics, which supports open source R, R has won 50% of all Kaggle.com competitions.

So the software vendors are all claiming that they are the best. That is not a surprise. That took me back to Kaggle.com's breakdown of competitors. They use things like regression and SVM. I do not think regression on one analytics platform should yield different results than the same computation on another platform. In fact, if it did I would be concerned. Maybe the credit for winning these competitions should go to the competitors who come up with the winning solutions rather than the tools they choose to use.

Saturday, July 16, 2011

Android loses developers to iPhone/iPad- Another example of why Pie Charts suck



Here is another example of the abuse of data and blurring of information by using a misleading chart. Yesterday I stumbled across a Flurry article titled iPad2 and Verizon iPhone Take Some Wind Out of Android's Sail. The article says that although Android's daily activations grew from 300,000 in December to 500,000 in June, the platform lost developer support. Then up come the pie charts for new project starts on the various platforms.

[Figure: Flurry new project starts, Q1 vs. Q2 2011]


Oh my god! The sky is falling! That is a 22% decline in Android Development Projects in only 90 days! Right?  The article goes on to explain this decline in Developer support:


Studying the numbers, it’s readily apparent that Android has lost developer support to iOS. Specifically, Android new project starts have dropped from 36% in Q1 to 28% in Q2. Overall, total Flurry iOS and Android new project starts grew from 9,100 in Q1 to 10,200 in Q2.

And there is the disconnect. This 22% drop in Android's share of new projects translates into only about a 12% drop in the actual number of Android projects started in the quarter, because the total number of projects grew. A pie chart is just a bad choice to tell the tale, and it seems an odd choice when the only factor graphed is the one number for Android that did not do well compared to the iPhone/iPad. There is no cute graph comparing Android's 500,000 activations per day versus iPhone/iPad's 325,000 activations per day, with Android activations expanding at a higher percentage and actual unit rate. Flurry may be a little fuzzy on the data.
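The gap between the two percentages can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted from the Flurry article (36% and 28% shares, 9,100 and 10,200 total project starts); the variable and function names are my own. Because the published shares are rounded, the count-based drop comes out near 13% here, close to the figure cited above, and the exact value depends on unrounded shares that the article does not give.

```python
# Figures quoted from the Flurry article (shares are rounded percentages).
q1_total, q2_total = 9_100, 10_200   # iOS + Android new project starts
q1_share, q2_share = 0.36, 0.28      # Android's share of those starts


def relative_change(old, new):
    """Fractional change from old to new (negative means a decline)."""
    return (new - old) / old


# Approximate number of Android project starts in each quarter.
q1_android = q1_share * q1_total
q2_android = q2_share * q2_total

share_change = relative_change(q1_share, q2_share)
count_change = relative_change(q1_android, q2_android)

if __name__ == "__main__":
    print(f"Android starts: about {q1_android:.0f} in Q1, {q2_android:.0f} in Q2")
    print(f"Change in share of starts:  {share_change:.1%}")
    print(f"Change in number of starts: {count_change:.1%}")
```

The share falls by roughly a fifth, but the pie is bigger in Q2, so the number of Android projects falls by much less. A pie chart can only show the first of those two numbers.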


Here are two good blog posts explaining the problems with pie charts:
How Pie Charts Fail
Three Reasons Why Pie Charts Suck

Friday, July 15, 2011

Figuring out the number of R users in comparison to SAS, SPSS and all

I have written about this problem before and have tried various approaches to come up with an estimate. I have quoted the article The Popularity of Data Analysis Software by Robert A. Muenchen before. It is a great article, and Muenchen maintains it as a living document, constantly updating its contents. If you have not looked at it in a while, it is worth another look.

There are two areas of Muenchen's article that I would like to talk about: Internet searches and job postings. For R both of these measures are problematic and may underreport the number of Internet searches or jobs posted.

A person can effectively search for SAS, SPSS or Stata, but not really for R. The problem is that R is a single character that is heavily used (Toys R Us, R Kelly, R rated movies, etc.). Therefore a number of the searches for R may be for other things, and genuine searches for R are more likely to have other terms in the search line than searches for SAS, SPSS and the rest. This results in a disconnect in the search results no matter how you define it. I will say that as the number of R users has grown in the last few years this has improved. I ran the Google Adwords for the website at Revolution Computing, and this proved to be a challenging problem to crack. It took a lot of thought and refinement that would not be required if R had a more unique name. Another search term to stay away from in Google Adwords is BI. It does not always mean Business Intelligence.

Job searches have a similar issue, but with an added twist. Rarely do job posters list the requirement as solely an "R" programmer, and rarely do people describe their skill as "R" programmer. Again, I ran into this problem while trying to find talent for Revolution Computing. Job posters and job seekers often list their "R" skill as "R/S", "R/S+", "R/SPlus" or "R/S-Plus" and numerous other permutations of that. Again, R's single character name is problematic.

On Twitter this has been addressed by using the hashtag #rstats instead of R. The difference in ease of access and elimination of confusion has been huge. If R would ever expand its name to something like Rstats, it would radically improve the quality of searches for information and help employers find employees. Just a thought.
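To make the matching problem concrete, here is a minimal sketch of how one might scan posting text for the variants listed above. The pattern and the helper name `find_r_mentions` are hypothetical, written only for illustration; they are not taken from any job site or from Muenchen's article, and they cover only the four spellings mentioned in this post.

```python
import re

# Hypothetical pattern: a standalone "R", optionally followed by the
# "/S", "/S+", "/SPlus" or "/S-Plus" suffixes mentioned in the post.
R_SKILL = re.compile(r"(?<!\w)R(?:/S(?:\+|-?Plus)?)?(?!\w)")


def find_r_mentions(text):
    """Return every substring that looks like an R (or R/S-Plus) mention."""
    return R_SKILL.findall(text)


if __name__ == "__main__":
    samples = [
        "Seeking R/S-Plus programmer",
        "Skills: SAS, SPSS, R/S+",
        "Cashier at Toys R Us",
    ]
    for line in samples:
        print(line, "->", find_r_mentions(line))
```

Even this sketch matches the "R" in "Toys R Us". A single-character name cannot be told apart from unrelated uses without extra context, which is exactly the problem described above.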

Thursday, July 14, 2011

Yankees and Tigers are the best and worst Run Differential MLB teams of the decade

I watched the Home Run Derby and the MLB All Star Game, but I can not say that I was much interested in either one of them. They just do not feel important to me. It did get me thinking about what is important in baseball. Run differential is probably the most telling historical statistic in baseball. If you score more runs than your opponent, then you should win more games.

In the last decade the Yankees are the team with the largest average run differential, and the highest in a single season belongs to the Red Sox at +210 in 2007. The Detroit Tigers have the most unfavorable run differential for a season in the last decade, with a stunning -337 in 2003. That is simply amazing.

Bill James showed that (runs scored)^2 / ((runs scored)^2 + (runs allowed)^2) * 162 is a pretty good predictor of a team's record for the regular season. TwinsGeek gives a short blog on that here. However, in baseball the prize is the playoffs. In that vein I ran across this blog post by Bill Petti. It got me thinking that maybe in baseball, like football, offense pads your regular season record, but defense wins championships. Following that I found this post looking at the relationship between team payroll and run differential. I have to admit it makes a strong case for implementing some type of payroll controls in baseball. Otherwise Major League Baseball could end up looking like the Harlem Globetrotters versus the Washington Generals year after year.

However, there is a problem with that. Teams with a top-ten payroll have won the World Series only a third of the time in the last nine years, while teams with payrolls ranked 11th to 20th have won five of the last nine, and a team from the lowest third of payrolls has won one. Compare that to only two champions ranked outside the top ten in runs allowed (the 2009 Yankees and 2007 Red Sox), and I think there is a story to be told here. High payroll teams can and do win the World Series with offensive power, but middle and low payroll teams win it with defense. I can only assume this is because under current market conditions it is cheaper to build a strong defensive team than a strong offensive team. So far in 2011 the Yankees and the Red Sox have the best run differentials, while Philadelphia has allowed the fewest runs.
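For anyone who wants to play with the Bill James formula, here is a minimal sketch. The function name `pythagorean_wins` and the 800/700 run totals in the example are made up for illustration; the exponent of 2 and the 162-game season come straight from the formula above.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2):
    """Expected wins from the Bill James formula quoted above."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    if scored + allowed == 0:
        raise ValueError("need at least one run scored or allowed")
    return games * scored / (scored + allowed)


if __name__ == "__main__":
    # Hypothetical team: 800 runs scored, 700 allowed over a full season.
    print(f"{pythagorean_wins(800, 700):.1f} expected wins")
```

A team that scores exactly as many runs as it allows comes out at 81 wins, which is a .500 record over 162 games.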

Hope you enjoyed the All Star Break and are ready for the second half of the season.

Thursday, July 7, 2011

Stonebraker speaks out on Facebook, MySQL and VoltDB

Today I ran across an article where Michael Stonebraker of MIT calls Facebook's dependence on MySQL "a fate worse than death". Why can't all tech visionaries be as quotable as Stonebraker? I quote Michael Stonebraker a lot. He is a smart guy and a quote producing machine. I loved it when he called Mapreduce/Hadoop a step backward. I am waiting to hear more about his SciDB project, which he is planning to link to R. The enterprise version of SciDB is sold through his company Paradigm4.

Of course Michael Stonebraker has a solution for Facebook's terrible problem: NewSQL, or as he calls his product, VoltDB. Is it the cure for all that ails the data overwhelmed world of internet companies? I can not tell you. However, as long as Michael Stonebraker is leading the charge and putting out the quotes, I will be there to read it with a smile.

Article: Facebook trapped in MySQL "fate worse than death".
Article on VoltDB: PostgreSQL vs VoltDB

Wednesday, July 6, 2011

Twitter Sentiment

I know I continue to beat the Twitter sentiment thing to death, but there is something about it that just does not sit right with me. I think there has been good research that finds Twitter feeds can be text mined for words like "colds" or "flu" to predict local flu outbreaks. I am not sold that Twitter is a good source for sentiment gathering on investment strategy. I do not think it works because microblogs like Twitter are a random stream of consciousness provided by a population only loosely connected to the markets that investors are trying to predict. I believe a more relevant population with proof of action on their sentiment would be a much better indicator of trends. That is why I am a much bigger fan of using sources like Reddit. I wonder how the Twitter Sentiment Fund is doing. I can not find any numbers, but that is not really a surprise given that it is a private fund.

I will close this short post with a video called Twitter for Math Nerds. I wonder if the Twitter Sentiment Fund is heavily invested in Justin Bieber.

Tuesday, July 5, 2011

Does Attendance Affect Winning in MLB?

I started messing with this idea over the weekend because I like to watch TV and play with the computer at the same time. I was looking at various stats on MLB on the ESPN website, and I ended up looking at the various teams' home field attendance numbers. There really are few surprises in the top ten: half are American League and half are National League, with all except the Twins, Cubs and Dodgers having good seasons.

The bottom ten was more of a surprise to me, with 70% of the teams with poor attendance coming from the American League and the division-leading Cleveland Indians right there in 26th place. The Indians draw poorly both home and away.

This got me wondering whether there is an impact from the mythical 11th man (or 10th man in the National League). So I thought I would look at whether variations in attendance affect the Indians' performance on the field. The table below shows the results for the first half of the 2011 season. It is interesting to note that the Indians have a much better record when attendance is less than 10,000 (6-1) or more than 30,000 (7-1) than they do when attendance is between 10,000 and 20,000 (10-8) or between 20,000 and 30,000 (2-4). Correlation is not causality, but at least for this year the Indians do better when the 11th man shows or stays home, and worse when he kinda shows.


Attendance  Runs Scored  Runs Allowed  Run Diff
8726        7            1             6
9025        3            1             2
9076        8            2             6
9523        8            4             4
9650        9            4             5
9722        7            2             5
9853        3            8             -5
10594       1            0             1
10714       8            3             5
13017       4            2             2
13551       5            4             1
14164       5            4             1
15224       7            8             -1
15278       4            6             -2
15336       2            11            -9
15397       4            7             -3
15498       1            0             1
15568       9            5             4
15849       2            3             -1
15877       3            4             -1
16336       2            8             -6
16346       8            2             6
17568       4            3             1
18107       4            7             -3
19225       3            2             1
20261       0            2             -2
23752       2            4             -2
26408       2            14            -12
26433       3            2             1
26833       12           4             8
27458       0            4             -4
30023       5            2             3
31622       5            4             1
31865       5            1             4
33774       5            4             1
38549       5            1             4
40631       2            1             1
40676       6            3             3
41721       10           15            -5
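The win-loss records by attendance band can be recomputed directly from the rows listed in the table. The sketch below is one way to do that tally; the band boundaries follow the ones used in the post, while the names `GAMES`, `BANDS` and `records_by_band` are my own. A game counts as a win when runs scored exceed runs allowed.

```python
# (attendance, runs scored, runs allowed), copied from the table above.
GAMES = [
    (8726, 7, 1), (9025, 3, 1), (9076, 8, 2), (9523, 8, 4), (9650, 9, 4),
    (9722, 7, 2), (9853, 3, 8), (10594, 1, 0), (10714, 8, 3), (13017, 4, 2),
    (13551, 5, 4), (14164, 5, 4), (15224, 7, 8), (15278, 4, 6), (15336, 2, 11),
    (15397, 4, 7), (15498, 1, 0), (15568, 9, 5), (15849, 2, 3), (15877, 3, 4),
    (16336, 2, 8), (16346, 8, 2), (17568, 4, 3), (18107, 4, 7), (19225, 3, 2),
    (20261, 0, 2), (23752, 2, 4), (26408, 2, 14), (26433, 3, 2), (26833, 12, 4),
    (27458, 0, 4), (30023, 5, 2), (31622, 5, 4), (31865, 5, 1), (33774, 5, 4),
    (38549, 5, 1), (40631, 2, 1), (40676, 6, 3), (41721, 10, 15),
]

# Attendance bands as (lower bound inclusive, upper bound exclusive, label).
BANDS = [
    (0, 10_000, "under 10,000"),
    (10_000, 20_000, "10,000-20,000"),
    (20_000, 30_000, "20,000-30,000"),
    (30_000, float("inf"), "over 30,000"),
]


def records_by_band(games):
    """Return {band label: (wins, losses)} for the given game rows."""
    tally = {label: [0, 0] for _, _, label in BANDS}
    for attendance, scored, allowed in games:
        for low, high, label in BANDS:
            if low <= attendance < high:
                # Index 0 counts wins, index 1 counts losses.
                tally[label][0 if scored > allowed else 1] += 1
                break
    return {label: tuple(counts) for label, counts in tally.items()}


if __name__ == "__main__":
    for label, (wins, losses) in records_by_band(GAMES).items():
        print(f"{label:>14}: {wins}-{losses}")
```

With only 39 games split four ways, each band holds a handful of results, so these records are worth reading as a curiosity rather than as evidence of a real attendance effect.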