Thursday, June 30, 2011
So I am getting ready for the Fourth of July like everyone else. Independence Day has a dual meaning for me because not only do I celebrate the birth of our nation on that day, but I also celebrate my wedding anniversary. So the Fourth of July is a celebration of Liberation and Indentured Servitude on the same day. I try to stay positive. In that spirit I hunted up some visualizations of Independence Day (there were not as many as I expected).
The first is one of two that I found on another blog. It is a fairly standard one on the history of our flag.
The second one is on Fireworks. Who doesn't love Fireworks?
The science and art that go into fireworks to produce those sounds, colors and shapes are just amazing!
On the more humorous side, Josh Sundquist did a video called "4th of July for Math Nerds" which is pretty funny. I especially like the part where he says we celebrate our country by breaking its laws.
I was searching for more, but I could not find any I liked. Then I came across a map of last year's oil spill spread.
I know this is not an uplifting video. However, it does remind me that despite the challenges of 2011, we are in a better place than 2010, and there is reason to celebrate. So have a great weekend and enjoy the Fireworks!
Tuesday, June 28, 2011
Lightning Talks at R User Group Meetups
Recently there has been an increase in the number of lightning talks at R user group meetups. I first saw this type of talk at the R/Finance 2011 conference in Chicago at UIC. They were without a doubt the best group of talks at that conference. Lightning talks let a group cover a wide range of topics in a shallow dive that the listener can investigate further on his or her own later.
Lightning talks also provide a great format for presenting some of the more basic topics needed to develop new R users and to remind the more experienced user of good practice. I have become slightly concerned of late that the R meetups were doing presentations at such a technical level that we were potentially alienating an important group in the R community: the newer user.
I am not sure whether other people agreed with me or it is just a coincidence, but lightning talks have swept through the Meetup world like a firestorm in the last month. The NYC Predictive Analytics Meetup did a series of lightning talks, as did the NYC R Users Meetup. I believe they have already posted the presentations at the links I have provided. The files and videos posted on these Meetup sites are a valuable resource that you should use even if you did not attend the presentation.
On July 13 the Greater Boston Area R User Group is doing a series of five lightning talks on predictive analytics in R, the googleVis package, R image clustering and 3D visualizations, Rgraphviz, and biomaRt. On July 12 the Los Angeles R User Meetup is doing three quick talks on R for multilevel modeling, unit testing and debugging in R, and calling C from R.
Another thing to note is that a number of these lightning talks featured a presentation by the guys from RStudio. I think their IDE has quickly become the de facto user interface for R. It has shown itself to be useful to the experienced user and an invaluable tool for flattening the learning curve of the new user. A friend of mine used RStudio in a class he taught in the spring, and he was astounded by how much faster his students became capable R users compared to when he taught the same class without RStudio in the past. That is a pretty strong endorsement from a guy who I know prefers to program from the command line.
Monday, June 27, 2011
Boston Bruins $156K bar bill data visualization
The Stanley Cup was up on Martha's Vineyard over the weekend. There was much drinking and celebration to mark its arrival on the small summer party island in Mass., but all that partying paled in comparison to what the Bruins team itself did to a bar at Foxwoods after their victory parade. They ran up a $156K bar bill which included a $100K bottle of champagne! Now that is one highbrow drunken feast.
There has been much talk about this crazy bill. Some have even denied that it was really that much. However, there is a picture of the actual bill online, so I think that can put the doubters to rest. To get your head around what goes into a $156K bar bill, there is nothing better than a visualization. Sixteenwins.com did just that, and it is a great look at what goes into such a massive bill.
The awesomeness that is the Stanley Cup Champion Bruins bar tab.
I believe this even dwarfs the booze consumed at R/Finance 2011 or the Second Continental Congress. Party on Boston! Sober up Bruins!
Thursday, June 23, 2011
Twitter Sentiment Presentations
I have already written about Jeff Gentry's twitteR package, which provides an R interface to Twitter. Last night the Boston Predictive Analytics group gave not one but three presentations on Twitter sentiment analysis. I have always been a big fan of the Greater Boston R Users Group meetup, which is an unbelievable group with outstanding speakers. It looks like the Boston Predictive Analytics group is following in the same outstanding way. There are just a huge number of great thinkers and speakers in the Boston area.
Anyway, back to the three talks. One was a presentation by Jeffrey Breen on how to use Jeff Gentry's twitteR package for R. The second was Shankar Ambady's look at using Python's NLTK for Twitter and other natural language applications, and finally there was the now famous Twitter flu predictor presentation. All great stuff, so take a look at the Boston Predictive Analytics website and go to the next meetup if you are near Boston.
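If you want to play with this yourself, here is a minimal sketch of the Breen-style scoring approach, assuming you have the twitteR package installed; the search term and word lists below are placeholders, not taken from the talks:

library(twitteR)
# pull some recent tweets (search term is a placeholder)
tweets <- searchTwitter("#rstats", n = 100)
texts <- sapply(tweets, function(t) t$getText())
# tiny placeholder word lists; real analyses use much longer opinion lexicons
pos <- c("good", "great", "awesome", "love")
neg <- c("bad", "awful", "hate", "broken")
# score each tweet as (positive matches) - (negative matches)
scores <- sapply(texts, function(txt) {
  words <- unlist(strsplit(tolower(txt), "[^a-z]+"))
  sum(words %in% pos) - sum(words %in% neg)
})
table(scores)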
Great Sabermetrics videos with Bill James, cartoons and a Porn Star
I came across three fun videos on sabermetrics. The first is a collage of things including excerpts from a 60 Minutes piece on sabermetrics. The tunes are not bad either.
This is actually not a bad introduction to sabermetrics, with spots on some of the luminaries of the field. In fact it is the first video I had ever seen of Bill James. I disagree with Sparky Anderson that Bill James is a "fat little bearded guy who knows nothing about nothing". Although he does have a beard.
The second video is just totally hilarious. I played it three times and laughed the whole way through! I need to find a time to say to someone "because my brain is so big and I am so much better than you". Classic! This cartoon is part of a series and may not really teach you anything, but it is funny.
The only thing better than beer and the Swedish Bikini Team is sports and sex. So there is, of course, a video that features Nikki Benz explaining sabermetrics to a couple of guys preparing for their fantasy baseball draft.
I still like the cartoon better. It had a better script, and its characters are better actors than the people in Nikki Benz's video.
The Art of Beer Making and the Development of Statistics
It is not a commonly known fact that beer making played a major role in the development of statistics, but it is true. In the late 1800s and early 1900s, Arthur Guinness & Sons was a very forward-thinking agro-business which hired some of the most promising students out of universities like Oxford.
In 1899 Guinness hired a young chemist and mathematician from Oxford by the name of William Sealy Gosset. Gosset would be the creator of the t-distribution, which is used in Bayesian data analysis and in Student's t-test. Student's t-test is the common method for testing the statistical significance of a difference between two sample means and of coefficients in linear regression analysis. Most of his work at Guinness was confidential because Guinness forbade its employees from publishing any of their work for fear that they would give away trade secrets. However, arguing that the research was of little use to Guinness's competitors, Gosset persuaded Guinness to let him publish under the alias "Student" to hide his relationship with the brewery.
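To see Student's test in action, here is a minimal sketch in R on simulated data (the numbers are invented, not Guinness's):

set.seed(42)
batch_a <- rnorm(20, mean = 5.0, sd = 0.5)  # simulated measurements from one batch
batch_b <- rnorm(20, mean = 5.3, sd = 0.5)  # simulated measurements from another
# classic Student's t-test for a difference in means (assumes equal variances)
t.test(batch_a, batch_b, var.equal = TRUE)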
Beyond Gosset himself, it is amazing the relationships he had with other notable people of his time. All of Gosset's papers were published in Biometrika, which was edited by Karl Pearson. Pearson also helped Gosset with the mathematics of the t-distribution. Apparently Gosset was a better chemist and theorist than mathematician; Albert Einstein would enlist mathematicians for the same kind of help. Pearson's own contributions to statistics are wide ranging, from the correlation coefficient to the p-value to PCA. The first book read by Einstein's study group was Pearson's The Grammar of Science.
Gosset was also a friend of R.A. Fisher, who in 2010 was named the greatest biologist since Darwin. Fisher was not only a biologist but also an important contributor to statistical science, and he is the one who recognized the importance of Gosset's work. Fisher coined the term "null hypothesis", which rolls off the tongue of every classical statistician to this day. He is also known for the F-distribution, maximum likelihood, and the Fisher-Kolmogorov equation.
It is interesting to note that Gosset was friends with both Pearson and Fisher even though both men had large egos and did not like each other. Maybe back then, like today, a man like Gosset with a few beers could assuage egos and befriend even the most unlikely of fellows.
Wednesday, June 22, 2011
Sabermetrics to predict pitching injuries
I just finished The Extra 2% by Jonah Keri on the rise of sabermetrics at the Tampa Bay Rays and the World Series berth it won them. It is a quick read and a great introduction to the business of baseball and how MLB teams have incorporated sabermetrics into their management in recent years.
I met Keri up at the sabermetrics seminar at Harvard, which was a fundraiser for the Dana-Farber Cancer Institute. I believe they are going to do it again next year, and I strongly encourage anyone interested in sabermetrics to go. It was my first time talking to people about sports statistics instead of running models and playing with data. Again, if it happens next year, go!
In the last year at Bigcomputing we have done a great deal of work predicting people's health for hospitals and healthcare companies, with great success. There are a number of predictive analytics competitions trying to develop models for things like whether a patient will be hospitalized within the week, month, or year. The most well known of these is the $3 million Heritage Health Prize being hosted by Kaggle.com. Obviously, with prizes of that size there is real potential to predict things like injury and disease.
Tom Tippett, head analyst of the Boston Red Sox, talked at the seminar about what the Red Sox look at when they evaluate a player for a contract. After the talk, I asked him if there was a predictive component for injury in their sabermetric models for projecting a player's future performance. He said there was not. That surprised me, because I thought it could be done with the vast amount of data collected on the various players.
In Jonah Keri's book the Rays hired a guy who was able to predict pitcher injuries within a short time frame based on Pitch F/X data. Josh Kalk published an article in The Hardball Times called "The Injury Zone". He was later hired by the Rays, where he has continued his work on Pitch F/X data among a mountain of other things. Injury prediction based on the physical and results data openly available in baseball is possible, and maybe even more so with the confidential information teams have access to, like Trackman data and medical and scouting reports. The key is being able to incorporate all of this into a predictive model aimed at injury. This type of risk analysis has a lot of value when you are talking about players who make an average of around $4M a year. I also love that Kalk's boss at the Rays was James Click. Click and Kalk should always work together.
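To be clear, this is not Kalk's actual model, but here is a rough sketch in R of the general idea, with simulated stand-ins for Pitch F/X-derived features:

set.seed(7)
n <- 200
pitchers <- data.frame(
  velo_drop     = rnorm(n),  # hypothetical drop in fastball velocity
  release_drift = rnorm(n))  # hypothetical change in release point
# simulate injury flags so the example is self-contained
p <- plogis(-2 + 0.8 * pitchers$velo_drop + 0.6 * pitchers$release_drift)
pitchers$injured <- rbinom(n, 1, p)
# logistic regression: injury risk as a function of the mechanical signals
fit <- glm(injured ~ velo_drop + release_drift, data = pitchers, family = binomial)
summary(fit)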
Crime Data Visualizations
Two years ago I was at a party with some friends of mine. Most of the people at the party worked for the FBI. As I was wandering around, I struck up a conversation with a couple who both worked for the agency. The guy was an agent, but the woman was an epidemiologist. What? I did not know epidemiologists worked for the FBI. It turns out she worked for the CDC before being recruited by the FBI to apply epidemiological methods to crime: basically looking at crime hotspots and the spread of crime. She was using mostly SPSS at the time, but had also played with some models in R.
In the last few months I have seen a number of different sources play with crime data in visualizations.
At the most basic level, the state of Connecticut does a nice map of the locations of sex offenders. It is basic, with drop pins for each person's location, but it is a good interactive platform. I would include an image, but the site shows a disclaimer first, so I will provide a link here. Trulia recently did an interactive crime heat map. This is a great, simple visualization that lets you drill down to street level from a high-level map of the USA. It is still in beta and only covers a few major cities right now. The New York Times also did a cool scatter plot of murders overlaid on a map of the city.
Drew Conway just did a great time-series heat map of a year of Chicago crime that is fun to watch. Drew also did a predictive murder map of Philadelphia. This takes me back to the epidemiologist at the FBI. I wonder to what level of detail we can predict crime. Could it be to the level that we could prevent some crimes by altering police patrols and enforcement, or would those changes simply shift the areas where crime is committed?
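As a toy version of that kind of map, here is a sketch in R with ggplot2 using simulated incident coordinates rather than real crime records:

library(ggplot2)
set.seed(1)
# simulated incident locations clustered around a city center
crimes <- data.frame(lon = rnorm(500, -87.63, 0.05),
                     lat = rnorm(500, 41.88, 0.05))
# two-dimensional density contours as a crime "heat" layer
ggplot(crimes, aes(x = lon, y = lat)) +
  stat_density2d(aes(fill = ..level..), geom = "polygon") +
  scale_fill_gradient(low = "white", high = "red")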
Tuesday, June 21, 2011
What is the Cloud?
At SIFMA last week I had a meeting with a major cloud vendor who felt the biggest issue they were dealing with was a lack of understanding of what the cloud actually is. I was surprised by this, but I had seen a similar problem with the open source project Hadoop when it became a popular term that few people actually understood.
Yesterday I ran across a thread on LinkedIn that asked how you would describe the cloud to your co-workers. The answers were interesting and diverse. I will share a few here:
"Cloud = Commoditization of IT"
" cloud, in essence is a Fabric which at its core supports the application stack "
"For the end user, I would use the example of google documents, google calendar, gmail. Most people are familiar with Google and you can demo it as well. All the user need is internet access and a browser and they can basically access their "desktop" from anywhere with any internet connected hardware. "
"It is the internet"
Let's not forget the Wikipedia definition of the cloud.
When I think of the cloud, I do not really consider things like Salesforce.com or Quickbooks.com delivering enterprise applications through a simple website. I limit my thinking to the ability to access remote computer time to run applications, a concept that really took off with Amazon's EC2 in 2006. This was a great idea! Amazon had built massive infrastructure to handle the huge computing volumes of its business. However, they noticed that their business volumes were seasonal and a lot of their computers remained idle for much of the year. EC2 allowed them to rent out that capacity and create another revenue stream for themselves. It was a true win/win. It was such a great idea that other vendors like Rackspace, Google, and Microsoft with Azure have entered the business.
The basic idea is sound, and companies can save significant money by outsourcing their peak computing usage rather than maintaining the internal infrastructure to support that need. Because of the economies of scale of these large cloud suppliers, some companies may save money by outsourcing all of their hardware needs.
However, I do see a problem in recent history and on the horizon. Amazon aptly named their service EC2 (Elastic Compute Cloud): it was excess capacity that companies could use. What happens if cloud usage is no longer elastic? In January 2011 Netflix launched on Amazon's EC2, and its volume has grown to 20% of total internet usage at any given time. This load, along with the overall increase in usage of EC2, has resulted in problems, including EC2 service interruptions. I believe these sorts of periodic volume constraints will continue and increase in cloud computing. In the long term I believe it will be addressed the way it has been in the past, with a priority system and many levels of service from the cloud provider, backed in this case by a pricing model.
The Red Sox season is a case of Dr. Jekyll and Mr. Hyde
In the first 36 games of the season the Red Sox went 17-19 while scoring 4.22 and allowing 4.47 runs per game. In the following 35 games the Red Sox went 26-9 while scoring 6.54 and allowing 3.91 runs per game. A 50% increase in runs scored over a span like that is unusual.
Even stranger, in the first 36 games the Red Sox never scored more than nine runs in a game. In the following 35 games they have scored more than nine runs eight times. If I plot the run total frequency of the first 36 games of the season against the following 35 games, I get two totally different distributions.
The standard deviation for the first 36 games was a relatively tight 2.72, while for the following 35 games the standard deviation has ballooned to 4.31. I have never seen so flat a distribution of runs as the Red Sox have had in the last 35 games. Most of the runs-scored distributions I have seen look like chi-squared distributions.
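Here is a minimal sketch in R of this comparison; the Poisson draws are placeholders for the actual game-by-game run totals, which you would paste in instead:

set.seed(11)
# placeholders standing in for the real game-by-game run totals
runs_first <- rpois(36, lambda = 4.22)
runs_next  <- rpois(35, lambda = 6.54)
sd(runs_first); sd(runs_next)              # compare the spreads
par(mfrow = c(1, 2))
hist(runs_first, main = "First 36 games", xlab = "Runs")
hist(runs_next,  main = "Next 35 games",  xlab = "Runs")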
Friday, June 17, 2011
Hadley Wickham video of Data Analysis with R and ggplot2
I ran across a recent video of a talk Hadley Wickham gave at Google. Hadley always gives a good talk, so I hope you enjoy it.
I am an Analytics Meetup Junkie
I am not sure if there is a 12-step program to address this issue, but I think I might form a meetup of meetup addicts (MeetAnon). I am now a member of thirty Meetup groups and attend at least two meetups a week. I even have a collection of Meetup T-shirts.
It all started innocently enough. Years ago, my previous company funded some food for a New York R users group. I did not go at the time, but a year after leaving that company I was interested in what the people at the forefront of R development were working on. So I went to my first meetup. There was pizza and specialty ices, and a great talk by Andrew Gelman. After the meetup, the group went out for drinks and discussions about statistics. I was hooked, but I also needed more than one meeting a month.
I joined the predictive analytics meetup in NYC. I joined the Machine Learning group in NYC. I found that these groups shared a lot of commonality, but also looked at problems from slightly different angles. I started to think about visualizations of problems so I joined the Data Visualization Meetup in NYC. After I added the various Hadoop Meetups in NYC, I exhausted the ones in New York that interested me. I needed to expand my reach.
I joined the RUGs in New Jersey, Boston, Washington DC, and Chicago. I have also added sports groups to my stable of meetups. Right now it is mostly baseball and sabermetrics, but I am sure that will grow as well.
Here are the links to my favorite meetups:
Boston Predictive Analytics Meetup
Greater Boston R User Meetup
NYC R user Meetup
NYC Predictive Analytics Meetup
NYC Machine Learning Meetup
NYC Data Visualization Meetup
Thursday, June 16, 2011
SIFMA Conference
I have spent the last two days at the SIFMA conference in New York. Most of my time was spent going to talks by Sybase partners. One of the speakers was Michael Kane from Yale. Mike has worked with me on a number of projects, and now he is off on a new venture with Casey King. Their product basically allows you to do R analytics from a Sybase CEP. The talk was pretty cool. Mike and Casey's talks have garnered some attention of late: their last talk, at the R/Finance conference on the flash crash, is about to be featured in a Barron's article.
Overall I was impressed with the attendance at SIFMA, and I will do another, more detailed post after I get home and get some sleep.
Tuesday, June 14, 2011
Resources to Learn R for new users
Historically R has had a very steep learning curve, and many people have become frustrated and given up before getting their R skills to the level they need for R to be useful in their field of interest. I am often asked to provide training to new R users, which my current company and my previous company both do. In fact there are many companies that provide R training and do an excellent job of it. The problem is that this type of training is designed for larger groups and is simply not affordable for the single R user who wants to start with R or improve their skills.
There are some great books that can start you off in R. I actually learned R using the Baseball Hacks book by Joseph Adler, which was a fun way to get to the level I needed. I also know a number of people who started with R in a Nutshell, also by Joseph Adler; I have used that book more recently and found it very helpful. The R Cookbook by Paul Teetor has lately been recommended as a great way to start learning R, and again I have heard nothing but good things about it. I have also heard that using RStudio helps to speed up the process of learning R.
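Whichever book you start with, a first R session tends to look something like this sketch, using a dataset that ships with R:

head(mtcars)                          # peek at a built-in dataset
summary(mtcars$mpg)                   # basic descriptive statistics
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight", ylab = "MPG")   # quick scatter plot
abline(lm(mpg ~ wt, data = mtcars))   # overlay a fitted regression line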
The R user groups are also a good resource for help getting started. The Greater Boston R User Group does a presentation before the main talk at each of its meetups dedicated to addressing the needs of newer users, and the slides for these talks are on their meetup site. A number of the people I have worked with used R-help as a tool to assist them in learning R. I tried this a little when I started and it was not really helpful for me, but for people with other styles of learning it may be great.
If you are a member of the ASA, the various chapters periodically do introduction-to-R training for a very low cost. A few months back the Boston chapter of the ASA did a one-day introduction to R for around $25.
Recently David Smith of Revolution Analytics announced a series of online training courses for about $400. There are also various training classes offered at industry conferences like Predictive Analytics World; expect to pay anywhere from $400 to $1,200 for a training class at these events. If you go to the New York Predictive Analytics newbie R training in October, the instructor is Max Kuhn. You could not pick a better R instructor than Max.
I wish there were more resources out there to help new R users along. Whether to approach R through a book or an introductory training class really depends on how you learn most effectively. I do believe that regardless of the path you choose, installing RStudio will help a great deal.
I know there have to be other helpful resources out there. If anyone knows of one please either email me or add a comment here, and I will add them to this post.
Monday, June 13, 2011
Scientific American writes about Sabermetrics...sort of
In the June 5 issue of Scientific American there is an article about baseball that looks at the chances of a batter being hit by a pitch. I am not sure there is much significance to the correlation it finds between being hit by a pitch and temperature. I doubt there is even enough data in one season to support the kinds of statements the authors make. However, if I were pitching and it was 95 degrees, I might bean a batter to get thrown out of the game and sent to the nice air-conditioned locker room.
A couple of things jumped out at me in this data. First, fewer than 0.8% of at-bats resulted in a hit batter. That seemed much lower than I would have expected. The temperature choices also looked kind of arbitrary to me (95°F and 55°F). Starting with those temperatures, couldn't you just as well conclude that there are more hit batters in the middle of the season than at the beginning and the end? I also did not see a control for bias by ballpark or team, which would have been interesting.
Just for fun I looked up Don Baylor's and Craig Biggio's hit-by-pitch percentages, which are 2.8% and 2.6% respectively. If Walter Johnson had been pitching to these two, they would have gotten beaned every time they went to the plate. Although in Walter's defense, he hit less than 1% of the batters he faced. When two players show such a deviation from the mean, there must be more going on than temperature, because these two got hit in cold weather too.
I think the idea of looking at hit batters is an interesting one, but here I believe there was a strong desire to find a relationship with temperature. It would have been more interesting to look at all the potential factors in a hit batter (player, pitch, game situation, ballpark, teams, weather, etc.) and see what correlations existed.
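A sketch in R of what that fuller analysis might look like, on a simulated stand-in for a per-plate-appearance dataset (every number here is made up):

set.seed(3)
n <- 5000
pa <- data.frame(
  temp = runif(n, 40, 100),                        # game-time temperature (F)
  park = factor(sample(1:30, n, replace = TRUE)))  # ballpark id
# simulate a weak temperature effect so the example runs end to end
pa$hbp <- rbinom(n, 1, plogis(-5 + 0.005 * pa$temp))
# does temperature survive once ballpark is controlled for?
fit <- glm(hbp ~ temp + park, data = pa, family = binomial)
summary(fit)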
Overall I am glad Scientific American took a shot at baseball, but I wish they had taken a deeper dive into their chosen topic.
Video of NYC Machine Learning Meetup
This is a great group run by Paul Dix in New York. They usually meet at the AOL HQ. Two of the best talks I have been to in the last year have been at this group. If you are in New York and interested in machine learning, go to this meetup!
I missed this talk on Bayesian Model Averaging, but I just watched the video. It is worth watching. Here is the link to the video:
NYC Machine Learning Meetup - Bayesian Model Averaging
Thursday, June 9, 2011
Probability in the Fourth Grade
Usually I am totally frustrated by the pace of the mathematical education given to my daughter. I really feel that much more time and work must be put into getting children through the historical basics so that they can deal with the modern mathematical requirements of the world we function in. I have always used the age at which people take calculus as a gauge of the speed of math education. For example, my father took calculus as a junior in college, while I, a generation later, took calculus as a sophomore in high school. So I reached that milestone five years earlier than the generation before me. There was a cost for that speed-up: my father has a much better feel for numbers than I do and can estimate results with far greater accuracy. However, it was a price worth paying if I was going to possess the math skills I needed to pursue my career of choice in 1992.
Now another generation has passed, and I feel our children need to be exposed to higher-level math at an even earlier age than I was. It is critical if this generation is going to be able to comprehend the analytical models being used to attack problems today. There will be a cost in numerical feel for this, just as there was for me versus my father, but how many people can really visualize an object in N-dimensional space anyway?
In the final weeks of my daughter's fourth-grade year, I cannot express the joy I felt when she came home with a probability problem and a matrix. My daughter may not fully understand what math she was given, but I did. The probability problem was a two-dice problem comparing the chance of a specific roll like double sixes (1/36, same as all the others) with the chance of a specific total like 7 (1/6). Then they listed all the possibilities and checked that the probabilities total 1, or 100%. The second problem was an expected-value matrix: the old problem of a guy in a room with three doors, where two of the doors lead him down a hall for some amount of time before returning him to the room and the third door leads to freedom. They had to figure out the expected time it would take to get out of the room. This is cool stuff, and they were able to do both problems. I can only hope this kind of higher math instruction continues next year in fifth grade.
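For anyone who wants to check the arithmetic, assume (hypothetically) that the two dead-end halls take 2 and 3 hours and each door is picked with probability 1/3. If E is the expected time to escape, then E = (1/3)(2 + E) + (1/3)(3 + E) + (1/3)(0), which simplifies to E = 5/3 + (2/3)E, so (1/3)E = 5/3 and E = 5 hours.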
Tuesday, June 7, 2011
Correlation does not mean causation.
I see people forgetting this all the time. I have thought about writing a short piece on it, and I have pushed it off again and again simply because it is not that much fun of an article to write. I mean, the old example of "100% of people who drink water die" should be enough to show why a very strong correlation between two things (drinking water and dying) in no way means that one causes the other.
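A three-line R sketch makes a related point, that strong correlations show up with no causal link at all: two independent random walks will often be highly correlated.

set.seed(2)
x <- cumsum(rnorm(200))  # random walk one
y <- cumsum(rnorm(200))  # random walk two, generated completely independently
cor(x, y)                # often far from zero despite no relationship whatsoever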
A harder concept may be what exactly correlation is. I think most people understand that it means some sort of relationship between two factors. On her blog Moved by Metrics, my friend Georgette Asherman does an excellent post exploring the concept of correlation.
Another blog I often read is Falkenblog. He usually posts on fairly serious issues and financial matters. Not today! Today when I went to his site, there was a post linking to a site that collects unusual correlations. I clicked it, and the first one was the correlation between disliking mayonnaise and bad dancing. I was dying laughing, because I hate mayonnaise and I am a terrible dancer. If you need a quick smile during the day, go to correlated.org.
Looking for free statistical software?
I have been following a LinkedIn discussion for a couple of weeks on what open source analytical software a company with little money should use. I put up a recommendation for R, but it was really wild to see the range of suggestions. One was just to use SAS because it really was not that expensive when you considered what you got. Others recommended R, Rattle, KNIME, and a whole bunch I had never even heard of. Overall it was a very entertaining discussion, but I am not sure it provided any real benefit for the person who posed the question. I will be the first to admit I have an R bias, so my knee-jerk reaction was to reply R without fully understanding the guy's needs. Another solution might well be superior to R in his particular case.
One person responded with a link to The Impoverished Social Scientist's Guide to Free Statistical Software and Resources by Professor Micah Altman of Harvard. First, what a great title! Second, what a fine resource. Yes, it is a little dated, with a last update in 2008, but I think it is still pretty on target even three years later. So if you are an impoverished social scientist, take a look. If you are simply a person wondering what open source tools are available to address your needs, this is a good place to start.
Friday, June 3, 2011
R Bloggers posts a Sabermetric article
For someone just starting out in either R or sabermetrics, Pitch F/X is a great way to get into the hobby. This post has some simple code that produces interesting images of relevant data. My first introduction to R and sabermetrics was the Baseball Hacks book, and it spurred my interest in both. I also think some of the most interesting data publicly available is the Pitch F/X data, because it is performance-based data as opposed to the results-based data typical in baseball. Although I wish they would make the Hit F/X data and the Trackman data available.
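To give the flavor of that kind of code, here is a hedged sketch in R assuming a dataframe of pitch locations; the coordinates and pitch-type codes below are simulated, not real Pitch F/X records:

set.seed(5)
pitches <- data.frame(
  px   = rnorm(300, 0, 0.8),                     # horizontal plate location (ft)
  pz   = rnorm(300, 2.5, 0.7),                   # height at the plate (ft)
  type = factor(sample(c("FF", "CU", "SL"), 300, replace = TRUE)))
plot(pitches$px, pitches$pz, col = as.integer(pitches$type),
     xlab = "Horizontal location (ft)", ylab = "Height (ft)")
rect(-0.85, 1.5, 0.85, 3.5)                      # rough strike zone outline
legend("topright", legend = levels(pitches$type), col = 1:3, pch = 1)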
I am working on putting together a Kaggle competition with a sabermetrics theme with Anthony Goldbloom and others. If anyone out there has a suggestion on what would be good to predict, I would love to hear from you. I have looked at using the Retrosheet or MLBAM data for the dataset, or anything that includes the Pitch F/X data. I would also appreciate any other recommendations on where to get the data for the contest.
Here is the article from R-bloggers by Millsy with Josh Weinstock
Thursday, June 2, 2011
Vincent Carey BioConductor talk at Boston Meetup
A large group of biostatisticians braved a tornado warning to attend last night's RUG meetup in Boston. The less hardy finance guys stayed home and missed a simply great talk. Dr. Carey gave us a good example of the power of Bioconductor. He also talked a little about what has made R successful and some things R needs to address to continue to progress.
The slides for his talk are already up on the Greater Boston R User Group site. This is a simply awesome RUG; if you are ever in the Boston area, you should go.
Thoughts on Data Visualization
Most of the data visualizations I look at on any given day are for entertainment, not work. One of my favorite sites is Junk Charts. However, as the name implies, Junk Charts deals with the not-so-good to the horrible in the field of data visualization. Lately I have seen some really cool visualizations, and I want to talk about them and some reasons why I like them.
The first visualization, which I saw on Flowing Data, is of the 11.3M deaths in Just Cause 2. This thing is just beautiful. It is also informative: it is easy to see where the hot kill zones are and the places to avoid. I was grateful they did not use my 20M deaths in Halo for this visualization.
The second one is also from a Flowing Data post, of airline routes. I had seen this type of graph done other ways before, but it was never interesting to me. This one is. I like that line width is the indicator, and it is visually stunning. Here is another example of basically the same data, but the results are disappointing. I think this is for two reasons: the coloring adds nothing, and the perspective angle is bad.
I also saw on Jeffrey Breen's blog a visualization of 20 years of air travel in 20 seconds. I thought this one was cool for a couple of reasons. First, I love dynamic charts. I think they are a great way to illustrate the progression of change in data. Second, this visualization gives me the option of choosing the representation that works for me (bubble, bar, line). Data visualization is all about helping the reader understand what is going on, and what better way to do that than letting readers pick the format that works best for them? On that topic, you may have noticed that two of the three visualizations I liked were in black and white. I believe that is because I am pretty severely colorblind. Many of the more intricate visualizations use colors that I cannot differentiate. So rather than providing me with insight, those colorful presentations leave me lost and confused.
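For fellow colorblind readers, R can help a little here: the RColorBrewer package includes qualitative palettes chosen to stay distinguishable, as in this sketch:

library(RColorBrewer)
pal <- brewer.pal(8, "Dark2")    # a palette generally considered colorblind-friendly
barplot(rep(1, 8), col = pal)    # preview the eight colors
# recent versions of RColorBrewer can also list the colorblind-safe palettes:
# display.brewer.all(colorblindFriendly = TRUE)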
Wednesday, June 1, 2011
A look at Batting orders
There is one blog on sports statistics I read religiously, and that is Phil Birnbaum's Sabermetric Research. It is a great read, and he looks at many aspects of lots of different sports as opposed to just baseball. If you have not looked at his stuff before, check it out.
One of his recent postings dealt with a paper by Nobuyoshi Hirotsu, who looked at whether maximizing expected runs is always the best way to pick a batting order, or whether a lineup with lower expected runs could produce more wins because of lower volatility. Hirotsu used a cut-down version of the game to calculate expected runs and ran a Monte Carlo simulation to determine the winners of each potential matchup. For this experiment he used the 2007 season.
Out of the 600,000 potential matchups, guess how many instances he found where the lineup with the lower expected runs won more than 50% of the games? Thirteen! I was surprised there were not many more than that. I expected a fair number of lineups of high-average singles hitters that might have a lower expected number of runs but win against lineups of power hitters who score more runs on average but with greater volatility due to lower batting averages.
Based on Hirotsu's approach, I think the results are surprising but correct. However, I can see some potential problems with how he constructed his model. First, by building a cut-down model for expected runs, he may have reduced the volatility of the various lineups and made a win by a lower-expected-run lineup less likely. Second, the lineups were based on the player makeup of the actual teams. For whatever reason (in baseball I usually assume tradition), most MLB lineups consist of power hitters and reliable hitters. I think a very interesting question is whether this type of lineup is optimal. What type of lineup gets the highest expected wins, and does it do so with the highest expected runs or with some balance between high expected runs and lower volatility?
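The flavor of Hirotsu's experiment can be sketched in a few lines of R. Here the two lineups' run distributions are crude normal stand-ins with made-up means and volatilities, not his cut-down game model:

set.seed(9)
games <- 100000
runs_a <- pmax(0, rnorm(games, mean = 5.0, sd = 3.5))  # high-mean, high-volatility lineup
runs_b <- pmax(0, rnorm(games, mean = 4.8, sd = 2.0))  # lower-mean, steadier lineup
mean(runs_b > runs_a)  # fraction of games the lower-expected-run lineup wins

Under these stand-in numbers, the steadier lineup still wins slightly less than half the time, which is consistent with how rarely Hirotsu found the effect.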