Big Computing: 2012

Wednesday, November 28, 2012

Jared "The Statistician" Lander gets his 15 minutes

With the huge Powerball lottery, every TV news channel in the country is bringing in a statistician to talk about the odds of winning. CBS News New York brought in local statistician Jared Lander.

Here is the Link:

Monday, October 29, 2012

Photos of Hurricane Sandy

I just walked around my neighborhood in Milford, Connecticut. Hurricane Sandy is just starting to impact the low-lying streets below our house. There is a little wind, but the seas are definitely higher than usual. I have attached some pictures and a video for all to see what it is like here on the Milford shore.

So the first high tide was not that bad. I would expect the high tide tonight to be much worse.

Wednesday, May 16, 2012

Connecticut R Meetup has second Meeting

With a second meeting under our belt, this R meetup group seems to be well on its way. Attendance doubled from our last meeting, with over 20 people braving the rain to join us. We also added 10 new members online. I am pleased with how well we have done so far.

We are going to try to hold these meetups once a month. Any member can go to the meetup site and suggest a topic for a meeting or even a time and location.

I want to thank Jay for his presentation on TAQ data and Rajarshi for his informative talk on Twitter sentiment data and geo-location information.

Please feel free to join the group: Connecticut R Users Group

Thursday, May 3, 2012

Saints Bounty scandal or is this just the logical extension of what is done at the high school and college level?

When I was in high school we were given stars for big plays. If those plays were on defense, they were almost always for big, punishing hits. They were always legal hits, but they were big. You made a big hit, you got a star; it was that simple. It was an effective incentive.

Now my college did not give out stickers to put on our helmets so that incentive program did not exist at Cornell, but it did and still does at other universities. Florida State gives out Tomahawks for great plays. OSU gives its players Buckeye stickers for outstanding performance. Georgia players get white dog bones for a good play on the field. Clemson players get a paw print sticker. Even Stanford players have participated in this practice.

Let's be very clear: these stickers are given out for things like big hits. I do not believe they are ever given out for illegal hits. In fact, I believe most of these schools will actually take stickers away from a player for an illegal hit. However, I do believe that if you put a good legal hit on an opposing player that is worthy of a sticker, you will get that sticker regardless of whether the opposing player was injured or not.

To me, what happened at the Saints is a logical extension of that system into the NFL. Did it get corrupted in the translation? Sure. I believe it got out of line because instead of rewarding good plays, which include big hits, it rewarded big hits that resulted in injuries. However, I do not believe any of the Saints players were rewarded for illegal hits that injured other players. If these same hits had occurred on the field at OSU, FSU or countless high schools across the nation, the players would have been rewarded for their play with a sticker.

Now, personally, I wanted anyone I put a big hit on not to be injured. I felt it was better if they stayed in the game after the hit. I figured a good hit would make them afraid to come at me again because they knew I would knock them out again. If an opponent was afraid of me before the ball was even snapped, I had already won the battle before the play even started.

Wednesday, May 2, 2012

Jo-Ann Mettler's Art Video

I was up late tonight, and I googled my mother's name. To my surprise, a video came up of my mom talking about her art. I really enjoyed hearing her talk about doing what she loves. I hope you enjoy it.

Tuesday, April 24, 2012

Red Sox off to worst April in over a decade

2012 continues to be a year of struggles for the Boston Red Sox as they show more wear and tear than the 100-year-old Fenway that they play in. This is quite a change from the first decade of the new century.

The 2000s were years when the mighty Red Sox roared out of spring training with strong starters, a powerful closer and mighty batters. Their April record inspired talk of winning that first longed-for World Series, or another one this year. It seems so long ago.

Now the Red Sox lumber into the new season with an unsure rotation, an opening at closer and bats that still have not emerged from a winter in storage. It is hard to get depressed in spring, but a 5-10 Red Sox will do it for me.

I believe we are seeing the Red Sox move away from the things that made them successful in recent years. Management figures like Theo Epstein and Terry Francona have moved on. Players like Wakefield, Manny and Varitek have aged and retired. The Red Sox need to return to what made them successful: a strong front office driven by sabermetrics. I hope it happens soon.

Tuesday, April 17, 2012

Next Connecticut R Users Group Meetup is May 15th. #rstats

We have confirmed the room at the Yale Statistics Department on Hillhouse Avenue for May 15th for our next R Users Meetup. The meeting will start with two short lightning talks and be followed by an R-Lab similar to the very successful Stats-Lab that Dr. Emerson has done at Yale for a number of years now.

The first presentation will be by Rajarshi Guha, with a lightning talk exploring Twitter data with R plus basic sentiment analysis. Twitter feeds are a great way to try out different types of analysis, and packages like Jeff Gentry's twitteR package make it relatively easy to do.

The second presentation will be by Illya Mowerman with an example of using logistic regression in R to predict leads and converters in Marketing. Regression really forms the foundation of a lot of predictive analytics so this should be interesting.

If you have a data set you want the group to play with or some code that you could use help with, bring it to the meeting and we will work through it as a group in the R-Lab portion of the meetup.

Friday, April 13, 2012

Johnny Damon signs with the Cleveland Indians

This morning it was announced that Johnny Damon signed a one-year contract with the Cleveland Indians. I have always been a Johnny Damon fan, and I appreciate the skill with which the Indians have been run as a sabermetrics team with the help of Keith Woolner for many years. However, I cannot see how this makes sense from any perspective except that Damon has found a team with such a large void that he will get a chance to play.

I am also surprised that Damon is coming to the Indians as a left fielder. He has not regularly played in the field since 2009, and it was my understanding that other teams were really looking at him as a designated hitter. Damon can still hit: he batted .261 with a .743 OPS last year with the Rays. There is no doubt he can still run, with 19 successful steals in 25 attempts last year. However, he is 38 years old, and playing the outfield is not a short sprint to the next base but constant and continual movement over the course of an inning.

Maybe the Indians are just trying to fill the hole they have in left field any way they can, or they see something in Damon that no one else does. The Indians are really good at seeing things other people miss. The Johnny Damon fan in me hopes it is more than that. I wish Demon Damon a great year.

Thursday, April 12, 2012

A great start to the Connecticut R Users Group

Tuesday night was our first meetup, and it went off exceptionally well. Jay is a great leader for a discussion, and the ten plus people who came to our first meeting really got a treat.

Rather than do a strict presentation, Jay just put up live on the screen two projects that he was working on. There is no better way to show the power of R than in exploratory data analysis. In minutes Jay was able to read in a data set from the web, clean up that data and play with it. I do not believe any other language can do this type of work with the speed and ease of R.

This format went so well that we will continue to use it for the Connecticut R Users Meetup with a little modification. The basic format will be one or two lightning talks by members about what they are working on in R, followed by a bring-in-your-problem/code session. The second part is similar to what Jay has been doing with his Statistics Lab at Yale for years with great success. The idea is that someone brings in some code and/or a data set that they are working on and having trouble with. The group then works on that problem collectively to develop approaches and implementations to extract information from that problem.

I am looking forward to our next meeting in May.

Monday, April 9, 2012

Red Sox headed for a worse start than 1945

The Bobby Valentine era at Fenway Park has started with a fizzle. The Red Sox have entered the 2012 season the same way they began the 2011 campaign: on a losing streak. In 2011 the Red Sox started 0-6 before beating the New York Yankees and ending their quest to have the worst start in Red Sox history.

This could be the year that the Red Sox finally do away with a record that has stood for over sixty years. In 1945 the Red Sox started the season with eight straight losses. That streak ended when they beat the Philadelphia Athletics on April 28, 1945. The 2011 losing streak ended with a victory over the Yankees. There is a chance the Red Sox could end this current streak by beating the Yankees on April 20th at Fenway. The Yankees are currently off to a winless start as well.

For the last five years the Red Sox have had a poor first two weeks of the season, starting every year with a losing record. It is an odd development that could just be the result of chance.

Given that the worst start ever by a Major League Baseball team is the 21 straight losses of the 1988 Baltimore Orioles, the Red Sox's all-time worst start of 8 straight losses is fairly unimpressive.

Wednesday, April 4, 2012

Update on 1940 Census release.....Servers overwhelmed

I guess I was not the only one waiting for the release of the 1940 US Census documents in digital format on April 2. The traffic was so high that it overwhelmed the servers. Here is a link to the USA Today article on the server issue.

I have not done any serious work with the data. I have just had fun with it so far. I looked up my grandparents, and the report included both my mother and father which was cool. I did notice how slow the website was, but I thought it was on my end and not related to traffic on their end. I guess I was wrong.

Friday, March 30, 2012

1940 US Census Raw Data to be Released on April 2nd - Analytics Geeks rejoice!

Yes, there is nothing better for an analytics junkie than access to a big data set with the added benefit of the raw data being available in digital format. So it is with the 1940 United States Federal Census data, to be released this coming Monday. This data is not anonymized and contains the name, age, sex and location of all 130 million Americans. In addition, 5% of respondents were asked supplemental questions, and those answers are also contained in this data. This should be a fun open data set to play with.

Upon its release, the 1940 U.S. Census Community Project, a joint initiative between FamilySearch and other leading genealogy organizations, will coordinate efforts to provide quick access to these digital images and immediately start indexing these records to make them searchable online with free and open access. They are looking for volunteers to help in this effort.

So Hack away and enjoy this new data set.

Where is the sustainable Sushi in California?

I went out to California to visit my parents over the spring break. On that trip we went to the popular local sushi restaurant. The place was called Okura. Apparently this is the place where all the professional tennis players go to eat during the BNP Open in La Quinta.

I figured that a sushi place in Southern California would be progressive and cutting edge. I was wrong. There was no evidence of any effort to be environmentally conscious. The chopsticks were disposable and the seafood was unsustainable. Being a regular of Miya's in New Haven I had not seen this type of seafood in years. I found I was not missing anything. The flavors were bland, and the food lacked creativity. Okura served the old standards of rolls with Tuna, Salmon and shrimp. I found these could not stand up against Miya's Lion Fish, Scup and Asian Shore Crabs.

With all that has happened lately in our understanding of what we can and should do to eat seafood in a sustainable way, it is time for all to follow the direction led by Miya's Sushi. Sustainable seafood is now mainstream, with grocery stores like Wegmans and Whole Foods leading the way.

Thursday, March 29, 2012

Murray Lender passes away... thanks for everything.

Last week Murray Lender of Lender's Bagels fame passed away. I had not thought of him for years. I only met him a few times, but he and his family did some little and big things that really made growing up in New Haven a great experience.

I went to school with a number of the Lender family kids. Every week a truck from the Lender's bakery would drop off some fresh bagels to be given out for free at snack time. They were great. To this day when I think of getting a snack, the first thing I think of is a bagel, because not only do I like bagels but they bring me back to the pleasant memories of my youth.

In the 80s the Lenders opened a restaurant in Hamden that served every food that could include bagels. There were bagel pizzas and bagel burgers. That was the first time I ever saw a bagel pizza, and I still see them in the freezer section whenever I go to the store. The bagel burgers are still my favorite burger of all time. There is simply nothing better than an onion bagel burger. It is awesome!

After they sold Lender's Bagels to Kraft, the Lender family expanded their support of the New Haven community and the schools their kids went to. I enjoyed the result of their generosity when I was young, and now my daughter is using those same facilities.

I would say Murray Lender will be missed, but I see him every day in New Haven at the JCC, Hamden Hall and Foote.

Tuesday, March 13, 2012

March 14 is International Pi day

Yes, it is that time of year when all practicing Geeks gather around the round table to celebrate that most sacred of days to honor the mighty Pi. Not the Apple Pie of All-American fame or the Cherry Pie we all craved in our teenage years, but Pi, the ratio of the circumference of a Euclidean circle to its diameter, or roughly 3.14.

Although I have always been partial to the worship of Avogadro's Number as a better representation of the dimensionless representation of matter, it is the humble Pi that has attracted the most ardent followers in recent years. Just like most religions, numerists have placed their holidays on top of or near popular pagan holidays of the past. Passover and Easter come up yearly around the time of the ancient spring festival. So it is that Pi Day comes to us every spring to mark the exit from mathematical ignorance that started with the Greeks. Originally Pi was known as Archimedes' Constant, but since you can traditionally only have one thing named after you and there was the Archimedes Screw (I love that name!), Pi became known as Pi.
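For anyone who wants to see why the old name fits, here is a minimal Python sketch of the polygon method usually credited to Archimedes: squeeze a unit circle between an inscribed and a circumscribed regular polygon and keep doubling the number of sides. The function name archimedes_bounds and the choice to start from a hexagon are my own assumptions for illustration, not something from a particular library.

```python
import math
from typing import Tuple


def archimedes_bounds(doublings: int) -> Tuple[float, float]:
    """Bound pi using regular polygons inside and outside a unit circle.

    Starts from a hexagon (6 sides) and doubles the side count
    `doublings` times, so 4 doublings gives a 96-sided polygon.
    """
    sides = 6
    s = 1.0  # side length of a regular hexagon inscribed in a unit circle
    for _ in range(doublings):
        # Half-angle identity: side of the inscribed polygon with twice the sides
        s = math.sqrt(2.0 - math.sqrt(4.0 - s * s))
        sides *= 2
    # Side of the circumscribed polygon with the same number of sides
    t = 2.0 * s / math.sqrt(4.0 - s * s)
    lower = sides * s / 2.0  # inscribed perimeter divided by the diameter
    upper = sides * t / 2.0  # circumscribed perimeter divided by the diameter
    return lower, upper


low, high = archimedes_bounds(4)
print(low, high)
```

With four doublings the two bounds already agree with 3.14 to two decimal places. The square-root recurrence loses precision if you double many more times, so treat this as a toy and not a serious way to compute digits.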

Pi Day is celebrated just like any other religious holiday. We eat traditional foods (round).

We watch movies about the thing we worship.

Not exactly Santa Claus is Coming to Town, but not bad. There is also a website for Pi Day and a Facebook page. So kick back, eat a pizza at a round table and throw a ball around this Pi Day. Thanks to Jared Lander for introducing me to this most special of days.

Monday, March 12, 2012

Predictive Analytics for March Madness 2012

For a couple of years now Danny Tarlow and Commissioner Lee have hosted a Predictive Analytics competition for March Madness. It even got some press last year:"

Software to predict 'March Madness' basketball winner

MacGregor Campbell, consultant
Fine, computers, you can beat us at chess and Jeopardy!, just please let us keep March Madness. With the US National Collegiate Athletic Association's basketball tournament starting today, contestants in the second annual March Madness Predictive Analytics Challenge are attempting to build software that can pick winning teams better than humans.
The contest pits machine against machine to find out which algorithm can correctly predict the outcome of the 64-team contest. Tournament brackets must be chosen entirely by computer algorithm, and no specific team-based rules, such as "always pick Duke over North Carolina", are allowed. All contestants are restricted to using the same data set - team and player statistics from the 2006 season until last month.
Contest organiser Danny Tarlow's own entry started out as a movie recommendation engine similar to those used on sites like Netflix. He says that predicting what movie a particular person would like to see is similar to predicting how well a basketball team's attack will do against their opponent's defence: both interactions are driven by unknown rules.
To predict the result of a basketball game, his algorithm chews through loads of regular season data and uses probabilities to find equations that fit the outcomes of each game. It then uses these equations to pick which teams will win in tournament match-ups. "The algorithm knows nothing about basketball or details about any team. It just sees the outcome of each game in the season, and it tries to discover latent characteristics that best explain the outcomes," he says.
Other entries range from using genetic algorithms to evolve equations that can pick winners to more straightforward attempts to boil down a team's strengths and weaknesses to a single number, then pick the team with the higher number in each match-up.
Last year's contest had 10 entries, including a "pace" bracket that simply picked the higher-seeded team in each matchup. Six of the entries did better than this baseline, one even predicting underdog Butler University's surprising ascent to the final four.
Tarlow hopes for a better performance this year, but is well aware of the difficulty of predicting the outcome of an entire basketball tournament. "There's clearly a lot of luck that goes into having a successful bracket."
We'll know how the software programs fare soon - the round of 64 begins today."

I often think that the world of predictive analytics competitions is made up solely of Kaggle competitions, but there are lots of others out there. These two guys have run a good contest for a while now, so I encourage everyone to give it a try, especially if you are an R user instead of a Python guy.

I played with some models to do this, but none of them were ever outstanding. One I liked had a factor for streaky teams. I found that teams who had long runs of multiple wins tended to do better than teams with similar records who did not. When I further tuned this with weighting for things like streaks later in the season and level of competition, it seemed to do better than anything else I tried. If you have the time, don't just fill out a bracket, predict one.
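To make the streak idea concrete, here is a minimal Python sketch of what a streak feature could look like. It is not my actual model: the function names, the season encoded as a string of W and L, and the linear late-season weight that ramps from 1.0 to 2.0 are all assumptions made up for illustration, and it leaves out level of competition entirely.

```python
def longest_win_streak(results: str) -> int:
    """Length of the longest run of 'W' in a season string such as 'WWLWWWL'."""
    best = 0
    current = 0
    for game in results:
        current = current + 1 if game == "W" else 0
        best = max(best, current)
    return best


def late_weighted_streak_score(results: str) -> float:
    """Score that rewards long win streaks, with later games counting more.

    Each win adds (weight * current streak length). The weight is a
    hypothetical linear ramp from 1.0 for the first game to 2.0 for the last.
    """
    n = len(results)
    score = 0.0
    current = 0
    for i, game in enumerate(results):
        if game == "W":
            current += 1
            weight = 1.0 + i / max(n - 1, 1)
            score += weight * current
        else:
            current = 0
    return score


# Two made-up teams with the same 4-4 record but very different streakiness
steady = "WLWLWLWL"
streaky = "LLLLWWWW"
print(longest_win_streak(steady), longest_win_streak(streaky))
print(late_weighted_streak_score(steady), late_weighted_streak_score(streaky))
```

Both teams have identical records, but the streaky team that got hot late scores much higher, which is the kind of separation a streak factor is meant to add on top of plain win-loss record.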

Thursday, March 1, 2012

Playing Golf with President Clinton in Palm Springs - Never pass up a special experience

My father has played in the Bob Hope since the 1960s. He has always enjoyed the event, but this year he thought he would pass on it because he was too busy. My father is 70 and has been retired for over a decade. When he told me of his plans to pass on this tournament, I told him I thought he was crazy. The Hope is a unique and special opportunity that I would be grateful to experience just once in my life. Most people never get to do something cool and special like this in their entire lives. I asked him to reconsider and play in the tournament.

To my great pleasure he did reconsider and joined the field of amateurs at the Hope. The result was unbelievable.

The tournament is no longer officially called the Bob Hope Desert Classic, but now goes by the name Humana Challenge, with the Clinton Foundation as its lead sponsor. It is a four-day tournament where amateurs are teamed up with pros for three of the days and play on three different courses. My father's draw was in the celebrity field, so he was on television and got to talk with many of the celebrities who played in the tournament. He also got to play with some well-known PGA pros. On the first day he played with Phil Mickelson. On the second day he played with Richard E. Lee, and on the final day he was supposed to play with Bud Cauley.

That was not even the best part. On the first day President Clinton joined my father and Phil Mickelson on a few holes! That is why you never pass up on a unique experience. It could be even more unique than you could ever possibly imagine.

Monday, February 27, 2012

Data Mining is not new or scary. Target can predict if you're pregnant. Walmart can predict when you will buy Pop Tarts and Beer. MIT students can predict if you are straight or gay

Data Mining is not new or scary. Humans have been collecting information and using that information to better understand human behavior since before recorded history. The only change is that as data storage and computer processing speed have increased, the ability of Data Miners to study larger data sets and use more complex models has increased too. The UPC codes that are scanned at checkout were put there to allow stores to collect data on their customers, and the shopper cards we all use were not created to help us but to allow stores to better track our purchases and behaviors.

The first mainstream article I can remember about this was about Walmart addressing the needs of its customers before a major storm. By analyzing the data of stores before major storms, Walmart learned that people bought Pop Tarts and Beer. In fact they bought Pop Tarts and Beer at seven times the normal purchasing rate. It is interesting to note they did not just buy Pop Tarts, but Strawberry Pop Tarts. I guess Strawberry just goes better with Beer. Walmart used this information combined with weather data to ship massive amounts of Pop Tarts and Beer to their stores in advance of major storms. The result was that the stores did not run out of these products as they had in the past, and Walmart increased their revenue.
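The "seven times the normal purchasing rate" figure is just a ratio of two sales rates. Here is a minimal Python sketch of that arithmetic. The function name purchase_lift and all of the sales numbers are made up for illustration; they are not Walmart's data, and I am only assuming this is roughly how such a figure would be computed.

```python
def purchase_lift(storm_units: float, storm_days: int,
                  baseline_units: float, baseline_days: int) -> float:
    """Ratio of the daily sales rate before a storm to the normal daily rate."""
    storm_rate = storm_units / storm_days
    baseline_rate = baseline_units / baseline_days
    return storm_rate / baseline_rate


# Hypothetical store: 3,000 boxes sold over a normal 30-day month,
# then 2,100 boxes sold in the 3 days before a storm.
lift = purchase_lift(storm_units=2100, storm_days=3,
                     baseline_units=3000, baseline_days=30)
print(lift)  # 7.0
```

A lift well above 1 for a product in the days before a storm is the signal that would justify shipping extra stock ahead of the next one.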

Here is a Link to that article: What Walmart Knows about their Customers

The recent Target story is really not very different from the Walmart article of five years before. The additional information that Target had to use as they sifted through the data was the individual customer data. This additional level of detail was achieved by getting customers to use a Target credit card or a shopper card. That enabled Target to link the purchases of particular items to a customer they had information about (age, location, income, etc.). This type of personal information allows retailers to direct promotions to specific customers (in this case pregnant women) in an effort to increase and maximize revenue.

Here is a Link to that article: How Companies Learn Your Secrets

Retailers have improved their predictive analytics massively in the last decade. Where retailers used to be concerned with figuring out what customers in a region would purchase, they are now working on what an individual consumer wants to purchase before that consumer walks into the store or goes online. Retailers are getting very accurate and specific. A good example of this was the Netflix Prize, which was an open predictive analytics competition that produced an improved movie recommender for Netflix customers.

None of these efforts are evil in terms of what they were trying to achieve. The goal is to accurately identify the wants and needs of their customers, which would result in greater revenues for the retailer and better service for the customer. That is not a bad goal.

Providing the goods people want, and not the ones they don't, was a key issue for 7-11 stores. They have achieved this through years of point-of-sale data collection with great effect. Here is a case study on data analytics and 7-11. Gone are the days of stale, out-of-date food at 7-11 that no one will ever buy.

However, there is a potential dark side to all this data collection and the predictive analytics run on it. It can be used to discover things about individuals that they did not want or agree to reveal to others. It can be abused.

A research paper by two MIT students showed that by examining the friends of a person on Facebook one could predict the sexual orientation of that person. The problem arises when that individual is not ready to expose their sexuality to the outside world. It is similar to when Target revealed to the father of a teenage girl that she was pregnant. That young girl may not have wanted her father to be aware of her pregnancy. Here is a link to the MIT paper.

Recently the American Civil Liberties Union has expressed concerns about the data collected from traffic light cameras. Apparently this data becomes available to both the government and the company the government engages to collect the data. It also can become available to anyone who requests it through the Freedom of Information Act. Their concern is that here data is being collected without the permission of the individual. This is different from the Target situation because customers were providing their personal information to Target. This is also the case in license plate scanning that is done by a number of cities and towns in Connecticut. The ACLU of Connecticut has filed suit to force the towns to periodically purge the data and have reasonable controls on it for privacy.

Data mining and predictive analytics are here to stay because they are powerful and useful tools to the organizations that use them. In many cases they provide insights and results that benefit not only the organizations that employ them but also the community at large. If you feel that you do not want to be part of this system, then do not participate. How? That is easy. Buy everything with cash. Do not use any frequent shopper cards. Do not use or own a cell phone, and stay away from the internet. There is a price to opting out of the system just as there is a price for choosing to be part of the system, but it can be done.

Wednesday, February 22, 2012

SAS versus R - The longest discussion on Linkedin I have ever seen

Six months ago Oleg Okun posed the following question to the Advanced Business Analytics, Data Mining and Predictive Modeling Group on Linkedin:

SAS versus R

Did anyone have to justify to a prospect/customer why R is better than SAS? What arguments did you provide? Did your prospect/customer agree with them? Why do you think, despite being free and having a lot of packages, R is still not a favorite in Data Mining/Predictive Analytics in the corporate world?

I responded to the original question and have engaged in some of the discussion as time has gone by. It has been fun and interesting. The range and breadth of this discussion thread, and the number of participants, are amazing. It has never gone stale, and there are new contributors every day. The most recent topics include Oracle R Enterprise, which was not even in existence when the question was originally posed by Oleg.

Here is a sample of the Discussion:

John Charnes • @Daniel Lieb -- Hope all's well with you, Dan. Yes, the dig on R has been limits on the size of data sets that it can handle. However, I listened to a webcast last week that described Oracle R Enterprise. The speakers described analyses of terabytes of data by running existing R scripts directly against data stored in Oracle Database 11g. I haven't used it yet myself, but is definitely worth checking out.

Ashish Patel • Support and security is the prime concern for Corporates when it comes to Open source tool adoption..
Last thing business want is tool failure resulting in Business bottom line impact...
Also Enterprisewide configuration and updates management plays crucial role too...
Ex: after Red Hat started providing support for Linux , enterprise adoption increased..

David Tussey • A side comment, I see Python and the various packages (SciPy, NumPy) as the emerging winner in this battle. Python is much easier to program than R, and more "data friendly" when it comes to file and database manipulation.

Greg Sterijevski • Some general comments on SAS vs R:

R: tons of estimation technique, experimental as well as tried and true ones
SAS: not as many techniques, a long experimental dev stage before a proc is considered finalized

Usage Consistency:
R: None or minimal, each package is its own point process
SAS: Very consistent use across procs and packages. If a class statement is supported by a proc, then it works identically to the class statement in any other proc. Concommitantly, all elements of a technique are typically fleshed out before becoming production. R might have an estimation technique 'XYZ', in SAS that technique is typically not considered finished unless it implements a slew of ancillary functionality (like hypothesis testing, parameter restriction, hccme covariance matrices)

R: pretty good coverage for base packages
SAS: a humongous test library accumulated over 20+ years. A consistent, if not complete, testing philosophy.

Numeric Consistency:
R: Not sure, have only run R on Win and Linux
SAS: Numerically 'close' results on platforms from mainframe to PC. Furthermore, an almost slavish insistence on numeric results not changing for a given technique from release to release. In other words, if your actions change the results (even insignificantly) of benched tests, you'd better have a very good reason.

Output reusability:
R: Excellent. The techniques typically yield a structure which whose members can be accessed in a very natural way. One can chain together strings of calcs to come up with much more complicated 'meta models' very easily.

SAS: Start of the art in 1980. I haven't kept up with their improvements in the last few years, but the only reasonable way to capture and reuse output used to be an OUT= statement (if supported) or snatch output from the ODS. SAS/IML is a bit of standout in that it works a bit like R or Python, but the feature set is not as complete.

Overall, there is not a clear winner. If your client is looking for a supported, vetted analytics engine then SAS edges out the competition. If you are a startup with a bit more time than money, R or even an open source library like Apache Math Commons, Mahout, or NumPy will do. For the organization which is not afraid of bit of coding, the open source solution offers the ability to tailor their analytic system to the business need.

Alfredo Roccato • In the real world (I'm speaking of large commercial organizations) where 80%-90% of the time is spent in large scale data processing, SAS has proven to be a very efficient and flexible tool. In an academic contest, where most of time is spent in analysis, mainly dealing with toys data, no doubt that R is the preferred software. In my opinion these packages do not compete each other, even if there is a considerable overlap for statistical methodologies. Rather, a better communication would benefit both: you can use SAS for complex data manipulation and R for all the analyses written by the moltitude of its contributors.

If you are on Linkedin join the group and the "SAS versus R" thread.


Marie Colvin, journalist, who was killed in Syria was from Long Island and a Yale graduate

News of two more journalists' deaths in Syria passed by me as just another number last night. In the sad state of Syria today, two more dead is a small number even by the daily body count.

However, this morning it hit me that these are not bodies; these are people. So who was this American journalist who was killed by shelling in Syria on February 21, 2012? Here is her last report:

She was Marie Colvin. She grew up in Oyster Bay, Long Island, where she did well in school, which allowed her to enter Yale in 1980. She graduated from Yale in 1984 and immediately entered journalism. She loved her job and excelled at it. Over the next 20 years Marie covered all the hot spots around the world, but especially the Middle East. She was a true war reporter. She was in the documentary film Bearing Witness in 2005.

In 2001 Marie lost her eye to shrapnel in Sri Lanka. She lost her life to artillery in 2012.

Monday, February 20, 2012

Brewing my own Sake. First attempt....

Most guys who get into home brewing brew their own beer or make their own wine. There are lots of resources out there and people to talk to about how to make a good product. That would be too easy. I have always wanted to brew something at home, but why do something easy? A couple of my friends tried to brew sake and they said their results were undrinkable. When I cannot do worse than those who tried before me, I have found my calling: home brewing sake.

The only resource I have found to aid in my efforts of home brewing sake is Will Auld of HomeBrewSake. I have ordered their home brew sake kit that comes from the All American Sake Company better known as SakeOne. Now I wait for the material to come to start the roughly ninety day brewing process.

Remember, I am only trying to make something that is drinkable to me. Given that I was able to drink the Applejack that I brewed in my dorm room back at Choate, I figure I should be able to drink just about anything.

Friday, February 17, 2012

Target knows when you are pregnant and a statistician gets credit for it in the New York Times

I do not believe that anyone is unaware that retail companies like Target collect and analyze data on their customers. What I find amazing is how much can be learned from analyzing this data and how accurate that insight is. We are human and therefore want to believe that we are unique and unpredictable, but we are not. Our behaviors and situations can be predicted fairly accurately by comparing sample data collected on us to data on other people.

Andrew Pole, a statistician at Target, applied this to an interesting question and was able to predict whether a customer was pregnant. His work landed him in an article in the New York Times and gave Target insight into its customers. That is not a bad day's work.

New York Times Article

Thursday, February 16, 2012

Jeremy Lin. Passed on by College Recruiters, Missed by Pro Scouts, but identified by a FedEx Driver

This is why I love sports statistics! There are average guys everywhere poring over these numbers with different methods and approaches who come up with conclusions different than the experts, and they're the ones who get it right.

Such is the case with current NBA darling Jeremy Lin. In 2010 a FedEx driver crunched the numbers, came to the conclusion that Lin was a star in the making and published his opinion. Back then no one read his article. Today you cannot get on the site that posted it. I couldn't either, so here is a link to the CNET article.

I think one of the reasons that scouts miss guys like this is that they are so constrained by tradition. That can be as simple as assuming that Asians do not play good basketball unless they are over 7 ft tall, or that guys from Harvard do not make it in the NBA (I am thinking Tommy Amaker will change that). Jeremy Lin's numbers at every level showed he was and is a basketball player.

Friday, February 10, 2012

R Benchmarks on a big data logistic regression

We ran some benchmarks of a logistic regression for a customer with a million-row data set. I know it is not a "BIG DATA" problem for many people, but "slightly big" sounds stupid. We ran the benchmarks on the following setup:

CPU: Intel Xeon X5570 2.93GHz
CPU cores: 8
Network: 10-gigabit ethernet

We used standard open source R 2.14 with glm, Revolution R with glm and Revolution R with "rxlogit". The results were as follows:

R 2.14 with glm: 56 sec
RevoR with glm: 54 sec
RevoR with rxlogit: 7 sec

I cannot really provide more information than that, but I think the results are compelling: in some cases, Revolution R with rxlogit is far superior in terms of speed for problems of a certain size.
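
To put the timings above in relative terms, here is a small Python sketch (not part of the original benchmark) that turns the three reported run times into speedup factors against open source R with glm. The function name `speedups` and the dictionary layout are my own illustration; only the three timings come from the benchmark.

```python
# Run times in seconds, copied from the benchmark results reported above.
timings = {
    "R 2.14 with glm": 56.0,
    "RevoR with glm": 54.0,
    "RevoR with rxlogit": 7.0,
}


def speedups(timings, baseline):
    """Return how many times faster each entry ran than the baseline entry.

    `speedups` is a hypothetical helper written for this illustration.
    """
    base = timings[baseline]
    return {name: base / secs for name, secs in timings.items()}


ratios = speedups(timings, "R 2.14 with glm")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

On these numbers rxlogit comes out 8 times faster than glm in open source R, while Revolution R's glm is only about 4% faster.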

Thursday, February 9, 2012

Oracle R Enterprise goes primetime

Today Oracle announced the release of the commercial version of Oracle R Enterprise. I first heard about this product when it was released as a beta in 2011. There has often been talk of coupling the wealth of analytics tools in R with the scalability of a database. It is good to see a company like Oracle dip its toe into the water.

Here is the text of the press release:

Oracle R Enterprise

Integrating Open Source R with Oracle Database 11g

Oracle R Enterprise, a component of the Oracle Advanced Analytics Option, makes the open source R statistical programming language and environment ready for the enterprise and big data. Designed for problems involving large amounts of data, Oracle R Enterprise integrates R with the Oracle Database. R users can run R commands and scripts for statistical and graphical analyses on data stored in the Oracle Database. R users can develop, refine and deploy R scripts that leverage the parallelism and scalability of the database to automate data analysis. Data analysts can run R packages and develop and operationalize R scripts for analytical applications in one step—without having to learn SQL. Oracle R Enterprise performs function pushdown for in-database execution of base R and popular R packages. Because it runs as an embedded component of the database, Oracle R Enterprise can run any R package either by function pushdown or via embedded R while the database manages the data served to the R engine.
Here is the post from the Oracle blog on Oracle R Enterprise:

Announcing Oracle R Enterprise 1.0

Analyzing huge data sets presents a challenging opportunity for IT decision makers, driven by the balance between the maintenance and support of existing IT infrastructure with the need to analyze rapidly growing data stores. In many cases, processing this data requires a fresh approach because traditional techniques fail when applied to massive data sets. To extract immediate value from big data, we desire tools that efficiently access, organize, analyze and maintain a variety of data types.
Oracle R Enterprise (ORE), a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition, emerges as the clear solution to these challenges. ORE integrates the popular open-source R statistical programming environment with Oracle Database 11g, Oracle Exadata and the Oracle Big Data Appliance, delivering enterprise-level analytics based on R scripts and parallelized, in-database modeling.
How do R and Oracle R Enterprise work together?
The powerful R programming environment enables the creation of sophisticated graphics, statistical analyses, and simulations. It contains a vast set of built-in functions which may be extended to build custom statistical packages. The R engine is limited by capacity and performance for large data, but with Oracle R Enterprise, R bypasses these constraints by leveraging the database as the analytics engine directly from their R session.
The components that support Oracle R Enterprise include:
1. The Oracle R Enterprise transparency layer - a collection of R packages with functions to connect to Oracle Database and use R functionality in Oracle Database. This enables R users to work with data too large to fit into the memory of a user's desktop system, and leverage the scalable Oracle Database as a computational engine.
2. The Oracle statistics engine - a collection of statistical functions and procedures corresponding to commonly-used statistical libraries. The statistics engine packages also execute in Oracle Database.
3. SQL extensions supporting embedded R execution through the database on the database server. R users can execute R closures (functions) using an R or SQL API, while taking advantage of data parallelism. Using the SQL API for embedded R execution, sophisticated R graphics and results can be exposed in OBIEE dashboards and BI Publisher documents.
4. Oracle R Connector for Hadoop (ORCH) - an R package that interfaces with the Hadoop Distributed File System (HDFS) and enables executing MapReduce jobs. ORCH enables R users to work directly with an Oracle Hadoop cluster, executing computations from the R environment, written in the R language and working on data resident in HDFS, Oracle Database, or local files.
Using a simple R workflow, R users can seamlessly utilize the parallel processing architecture of ORE and ORCH for scalability and better performance. Analytics and reporting tasks are moved to the Oracle Database, eliminating long approval chains for data movement and dramatically increasing processing speed. R users are not required to learn SQL because the R-to-SQL translation is shipped to the database and processed behind the scenes. The significant benefits to IT include improved data security, data maintenance and audit compliance practices.
My old company Revolution Analytics has been in the business of providing commercial R support and tools for almost five years now. I do not believe the Revolution Analytics model and the Oracle model have anything in common. While I was at Revolution I learned of the deep love and strong opinions that the R community has for the project that they so selflessly support and grow. It is interesting to read the R-bloggers' perspective on this:

Oracle’s strange understanding of R users

February 8, 2012
(This article was first published on Quantum Forest » rblogs, and kindly contributed to R-bloggers)

After reading David Smith’s tweet on the price of Oracle R Enterprise (actually free, but it requires Oracle Data Mining at $23K/core, as pointed out by Joshua Ulrich) I went to Oracle’s site to see what it was all about. Oracle has a very interesting concept of why we use R:
Statisticians and data analysts like R because they typically don’t know SQL and are not familiar with database tasks. R allows them to remain highly productive.
Pardon? It sounds like if we only knew SQL and database tasks we would not need statistical software. File for future reference.

I hope both companies are successful. I believe that the long term survival of commercial software providers hinges on the smart adoption and integration of powerful open source tools.

Tuesday, February 7, 2012

Revolution Analytics gets new CEO

Today I came across a press release from Revolution Analytics that Norman Nie, founder of SPSS, had stepped down as CEO and David Rich has left Accenture to lead the company. As one of the founders of the company and a current stock holder, I am happy to see Revolution continue to bring in talented people and grow.

Here is the press release:

PALO ALTO, Calif. – February 2, 2012 – Revolution Analytics, the leading commercial provider of R software, services and support, today announced that David Rich, who most recently led Accenture Analytics as Global Managing Director, has been named the company’s CEO. Revolution Analytics’ former CEO, industry luminary Dr. Norman Nie, will remain a company director and will serve as Rich’s Senior Advisor for Products and Strategy.
“Revolution Analytics is primed for a different set of proven leadership abilities—someone with the knowledge and relationships to accelerate large-scale adoption of the company’s first-class products within the world’s biggest and most innovative enterprises,” said Norman Nie. “David has the vision, energy, career experience required to bring the company to the next level. Together, we believe our collaboration will help revolutionize the industry in the best interest of our customers.”

Rich is a 28-year veteran of Accenture. In his most recent position as Global Managing Director for Accenture Analytics, he drove the company’s strategy including alliances and investments in predictive analytics solutions. Rich also had global responsibility for Accenture’s CRM Service Line as well as the High Tech Industry. In his other roles, he had national responsibility for the Communications, High Tech and Media industry sectors; he led major client initiatives and developed global alliance strategies at the corporate level. His experience has made him a frequent contributor to BusinessWeek, Forbes, The Financial Times, The Wall Street Journal and other business publications. He holds a B.S. in Management and Technology from the United States Naval Academy.
“Revolution Analytics has emerged as a disruptive player in the analytics space, largely due to Norman’s vision,” said Rich. “He anticipated the challenges that enterprises would face from the rise of big data. He authored the product road map and brokered the technology partnerships that now enable Revolution Analytics to provide orders of magnitude improvement over legacy products in speed, data capacity and price performance. I feel honored that Norman has passed the torch to me. I look forward to working with him, tapping his experience and leading the team’s efforts to revolutionize the adoption of Predictive Analytics at scale.”

First Meeting of the Connecticut R Users Group

The Connecticut R Users Group will have its first meeting in April. I am proud to announce that our first speaker will be Yale Professor and R user John Emerson. Professor Emerson is an active member of the R community and is known for his BigMemory R Package. He was also a recent speaker at the World Economic Forum in Davos, Switzerland.

Please join us on April 10 at 7pm.

Monday, February 6, 2012

The Great All American Sake.

I enjoy the occasional Sake. On one such occasion I was drinking with a few friends, and I wondered why there were not any Sake Brewers in the United States. What followed was a series of unsupported statements about tradition and Sake and how small the market for Sake was in the United States.

I did not think much about it until a couple of weeks later when I actually did some research on the topic. It turns out there are commercial Sake producers in the United States.

First there are the American Sake Brewers owned by Japanese companies:
Ozeki Sake, Inc. started in 1977
Takara Sake USA, Inc. started in 1982
Yaegaki Sake Brewery, Inc. started in 1987
Gekkeikan Sake (USA), Inc. started in 1989

Then there is the first wholly owned American Sake Brewery, SakeOne, which was founded in 1992. I have had some contact with this company and they have been nothing but helpful and informative. They are excellent ambassadors of the Japanese art of Sake Brewing and the neighborly spirit of the American Northwest.

Sake brewing has recently moved into Texas, and as usual they have added their own Texas twist to the process. The Texas Sake Company was founded in Austin in 2011. Their approach to making Sake is different than most in that they only use locally grown organic products in their sake. The rice they use was brought to Texas by a Japanese delegation in 1904. I am looking forward to tasting some of their unique product.

Finally, since Sake is a brewed beverage, it only makes sense that someone would set up a sake brew pub. That the first one in the United States is in Minnesota is amazing. Moto-i will always have the distinction of being America's first sake brew pub. I have spoken to a few people who have gone there, and they universally enjoyed the experience.

Saturday, February 4, 2012

Local Ocean's Fish Farm in New York is World Class!

Last week I visited the Local Ocean's Aquaculture facility in Upstate New York with celebrity chef Bun Lai. I am grateful that the people at Local Ocean took the time out of their busy schedule to show us an absolutely unbelievable facility! This group has figured out how to raise fish that no one else can. Even more impressive, they do it in a completely closed loop system that is environmentally friendly. They represent the future of fish production, and I cannot think of a better group of people to do it. Until I viewed their plant in Hudson, New York, I had never walked around a facility where my reaction to everything I saw was "that is a brilliant way to do that." I can tell you how amazing this company is, but their video is better:

The world becomes a better place when smart people build profitable companies that solve real problems. Local Ocean is just such a company. I hope they will expand into Connecticut in the very near future.

Wednesday, February 1, 2012

Yale Professor and R user John Emerson Speaks at the World Economic Forum

Last week Yale Professor John (Jay) Emerson was a featured speaker at the World Economic Forum in Davos, Switzerland. I know Jay as the guy who recently beat me out for the championship of our fantasy football league. People in the Northeast and the R community may know Jay for his many presentations at R Users Groups or his Big Memory Package. Here is the video of his presentation:


It was good to see Jay's work in such an important forum of world leaders.

Wednesday, January 25, 2012

A Visit to a Fish Farm

On Monday Chef Bun Lai and I traveled up to Turners Falls, MA to visit an aquaculture facility there called Australis. Aquaculture (a fancy term for fish farming) is a hot topic currently. If done right, fish farming produces vast amounts of high quality protein with minimal environmental impact. I was shocked to learn that seafood is the second largest trade deficit for the United States after oil. Growth of this industry is important not only to establish a reliable source of sustainable food but also economically to our country.

Australis is headed by Josh Goldman, whose love for this business is unrestrained after more than 20 years working in aquaculture. His talks about the importance of this business will bring skeptical college students to their feet, like in this YouTube video. Josh is no less enthusiastic in person. However, it was his knowledge and the operations of his facility that were the most impressive parts of our tour.

The Australis Plant in Turners Falls produces Barramundi for the fresh fish market in the US. If you see live or fresh Barramundi for sale, it came out of this plant. They raise the fish from eggs to marketable product within the walls of their building. Their tanks are closed loop, producing little or no waste water to go back into the environment and only organic waste that is ideal fertilizer for the local farming community. This is the way a modern fish farm should be run! It yields a good tasting fish whose production does not lay waste to the environment or consume unreasonable amounts of resources. The choice to produce Barramundi instead of tilapia was no accident, but the result of a long and detailed research project to determine the best species of fish to satisfy not only the consumers but also the company's desire to be a responsible user of resources and caretaker of the environment.

Aquaculture is an important growth industry, and in the hands of people like those at Australis it is in good hands.