Big Computing: February 2012

Monday, February 27, 2012

Data Mining is not new or scary. Target can predict if you're pregnant. Walmart can predict when you will buy Pop Tarts and Beer. MIT students can predict if you are straight or gay

Data Mining is not new or scary. Humans have been collecting information and using it to better understand human behavior since before recorded history. The only change is that, as data storage and computer processing speed have increased, Data Miners have become able to study larger data sets and use more complex models. The UPC codes scanned at checkout were put there to allow stores to collect data on their customers, and the shopper cards we all use were created not to help us but to allow stores to better track our purchases and behaviors.

The first mainstream article I can remember about this was about Walmart addressing the needs of its customers before a major storm. By analyzing store data from before major storms, Walmart learned that people bought Pop Tarts and Beer. In fact, they bought them at seven times the normal rate. It is interesting to note they did not just buy Pop Tarts, but Strawberry Pop Tarts. I guess Strawberry just goes better with Beer. Walmart combined this information with weather data to ship massive amounts of Pop Tarts and Beer to its stores in advance of major storms. As a result, the stores did not run out of these products as they had in the past, and Walmart increased its revenue.

Here is a Link to that article: What Walmart Knows about their Customers

The recent Target story is really not very different from the Walmart article of five years before. The additional information Target had to work with as it sifted through the data was individual customer data. This additional level of detail was achieved by getting customers to use a Target credit card or a shopper card, which enabled Target to link the purchases of particular items to a customer it had information about (age, location, income, etc.). This type of personal information allows retailers to direct promotions to specific customers (in this case, pregnant women) in an effort to maximize revenue.

Here is a Link to that article: How Companies Learn Your Secrets

Retailers have improved their predictive analytics massively in the last decade. Where retailers used to be concerned with figuring out what customers in a region would purchase, they are now working on what an individual consumer wants to purchase before they walk into the store or go online. Retailers are getting very accurate and specific. A good example of this was the Netflix Prize, an open predictive analytics competition that produced an improved movie recommender for Netflix customers.

None of these efforts are evil in terms of what they were trying to achieve. The goal is to accurately identify the wants and needs of customers, which results in greater revenue for the retailer and better service for the customer. That is not a bad goal.

Providing the goods people want, and not the ones they don't, was a key issue for 7-11 stores. They have achieved this through years of point-of-sale data collection, with great effect. Here is a case study on data analytics and 7-11. Gone are the days of stale, out-of-date food at 7-11 that no one will ever buy.

However, there is a potential dark side to all this data collection and the predictive analytics run on it. It can be used to discover things about individuals that they did not want or agree to reveal to others. It can be abused.

A research paper by two MIT students showed that by examining a person's friends on Facebook one could predict that person's sexual orientation. The problem arises when that individual is not ready to reveal their sexuality to the outside world. It is similar to Target revealing to the father of a teenage girl that she was pregnant. That young girl may not have wanted her father to be aware of her pregnancy. Here is a link to the MIT paper.

Recently the American Civil Liberties Union has expressed concerns about the data collected from traffic light cameras. Apparently this data becomes available both to the government and to the company the government engages to collect it. It can also become available to anyone who requests it through the Freedom of Information Act. Their concern is that here data is being collected without the permission of the individual, which is different from the Target situation, where customers were providing their personal information to Target. The same applies to the license plate scanning done by a number of cities and towns in Connecticut. The ACLU of Connecticut has filed suit to force the towns to periodically purge the data and to place reasonable privacy controls on it.

Data mining and predictive analytics are here to stay because they are powerful and useful tools for the organizations that use them. In many cases they provide insights and results that benefit not only the organizations that employ them but the community at large. If you feel that you do not want to be part of this system, then do not participate. How? That is easy. Buy everything with cash. Do not use any frequent shopper cards. Do not use or own a cell phone, and stay away from the internet. There is a price to opting out of the system, just as there is a price for choosing to be part of it, but it can be done.

Wednesday, February 22, 2012

SAS versus R - The longest discussion on Linkedin I have ever seen

Six months ago Oleg Okun posed the following question to the Advanced Business Analytics, Data Mining and Predictive Modeling Group on Linkedin:

SAS versus R

Did anyone have to justify to a prospect/customer why R is better than SAS? What arguments did you provide? Did your prospect/customer agree with them? Why do you think, despite being free and having a lot of packages, R is still not a favorite in Data Mining/Predictive Analytics in the corporate world?

I responded to the original question and have engaged in some discussion as time has gone by. It has been fun and interesting. The range and breadth of this discussion thread, and the number of participants, is amazing. It has never gone stale, and there are new contributors every day. The most recent topics include Oracle R Enterprise, which did not even exist when the question was originally posed by Oleg.

Here is a sample of the Discussion:

John Charnes • @Daniel Lieb -- Hope all's well with you, Dan. Yes, the dig on R has been limits on the size of data sets that it can handle. However, I listened to a webcast last week that described Oracle R Enterprise. The speakers described analyses of terabytes of data by running existing R scripts directly against data stored in Oracle Database 11g. I haven't used it yet myself, but it is definitely worth checking out.

Ashish Patel • Support and security are the prime concerns for corporations when it comes to open source tool adoption.
The last thing a business wants is a tool failure hitting the business's bottom line.
Enterprise-wide configuration and update management plays a crucial role too.
For example: after Red Hat started providing support for Linux, enterprise adoption increased.

David Tussey • A side comment, I see Python and the various packages (SciPy, NumPy) as the emerging winner in this battle. Python is much easier to program than R, and more "data friendly" when it comes to file and database manipulation.

Greg Sterijevski • Some general comments on SAS vs R:

R: tons of estimation techniques, experimental as well as tried and true ones
SAS: not as many techniques, a long experimental dev stage before a proc is considered finalized

Usage Consistency:
R: None or minimal, each package is its own point process
SAS: Very consistent use across procs and packages. If a class statement is supported by a proc, then it works identically to the class statement in any other proc. Concomitantly, all elements of a technique are typically fleshed out before becoming production. R might have an estimation technique 'XYZ'; in SAS that technique is typically not considered finished unless it implements a slew of ancillary functionality (like hypothesis testing, parameter restriction, hccme covariance matrices)

Testing:
R: pretty good coverage for base packages
SAS: a humongous test library accumulated over 20+ years. A consistent, if not complete, testing philosophy.

Numeric Consistency:
R: Not sure, have only run R on Win and Linux
SAS: Numerically 'close' results on platforms from mainframe to PC. Furthermore, an almost slavish insistence on numeric results not changing for a given technique from release to release. In other words, if your actions change the results (even insignificantly) of benched tests, you'd better have a very good reason.

Output reusability:
R: Excellent. The techniques typically yield a structure whose members can be accessed in a very natural way. One can chain together strings of calcs to come up with much more complicated 'meta models' very easily.

SAS: State of the art in 1980. I haven't kept up with their improvements in the last few years, but the only reasonable way to capture and reuse output used to be an OUT= statement (if supported) or to snatch output from the ODS. SAS/IML is a bit of a standout in that it works a bit like R or Python, but the feature set is not as complete.

Overall, there is not a clear winner. If your client is looking for a supported, vetted analytics engine then SAS edges out the competition. If you are a startup with a bit more time than money, R or even an open source library like Apache Commons Math, Mahout, or NumPy will do. For the organization which is not afraid of a bit of coding, the open source solution offers the ability to tailor their analytic system to the business need.

Alfredo Roccato • In the real world (I'm speaking of large commercial organizations), where 80%-90% of the time is spent in large-scale data processing, SAS has proven to be a very efficient and flexible tool. In an academic context, where most of the time is spent in analysis, mainly dealing with toy data, no doubt R is the preferred software. In my opinion these packages do not compete with each other, even if there is considerable overlap in statistical methodologies. Rather, better communication would benefit both: you can use SAS for complex data manipulation and R for all the analyses written by the multitude of its contributors.

If you are on Linkedin join the group and the "SAS versus R" thread.
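Greg's point about output reusability is easy to see in base R itself. Here is a small sketch of my own (not from the thread) showing that a fitted model is just an ordinary R object whose pieces can be pulled out and chained into further calculations:

```r
# Fit a simple logistic regression on a built-in data set
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# The fit is an ordinary R object; its components are directly accessible
coefs <- coef(fit)      # named vector of coefficients
probs <- fitted(fit)    # fitted probabilities

# Chain the output into a further calculation: a simple confusion table
print(table(predicted = probs > 0.5, actual = mtcars$am))
```

Nothing here requires an OUT= statement or output capture; every result is already a data structure you can compute on.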


Marie Colvin, the journalist killed in Syria, was from Long Island and a Yale graduate

News of two more journalists' deaths in Syria passed me by as just another number last night. In the sad state of Syria today, two more dead is a small number even by the daily body count.

However, this morning it hit me that these are not just bodies; these are people. So who was this American journalist killed by shelling in Syria on February 21, 2012? Here is her last report:

She was Marie Colvin. She grew up in Oyster Bay, Long Island, where she did well in school, which allowed her to enter Yale in 1980. She graduated from Yale in 1984 and immediately entered journalism. She loved her job and excelled at it. Over the next 20 years Marie covered the hot spots around the world, but especially the Middle East. She was a true war reporter. She appeared in the documentary film Bearing Witness in 2005.

Marie lost her eye to shrapnel in Sri Lanka in 2001. She lost her life to artillery in 2012.

Monday, February 20, 2012

Brewing my own Sake. First attempt....

Most guys who get into home brewing brew their own beer or make their own wine. There are lots of resources out there and people to talk to about how to make a good product. That would be too easy. I have always wanted to brew something at home, but why do something easy? A couple of my friends tried to brew sake, and they said their results were undrinkable. Since I can hardly do worse than those who tried before me, I have found my calling: home brewing sake.

The only resource I have found to aid my efforts at home brewing sake is Will Auld of HomeBrewSake. I have ordered the home brew sake kit that comes from the All American Sake Company, better known as SakeOne. Now I wait for the materials to arrive so I can start the roughly ninety-day brewing process.

Remember, I am only trying to make something that is drinkable to me. Given that I was able to drink the Applejack I brewed in my dorm room back at Choate, I figure I should be able to drink just about anything.

Friday, February 17, 2012

Target knows when you are pregnant and a statistician gets credit for it in the New York Times

I do not believe that anyone is unaware that retail companies like Target collect and analyze data on their customers. I do think the amazing thing is what can be learned from analysis of this data and the accuracy of that insight. We are human and therefore want to believe that we are unique and unpredictable, but we are not. Our behaviors and situations can be fairly accurately predicted by comparing sample data collected on us to data on other people.

Andrew Pole, a statistician at Target, took on an interesting question and was able to predict whether a customer was pregnant. His work landed him in an article in the New York Times and gave Target insight into its customers. That is not a bad day's work.

New York Times Article

Thursday, February 16, 2012

Jeremy Lin. Passed on by College Recruiters, Missed by Pro Scouts, but Identified by a FedEx Driver

This is why I love sports statistics! There are average guys everywhere poring over these numbers with different methods and approaches who come up with conclusions different from the experts', and they are the ones who get it right.

Such is the case with current NBA darling Jeremy Lin. In 2010 a FedEx driver crunched the numbers, came to the conclusion that Lin was a star in the making, and published his opinion. Back then no one read his article. Today you cannot get on the site that posted it. I couldn't either, so here is a link to the CNET article.

I think one of the reasons that scouts miss guys like this is that they are so constrained by tradition. That can be as simple as "Asians do not play good basketball unless they are over 7 ft tall" or "guys from Harvard do not make it in the NBA" (I am thinking Tommy Amaker will change that). Jeremy Lin's numbers at every level showed he was and is a basketball player.

Friday, February 10, 2012

R Benchmarks on a big data logistic regression

We did some benchmarks of a logistic regression for a customer with a million-row data set. I know it is not a "BIG DATA" problem for many people, but "slightly big" sounds stupid. We ran the benchmarks on the following setup:

CPU: Intel Xeon X5570 2.93GHz
CPU cores: 8
Network: 10-gigabit ethernet

We used standard open source R 2.14 with glm, Revolution R with glm, and Revolution R with rxLogit. The results were as follows:

R 2.14 with glm: 56 sec
RevoR with glm: 54 sec
RevoR with rxLogit: 7 sec

I cannot really provide more information than that, but I think the results are compelling: in some cases Revolution R with rxLogit is far superior in terms of speed for problems of a certain size.
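For readers who want to try something comparable themselves, here is a minimal sketch of that kind of timing run on simulated data, scaled down from the million-row set we used. The rxLogit call requires Revolution R's RevoScaleR package, so it is shown commented out:

```r
set.seed(42)
n  <- 1e5                           # scaled down from the 1e6-row customer data set
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 * x1 - 0.3 * x2))
dat <- data.frame(y, x1, x2)

# Time a logistic regression with base R (or Revolution R) glm
print(system.time(glm(y ~ x1 + x2, data = dat, family = binomial)))

# Revolution R only -- requires the RevoScaleR package:
# print(system.time(rxLogit(y ~ x1 + x2, data = dat)))
```

Your absolute times will differ from ours, of course; the point is the relative comparison on the same data and hardware.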

Thursday, February 9, 2012

Oracle R Enterprise goes primetime

Today Oracle announced the release of the commercial version of Oracle R Enterprise. I first heard about this product when it was released as a beta in 2011. There has often been talk of coupling the wealth of analytics tools in R with the scalability of a database, so it is good to see a company like Oracle dip its toe into the water.

Here is the text of the press release:

Oracle R Enterprise

Integrating Open Source R with Oracle Database 11g

Oracle R Enterprise, a component of the Oracle Advanced Analytics Option, makes the open source R statistical programming language and environment ready for the enterprise and big data. Designed for problems involving large amounts of data, Oracle R Enterprise integrates R with the Oracle Database. R users can run R commands and scripts for statistical and graphical analyses on data stored in the Oracle Database. R users can develop, refine and deploy R scripts that leverage the parallelism and scalability of the database to automate data analysis. Data analysts can run R packages and develop and operationalize R scripts for analytical applications in one step—without having to learn SQL. Oracle R Enterprise performs function pushdown for in-database execution of base R and popular R packages. Because it runs as an embedded component of the database, Oracle R Enterprise can run any R package either by function pushdown or via embedded R while the database manages the data served to the R engine.
Here is the post from the Oracle blog on Oracle R Enterprise:

Announcing Oracle R Enterprise 1.0

Analyzing huge data sets presents a challenging opportunity for IT decision makers, driven by the balance between the maintenance and support of existing IT infrastructure with the need to analyze rapidly growing data stores. In many cases, processing this data requires a fresh approach because traditional techniques fail when applied to massive data sets. To extract immediate value from big data, we desire tools that efficiently access, organize, analyze and maintain a variety of data types.
Oracle R Enterprise (ORE), a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition, emerges as the clear solution to these challenges. ORE integrates the popular open-source R statistical programming environment with Oracle Database 11g, Oracle Exadata and the Oracle Big Data Appliance, delivering enterprise-level analytics based on R scripts and parallelized, in-database modeling.
How do R and Oracle R Enterprise work together?
The powerful R programming environment enables the creation of sophisticated graphics, statistical analyses, and simulations. It contains a vast set of built-in functions which may be extended to build custom statistical packages. The R engine is limited by capacity and performance for large data, but with Oracle R Enterprise, R users bypass these constraints by leveraging the database as the analytics engine directly from their R session.
The components that support Oracle R Enterprise include:
1. The Oracle R Enterprise transparency layer - a collection of R packages with functions to connect to Oracle Database and use R functionality in Oracle Database. This enables R users to work with data too large to fit into the memory of a user's desktop system, and leverage the scalable Oracle Database as a computational engine.
2. The Oracle statistics engine - a collection of statistical functions and procedures corresponding to commonly-used statistical libraries. The statistics engine packages also execute in Oracle Database.
3. SQL extensions supporting embedded R execution through the database on the database server. R users can execute R closures (functions) using an R or SQL API, while taking advantage of data parallelism. Using the SQL API for embedded R execution, sophisticated R graphics and results can be exposed in OBIEE dashboards and BI Publisher documents.
4. Oracle R Connector for Hadoop (ORCH) - an R package that interfaces with the Hadoop Distributed File System (HDFS) and enables executing MapReduce jobs. ORCH enables R users to work directly with an Oracle Hadoop cluster, executing computations from the R environment, written in the R language and working on data resident in HDFS, Oracle Database, or local files.
Using a simple R workflow, R users can seamlessly utilize the parallel processing architecture of ORE and ORCH for scalability and better performance. Analytics and reporting tasks are moved to the Oracle Database, eliminating long approval chains for data movement and dramatically increasing processing speed. R users are not required to learn SQL because the R-to-SQL translation is shipped to the database and processed behind the scenes. The significant benefits to IT include improved data security, data maintenance and audit compliance practices.
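To give a feel for the transparency layer the release describes, here is a sketch of what an Oracle R Enterprise session might look like. I have not run this myself, and the connection details, table name (SALES), and column names here are placeholders, not anything from the announcement:

```r
library(ORE)   # Oracle R Enterprise client packages

# Placeholder credentials -- substitute your own database details
ore.connect(user = "analyst", sid = "orcl", host = "dbhost",
            password = "secret", all = TRUE)

ore.ls()       # database tables appear in R as ore.frame objects

# Operations on an ore.frame (here a hypothetical SALES table) are
# translated to SQL and executed in the database, not in local memory
agg <- aggregate(SALES$AMOUNT, by = list(SALES$REGION), FUN = sum)

res <- ore.pull(agg)   # pull only the small aggregated result into local R
```

The appeal is that the idioms are ordinary R; the heavy lifting just happens in the database.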
My old company, Revolution Analytics, has been in the business of providing commercial R support and tools for almost five years now. I do not believe the Revolution Analytics model and the Oracle model have anything in common. While I was at Revolution I learned the deep love and strong opinions that the R community has for the project its members so selflessly support and grow. It is interesting to read the perspective from R-bloggers on this:

Oracle’s strange understanding of R users

February 8, 2012
(This article was first published on Quantum Forest » rblogs, and kindly contributed to R-bloggers)

After reading David Smith’s tweet on the price of Oracle R Enterprise (actually free, but it requires Oracle Data Mining at $23K/core as pointed out by Joshua Ulrich.) I went to Oracle’s site to see what was all about. Oracle has a very interesting concept of why we use R:
Statisticians and data analysts like R because they typically don’t know SQL and are not familiar with database tasks. R allows them to remain highly productive.
Pardon? It sounds like if we only knew SQL and database tasks we would not need statistical software. File for future reference.

I hope both companies are successful. I believe that the long term survival of commercial software providers hinges on the smart adoption and integration of powerful open source tools.

Tuesday, February 7, 2012

Revolution Analytics gets new CEO

Today I came across a press release from Revolution Analytics announcing that Norman Nie, founder of SPSS, had stepped down as CEO and that David Rich has left Accenture to lead the company. As one of the founders of the company and a current stockholder, I am happy to see Revolution continue to bring in talented people and grow.

Here is the press release:

PALO ALTO, Calif. – February 2, 2012 – Revolution Analytics, the leading commercial provider of R software, services and support, today announced that David Rich, who most recently led Accenture Analytics as Global Managing Director, has been named the company’s CEO. Revolution Analytics’ former CEO, industry luminary Dr. Norman Nie, will remain a company director and will serve as Rich’s Senior Advisor for Products and Strategy.
“Revolution Analytics is primed for a different set of proven leadership abilities—someone with the knowledge and relationships to accelerate large-scale adoption of the company’s first-class products within the world’s biggest and most innovative enterprises,” said Norman Nie. “David has the vision, energy, and career experience required to bring the company to the next level. Together, we believe our collaboration will help revolutionize the industry in the best interest of our customers.”

Rich is a 28-year veteran of Accenture. In his most recent position as Global Managing Director for Accenture Analytics, he drove the company’s strategy including alliances and investments in predictive analytics solutions. Rich also had global responsibility for Accenture’s CRM Service Line as well as the High Tech Industry. In his other roles, he had national responsibility for the Communications, High Tech and Media industry sectors; he led major client initiatives and developed global alliance strategies at the corporate level. His experience has made him a frequent contributor to BusinessWeek, Forbes, The Financial Times, The Wall Street Journal and other business publications. He holds a B.S. in Management and Technology from the United States Naval Academy.
“Revolution Analytics has emerged as a disruptive player in the analytics space, largely due to Norman’s vision,” said Rich. “He anticipated the challenges that enterprises would face from the rise of big data. He authored the product road map and brokered the technology partnerships that now enable Revolution Analytics to provide orders of magnitude improvement over legacy products in speed, data capacity and price performance. I feel honored that Norman has passed the torch to me. I look forward to working with him, tapping his experience and leading the team’s efforts to revolutionize the adoption of Predictive Analytics at scale.”

First Meeting of the Connecticut R Users Group

The Connecticut R Users Group will have its first meeting in April. I am proud to announce that our first speaker will be Yale Professor and R user John Emerson. Professor Emerson is an active member of the R community and is known for his bigmemory R package. He was also a recent speaker at the World Economic Forum in Davos, Switzerland.

Please join us on April 10 at 7pm.

Monday, February 6, 2012

The Great All American Sake.

I enjoy the occasional Sake. On one such occasion I was drinking with a few friends, and I wondered why there were not any Sake brewers in the United States. What followed was a series of unsupported statements about tradition and Sake and how small the market for Sake was in the United States.

I did not think much about it until a couple of weeks later, when I actually did some research on the topic. It turns out there are commercial Sake producers in the United States.

First there are the American Sake breweries owned by Japanese companies:
Ozeki Sake, Inc. started in 1977
Takara Sake USA, Inc. started in 1982
Yaegaki Sake Brewery, Inc. started in 1987
Gekkeikan Sake (USA), Inc started in 1989

Then there is the first wholly American-owned Sake brewery, SakeOne, which was founded in 1992. I have had some contact with this company, and they have been nothing but helpful and informative. They are excellent ambassadors of the Japanese art of Sake brewing and the neighborly spirit of the American Northwest.

Sake brewing has recently moved into Texas, and as usual Texans have added their own twist to the process. The Texas Sake Company was founded in Austin in 2011. Their approach to making Sake is different from most in that they use only locally grown organic products in their sake. The rice they use was brought to Texas by a Japanese delegation in 1904. I am looking forward to tasting some of their unique product.

Finally, since Sake is a brewed beverage, it only makes sense that someone would set up a sake brew pub. That the first one in the United States is in Minnesota is amazing. Moto-i will always have the distinction of being America's first sake brew pub. I have spoken to a few people who have gone there, and they universally enjoyed the experience.

Saturday, February 4, 2012

Local Ocean's Fish Farm in New York is World Class!

Last week I visited the Local Ocean aquaculture facility in upstate New York with celebrity chef Bun Lai. I am grateful that the people at Local Ocean took time out of their busy schedule to show us an absolutely unbelievable facility! This group has figured out how to raise fish that no one else can. Even more impressive, they do it in a completely closed-loop system that is environmentally friendly. They represent the future of fish production, and I cannot think of a better group of people to do it. Until I viewed their plant in Hudson, New York, I had never walked around a facility where my reaction to everything I saw was "that is a brilliant way to do that." I could tell you how amazing this company is, but their video is better:

The world becomes a better place when smart people build profitable companies that solve real problems. Local Ocean is just such a company. I hope they will expand into Connecticut in the very near future.

Wednesday, February 1, 2012

Yale Professor and R user John Emerson Speaks at the World Economic Forum

Last week Yale Professor John (Jay) Emerson was a featured speaker at the World Economic Forum in Davos, Switzerland. I know Jay as the guy who recently beat me out for the championship of our fantasy football league. People in the Northeast and the R community may know Jay from his many presentations at R user groups or from his bigmemory package. Here is the video of his presentation:


It was good to see Jay's work in such an important forum of world leaders.