Big Computing: August 2011

Wednesday, August 24, 2011

Floop the mobile social media opinion app has arrived

Today the App Store approved Floop, a mobile social media opinion app. It is now in the App Store, ready for you to download for free. Let your voice be heard!

Here is the blurb from the Facebook page:


When a large number of starfish travel the same path, it’s called a floop. We use the same term to describe people spontaneously coming together to share their opinions on timely topics in sports, politics, fashion, culture, entertainment -- you name it.

Will Cliff Lee pitch 5 more shutouts for the Phillies this season? Are we headed for a “double-dip” recession? Is Lady Gaga’s use of a wheelchair as a stage prop offensive? Floop it and find out what people think.

Also for live events, watch as things change as the action happens. It’s the companion app for TV, sports, radio, etc.

Flooping is fast, easy and fun. Just download floop to iPhone. Touch the screen to check out existing floops or start one of your own. Weigh in by moving the dot up or down and even leave a comment or join in chat. Invite your friends!
 
I have been beta testing Floop for a while, and it is great fun.

Friday, August 19, 2011

Bun Lai takes Miya's sushi on the road to promote Sustainable Sushi

I know sushi is not even close to my usual topic, but Bun is a cool guy running an unbelievable sushi place who has decided to try to bring about change in his industry. I have known Bun for close to forty years, and he is completely crazy. His mother used to pack sushi in his lunch bag when we were in elementary school. He runs the best restaurant in New Haven, Miya's, and he happens to make sushi. A couple of years ago, Miya's under Bun's direction changed its menu to carry only sustainable sushi. The result was that the food got even better! The change was made as a matter of conscience, but it led to improvement because it forced the introduction of new and unique flavors into sushi that had never been imagined before. Did I mention that he also makes his own flavored sake that will dazzle your taste buds and kill your brain cells? Awesome!

Word of Miya's has spread and has afforded Bun the opportunity to spread the word about good, responsible sushi. He has jumped at this opportunity to be the Johnny Appleseed of sushi. I could go on, but I will let the man himself fill you in:

Bun Lai

Next week, Chef Pablo, Nate and I will be leaving via RV to Seattle where I will be a keynote speaker for the American Fisheries Society’s 141st Annual Meeting. We will be teaching my sustainable sushi recipes to sushi chefs from five Seattle restaurants. Original recipes are proprietary and are the chef's gold but ours were created to be shared so that people would be able to eat sushi in a way that was kinder to the planet.  Fifteen years ago, I created the sweet potato sushi roll which can now be found in almost every sushi restaurant in America. The historical significance of the sweet potato roll is that it showed people, for the first time, that sushi didn’t have to be made from seafood. It was a small step towards the sustainable sushi that we do today. Ideas can spread rapidly. With this trip, we plan to be the Johnny Appleseeds of sustainable sushi. We will introduce people to a more sustainable way of thinking about sushi and these ideas will ripple around the world. On the final dinner for the event, I will take 800 dinner guests on a gustatory journey (which will be a lot like being on a time machine); they will experience the future of sushi; made with hyper local sustainably farmed or caught ingredients. During the trip, we will only be purchasing foods that are organic; also, we plan to forage and fish invasive species on a daily basis. Thanks for all of your support! Nobody ever makes a difference without the help and support of many. We couldn’t do what we do without you! Thank you!



:)Bun

Good luck in your travels, Bun! Wherever you are cooking, I will be there!

Wednesday, August 17, 2011

Cool Presentations/Videos on Redis, R, Hadoop and angel wings

This is less a post than a set of links to things I ran across this month that I love.

The first link is to a video presentation by Paul Dix on Redis. Redis is awesome! Paul does mention that, but he also explains why people thought he would be wearing angel wings to this presentation. I have always known Paul as a straight man, but he is really funny here.

Paul Dix on Angel Wings and Redis


Ryan Rosario, of DataJunkie Twitter fame and the cool blog Byte Mining, posted his slides from the Los Angeles RUG on Parallelization in R and Large Data Sets in R. This talk is reason enough to go to the LA RUG if you are ever in town. Ryan sent me the link to his talk because of the little piece I did about R 2.14 including a parallel package.


Parallelization in R
Large Data Sets in R

I also went to the Greater Boston R user group meeting on July 28. Amy Szczepanski gave a great talk on doing high-performance computing with R. I enjoyed how she went through the evaluation and decision-making process for implementing HPC. I had never seen anyone do that in this format, and it was very well done. Amy travels quite a bit, and I would strongly encourage other R groups to get her on their meeting calendars!


Amy Szczepanski HPC with R talk




Tuesday, August 16, 2011

The Log Cabin goes all SABR in Colorado

One of the blogs I read all the time is The Log Cabin by Ryan Elmore. I particularly like his sports stuff because he looks at questions I had never even thought of. He recently gave a talk at the Rocky Mtn SABR Meeting featuring sports, R and ggplot2. That is about my vision of a perfect night, so I am posting the link below. Enjoy!


Slides from Rocky Mtn SABR Meeting

Monday, August 15, 2011

Big Data Analysis can require more than just Hadoop

Bryan Lewis gave a talk last week at the New York CTO Breakfast on Big Data Analytics with a focus on Algorithms, Storage and Communication. I have attached the handout of that talk here for your enjoyment.


Bryan W. Lewis bryan@bigcomputing.com




You’ve got (a lot of) data. What would you like to do with it? Summarize it? Make predictions using it? Can your desired analysis approach work on partitioned data? If not, then what? Is latency important? Are data updated slowly or quickly? What about pragmatic issues like hardware failure?

There has never been a better time to ask such questions than today. I am awed by the capability and variety of possible solutions.

Big data analysis traditionally involved big, vertical software solutions. That is changing. We will discuss some new ideas for analysis of large data, highlighting ideas that are less well-known than they should be. Our discussion is organized across three broad topics: Algorithms, Storage, and Communication.

Algorithms

Consider the well-studied business problem of constructing recommendation systems. The data generally consist of customer attributes, customer purchase histories, customer/product ratings, and other product attributes. They can often amount to many billions of data points, often arranged as very sparse numeric arrays (not every customer buys every item, for example).

The business problem is to use those data to recommend products that customers are likely to like and buy.

Unsupervised machine learning (ML) methods are popular for building recommendation systems, and ML is a very hot topic these days. Clustering methods are unsupervised ML methods that group customer preferences by identifying clusters of nearby points when the data are plotted in appropriate scales. You may have heard of widely used methods like k-means clustering or k-nearest neighbors.
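As a concrete illustration, here is a minimal sketch of k-means clustering in R on simulated customer-preference data; the matrix, its column names, and the choice of three clusters are assumptions made purely for the example.

# Minimal k-means sketch on simulated customer-preference data.
# The data, the feature names, and k = 3 are illustrative assumptions only.
set.seed(1)
prefs <- matrix(rnorm(1000 * 5), nrow = 1000, ncol = 5,
                dimnames = list(NULL, c("sports", "fashion", "tech", "music", "travel")))

# Scale first: k-means is sensitive to the relative scale of each feature.
fit <- kmeans(scale(prefs), centers = 3, nstart = 10)

table(fit$cluster)   # how many customers land in each cluster
fit$centers          # cluster centers in the scaled feature space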

Clustering algorithms tend to exhibit greatly increased computational cost with increasing data size. That can become a serious issue for some product recommendation problems. Even problems with relatively modest storage requirements by today's standards can result in prohibitively expensive computation.

Clustering algorithms can also get confused by some data, reducing the quality of their solutions (recommendations, in this case). Researchers are interested in improving the quality of recommendations without over-specializing models to specific data sets.

Projection methods represent another class of unsupervised ML algorithms. They typically reduce or compress the data by eliminating redundancy. The singular value decomposition (SVD) is one of the most important of these methods.

Alex Lin[1] has recently given a number of nice talks for the NYC Predictive Analytics group advocating combinations of SVD and clustering for recommendation systems. The general idea is to compress the data with the SVD and then cluster it. The clustering algorithms require less computational effort on the reduced-sized problems and, as Alex has shown, can exhibit better quality recommendations because the SVD has "cleansed" the data of confusing redundancies.

(Combinations of ML methods are often quite successful, and have been essential to the solution of very challenging problems like the Netflix prize.)

But the SVD is computationally expensive too (although not as much as traditional clustering). Fortunately there are some relatively recent state of the art methods for the efficient computation of the SVD for these kinds of problems. My favorite is the IRLB algorithm by Jim Baglama[2]. The method is presently available in MATLAB and R. It works well with sparse and dense data, it is straightforward to parallelize, and it effectively exploits high-performance linear algebra kernels of modern CPUs and GPUs. It's pretty amazing--I was able to compute a basic partial SVD of the Netflix prize data set in a few minutes on a laptop[3] with it.
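To make the compress-then-cluster idea concrete, here is a minimal sketch using the irlba and Matrix packages in R; the simulated ratings matrix, its dimensions, and every parameter choice below are illustrative assumptions, not a recipe tuned for real data.

# Sketch: partial SVD of a sparse ratings matrix with irlba, then k-means
# on the reduced representation. All sizes and parameters are illustrative.
library(Matrix)
library(irlba)

set.seed(1)
n <- 200000   # simulate ~200,000 ratings in a 50,000 x 2,000 customer/product matrix
ratings <- sparseMatrix(i = sample(50000, n, replace = TRUE),
                        j = sample(2000,  n, replace = TRUE),
                        x = sample(1:5,   n, replace = TRUE),
                        dims = c(50000, 2000))

# Partial SVD: compute only the 20 largest singular values/vectors.
s <- irlba(ratings, nu = 20, nv = 20)

# Cluster customers in the compressed 20-dimensional space instead of the raw data.
fit <- kmeans(s$u %*% diag(s$d), centers = 25, nstart = 5)
head(table(fit$cluster))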

Why do we care about a choice of algorithm like IRLB? Let's compare algorithm costs using the Netflix Prize data:

Algorithm     Memory Cost    Computation Cost
Naive SVD     64 GB          5 TFlop
IRLBA         < 1 GB         < 5 GFlop
Of course, we shouldn't (and in some cases couldn't) use a naive SVD approach to solve this. IRLBA also compares favorably (often by about an order of magnitude) with other methods, such as the one used in the Apache Mahout ML library[4] associated with the Hadoop project.

A different class of problematic clustering problems consists of those that operate on unbounded streams of data. The Java-based Massive Online Analysis[5] project, related to the Weka data mining project, might be more appropriate in such cases. It implements a number of state-of-the-art updating clustering algorithms. The idea is to compute clusters on a subset of the data, store a reduced set of statistics about the clusters, and then update those statistics as new data arrive.

Functions of Graphs

Let's think briefly about a different interesting example: evaluating functions of graphs. Such problems crop up, for example, in social networks, protein folding, communications, and epidemiology.  Some of the basic quantities we might compute on graphs include the number of other nodes connected to each node, how long the network path is between each pair of nodes, measures of graph centrality, and many others. As anyone working at LinkedIn will attest, one can spend a considerable amount of time computing these quantities for large graphs.
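For readers who want to poke at these quantities themselves, here is a small sketch using R's igraph package; the random graph and its size are made up for illustration, and the function names follow the current igraph API.

# Sketch: a few basic graph quantities with igraph. Graph size is illustrative.
library(igraph)

set.seed(1)
g <- sample_gnp(1000, 0.01)    # random graph: 1,000 nodes, 1% edge probability

deg <- degree(g)               # number of neighbors of each node
btw <- betweenness(g)          # a common centrality measure
apl <- mean_distance(g)        # average shortest-path length between node pairs

summary(deg)
head(sort(btw, decreasing = TRUE))
apl

Exact computations like these get expensive quickly as graphs grow, which is exactly the cost the estimation methods below attack.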

Michele Benzi and Paula Boito[6] have recently come up with new methods to very efficiently estimate some of these basic quantities. They've taken ideas related to those used for efficient computation of the SVD and applied them in a clever way, in a quite different context, to graph problems. The results are remarkable--they estimate quantities within known bounds with several orders of magnitude lower computational effort than more typically used algorithms. Their methods are easily parallelized and work well on sparse data. I expect that these results will be rapidly adopted and expanded to the computation of more complicated functions of graphs.

Storage

The storage of data in ways that promote efficient data analysis is an incredibly active and rapidly developing topic. It covers a lot of ground: from traditional data warehousing and relational databases to new distributed file and NoSQL systems. Let’s start with an example application.

Clinical Looking Glass (CLG)[7] from Montefiore Medical Center is a web-based health care outcomes analysis and decision support application. It provides a wide variety of actuarial and predictive models that work on large-scale data (up to millions of patient records over many years). Data reside in traditional relational database systems and can be sliced and diced by queries to support longitudinal or cross-sectional analyses. CLG presents a simple graphical interface in users' web browsers. Concrete examples of the use of CLG include retrospectively reproducing published trials on large real-world populations and computing survival models on large populations.

The business challenges for Montefiore include adapting state of the art methodologies to the web-based application framework in a way that can scale easily to handle large problems. Users expect reasonably fast response times, which is a critical design consideration. The system must support the best algorithms implemented in a variety of languages. The analysis functions must be connected to results of the complex queries against the DBMS generated by the web application. Many methods are computationally intensive and require parallel computing implementations to satisfy the fast response design requirement. Intermediate results of computations must occasionally be shared among parallel workers in pipelined computations.  We desire stateful interaction between the web client and analysis engines to allow for user-steered interactive data exploration.

A big central DBMS alone was not easily horizontally scalable. We implemented a hybrid solution that supplements existing relational data storage with in-memory storage provided by the Redis NoSQL database, one of my favorites.

Redis is one of a wave of popular new NoSQL databases that includes Voldemort, Riak, SimpleDB, Cassandra, HBase, Memcached, CouchDB, LevelDB, MongoDB, and many others. Redis is a networked key-value store that supports really interesting structured values: stored values can be sets and queues, for example. Redis also provides a publish/subscribe model, uniquely placing it at a convergence of databases and messaging. We chose to use Redis in CLG for its flexibility: we use it both as an in-memory networked data cache and to coordinate parallel computation through work queues. Redis is also simple to use and configure, and works well across OS platforms (both important to us).
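A tiny sketch of talking to Redis from R with the rredis package gives the flavor (this assumes a Redis server is listening on localhost; the key names and values are made up):

# Sketch: Redis as a networked cache and work queue from R via rredis.
# Assumes a local Redis server; keys and values are illustrative.
library(rredis)
redisConnect()

# Simple key-value caching of an arbitrary R object.
redisSet("model:latest", lm(dist ~ speed, data = cars))
fit <- redisGet("model:latest")

# A work queue: push tasks on one side, pop them on the other.
redisLPush("tasks", list(id = 1, symbol = "IBM"))
task <- redisRPop("tasks")

redisClose()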

Our hybrid DBMS/Redis implementation decouples parallel worker functions from the main DBMS system. The parallel computation is organized in queues, also managed by Redis. I believe that queues are the best way to organize parallel computing tasks. It's an old but good idea: anonymous worker functions block for work from a generic queue. Master functions place tasks in the generic queue, and the tasks are withdrawn by workers on a first-come, first-served basis. The master then looks for results in a unique result queue. Workers place results in the appropriate result queue and block for more work, and so on.

This simple approach to organizing parallel computation has a few important features. It's naturally load balanced, and it is elastic by design since workers may be added at any time. Redis additionally makes it trivial to set up fault tolerance so that failed tasks get re-queued; thus, workers may also be removed at any time. Cross-platform online examples (soon including ready-to-use EC2 AMIs) may be found here: http://illposed.net/doRedis.html. Compare the simplicity of this approach with the relative complexity of Zookeeper[8], for example.
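For flavor, here is roughly what the queue approach looks like with the doRedis package; the queue name and the toy computation are placeholders, and a Redis server on localhost is assumed.

# Sketch: elastic, fault-tolerant parallel computation with doRedis work queues.
# The queue name "jobs" and the toy task are illustrative; assumes local Redis.
library(doRedis)

registerDoRedis("jobs")                      # the master registers a work queue
startLocalWorkers(n = 2, queue = "jobs")     # workers can start here or on any machine, at any time

result <- foreach(i = 1:100, .combine = sum) %dopar% {
  sum(rnorm(1e5))                            # stand-in for a real chunk of work
}

removeQueue("jobs")

Because workers simply pull from the queue, adding capacity is just a matter of pointing more machines at the same Redis server.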

This implementation has helped improve CLG performance by more than 10x for some data- and computationally intensive algorithms. And, the parallel back-end analysis functions can be run across clusters not directly connected to the main data warehouse.

Clouds of Opportunity

Quantitative financial methods sometimes involve computationally and data-intensive back-testing of models over historic data sets. Back-testing is also an embarrassingly parallel (easy to implement in parallel) problem. We worked with a quant desk at a mid-town hedge fund to help them implement a parallel back-testing strategy to run on a Linux cluster in their data center. Using the ideas described above, they were able to add the significant, and otherwise unused, horsepower from their desktop PCs overnight by signing them up to work queues when they went home. With fault-tolerant job resubmission, the quants simply terminated any running jobs in the morning. In the old days, we called this “resource scavenging.” Redis makes this kind of thing really trivial to implement.

Communication

Mike Kane[9] is working on models for anticipation of trading up/down limits (circuit breakers) under the new SEC rules for financial markets. The models are trained on huge volumes of historic daily stock trade and quote data using Amazon EC2, R and Redis[10]. Once developed, the idea is to apply the model to real time trade and quote data to anticipate an up/down limit and possible associated trading opportunities. Most data processing and parallel computing models can’t respond fast enough to handle trade and quote data in real time. The United States consolidated market exchanges presently average over 200,000 quotes and 28,000 trades per second, with peaks of almost 50,000 trades per second[11]. The latency required to keep up with that stream of market data is on the order of 3ms.

Complex event processing (CEP) is a database turned inside out. Traditional databases run queries against stored data; CEP engines stream data through stored queries. CEP queries can filter and aggregate results in ways similar to traditional database operations, and they are designed to do this quickly, with extremely low latency. Mike is working with Sybase/Aleri to adapt sophisticated statistical models from R into the Aleri CEP engine. A single Aleri CEP engine can process 300,000 events/second with 1.5ms latency[12] on garden-variety Intel CPUs.

If we want to embed analytics from languages like R in a CEP engine like Aleri, we only have a few approaches that can keep up with high data rates. One commonly used approach is shared memory--that is linking via compiled code and/or using POSIX or similar shared memory resources across programs. This is the approach that Mike and I took to integrate R with the Esper CEP engine, for example[13].

The shared memory approach is viable, but potentially difficult to implement. It lacks the conceptual appeal and simplicity of the queueing approaches described above. One must beware of thread safety issues, deadlocks, and other potential problems.

An alternate approach uses the 0MQ (ZeroMQ) library. 0MQ is a conceptual extension of sockets--sockets on steroids. In general, it satisfies what we always wished sockets could do, in a way that is quite simple to use. Endpoints can come and go dynamically, multiple endpoints may be assigned to one “socket,” messages are generally content-oriented (instead of byte-oriented), “sockets” are uniformly defined over multiple fabrics (IPC, networks, etc.), and it supports pub/sub, queues, and peer-to-peer communication, and on and on. 0MQ grew out of, but eventually separated totally from, the AMQP protocol developed at J. P. Morgan.

0MQ is serverless--every node is a peer. That makes it naturally horizontally scalable. It's uniquely positioned (among messaging libraries) as a communication framework for parallel/distributed computation. It provides low-latency (as low as microseconds, depending on network fabric), high-throughput (tens of millions of messages/sec) communication, while also not stressing the CPU as much as traditional TCP/IP and some other protocols. Idiomatic 0MQ libraries are available for many languages (I am presently working on an R implementation).

Websockets and Summary

The landscape for analysis of large data today is very diverse. Many more languages and technologies are likely to be involved in your project today than just a few years ago.

Consider web-based applications. We have integrated R and other analysis languages in big vertical web services stacks like IIS/.NET, but integration can be awkward: one must marshal C# data to unmanaged C++ and then to R for analysis and then reverse the whole process to return results. The stateless/RESTful design of many web applications, often advantageous, in our case precludes stateful client/back-end interaction.

The new Websocket API provides a beautiful solution that can directly couple Javascript clients to back end analytic services in a very interactive way. JSON is the natural serialization format between client and analysis service for small data. This lightweight coupling of analytics service and web client totally avoids web services middleware, in our case making everything much simpler.
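For the small results that flow between the client and the analysis service, JSON serialization in R is essentially a one-liner; here is a minimal sketch with the RJSONIO package (the payload structure is illustrative, and rjson works similarly):

# Sketch: JSON as the serialization format between an R analysis service
# and a Javascript client. The payload structure is illustrative only.
library(RJSONIO)

fit <- lm(dist ~ speed, data = cars)
payload <- list(coefficients = coef(fit),
                r.squared    = summary(fit)$r.squared)

msg <- toJSON(payload)   # send this string to the browser over the websocket
fromJSON(msg)            # the same library parses incoming client messages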

I am not alone in my enthusiasm for this approach. Kaazing[14] (the original developer of the Websocket API) and Push Technology[15] are two commercial enterprises behind websockets, and their kit is on fire today in financial, gaming, and other markets.

We have discussed the emerging combination of good algorithms in diverse languages, backed by NoSQL and other innovative data storage methods, and organized with queues and modern messaging methods, including Websockets for direct client interaction. It’s a great way to do large data analysis.

Tuesday, August 9, 2011

MLB Attendance - Do fans come to the ballpark because of loyalty or wins?

In the book The Extra 2% by Jonah Keri there is a short discussion about the Rays' success on the field not translating into success at the turnstile. Many reasons are given for this lack of connection between winning and attendance. I also came across an article claiming that the recent success of the Pittsburgh Pirates is bringing fans back to the stadium. It got me thinking about what brings fans to the stadium.

So far in the 2011 season the best team has a .652 winning percentage and the worst team has a .330. While it is true that the Phillies, with the .652 winning percentage, have the highest attendance percentage in baseball, the lowly Houston Astros with the worst winning percentage of .330 are 17th in attendance percentage. The Chicago Cubs are fifth in home attendance percentage even though they are one of the worst teams in baseball and sixteen games out in their division.

There are also good teams that do not get fans in the stadium. Division contenders Cleveland and Arizona are ranked 25th and 26th in home attendance. They both fare better on the road, with rankings of 11th and 19th; about 5,000 more people per game attend either team's away games than its home games.

So in general winning teams do better in attendance than losing teams, but it is not a hard-and-fast rule. The Chicago Cubs draw fans both home and away in spite of poor on-field performance. Winning teams can also fail to attract their local fan base even while fighting for a division lead. Even a new ballpark does not seem to determine attendance, as Arizona's opened in 1998 and Cleveland's in 1994. Cleveland's Progressive Field was ranked as Major League Baseball's best park in 2008 in a survey done by SI.

The conclusion is that there are multiple factors in bringing fans to the ballpark, and if a team wants fans in the stadium it must address all of those factors. Winning alone will not put fans in the seats unless the locals identify with the brand. Losing alone will not drop a team's attendance if the team's brand is strong enough.

Wednesday, August 3, 2011

Information is leaking out on R 2.14 (rstats)

Yesterday at JSM in Miami, Luke Tierney talked about some of the new features that will be included in R 2.14 (#rstats to you Twitter addicts). Given the venue and one of the few persistent complaints about R, it is no surprise that several of the new features are all about speed, speed, speed.

The first feature is that the byte compiler, introduced in R 2.13 for developers to use on their own functions (with speedups of up to 5x), will be applied across the board in R 2.14. This means that all standard functions and packages will be pre-compiled into byte code. This is a godsend for R users like me who have had to accept slow code because we could not speed it up by dropping down to C. It may also end people complaining about the speed of Hadley Wickham's ggplot2. Well, probably not, but ggplot2 gets its fair share of love. I went to a talk by Amy Szczepanski from the University of Tennessee at the last Boston area useR group, and her group does a lot of cool HPC stuff with ggplot2 and other packages by Hadley Wickham.
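The byte compiler has actually been available to users since 2.13 through the compiler package, so you can get a feel for it today; a quick sketch (the toy function and vector size are illustrative):

# Sketch: byte-compiling a user function with the compiler package (in R since 2.13).
# R 2.14 applies this automatically to the standard packages.
library(compiler)

slow_sum <- function(x) {
  s <- 0
  for (i in seq_along(x)) s <- s + x[i]   # loop-heavy code benefits most
  s
}
fast_sum <- cmpfun(slow_sum)   # byte-compiled version of the same function

x <- runif(1e6)
system.time(slow_sum(x))
system.time(fast_sum(x))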

Tierney also said that the new version may make transparent use of parallel processing for some operations on multi-core machines. This is great news, as I have always felt that easy or automatic parallel processing is a critical component of R's development. The growth of cloud computing has made it even more so.
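Some of this is already within reach: R 2.14 will also ship a new parallel package that folds in much of multicore and snow. A sketch of explicit parallelism with it (the toy simulation is illustrative; mclapply forks, so on Windows you would use makeCluster and parLapply instead):

# Sketch: explicit parallelism with the parallel package shipping in R 2.14.
# The simulated task is illustrative; mclapply() works on Unix-alikes.
library(parallel)

sim_one <- function(seed) {
  set.seed(seed)
  mean(replicate(100, median(rnorm(1e4))))   # stand-in for a real simulation
}

results <- mclapply(1:8, sim_one, mc.cores = 2)
unlist(results)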

If you cannot wait for the release later this year, there is a development version available for download.

Monday, August 1, 2011

Why the US Congress cannot fix the budget problem

The financial problems faced by the federal government are not new; they have been building for decades. The American people knew they were coming, and those in charge in the federal government were not blind to the issues. Why were these issues never addressed?

I believe the answer lies in the design of our federal government and the reasons behind its particular organization. The federal government of the United States was set up by a group of men who feared tyranny. They understood that the best way to prevent that kind of problem was to create a government that was weak and inefficient. Even as the federal government has become more powerful over the last 200+ years, it has counterbalanced that power by becoming more cumbersome and inefficient. The Senate has grown from 26 members to 100, and the House has grown from 64 members to well over 400. In addition, Congress costs about $6B a year, a cost that increases at a rate of roughly 14% -- a true model of inefficiency.

So why can't Congress address the debt crisis? Simply stated, because it was never designed to do that. It was designed to prevent tyranny.