In February 2014, UC Berkeley AMPLab published their latest benchmark comparing the performance of Amazon Redshift, Apache Hive, AMPLab Shark, Cloudera Impala and Apache Stinger/Tez. This round included more recent versions of each analytics framework and added Apache Tez to the test group. As one would expect, the latest versions of Redshift, Impala and Shark provide increases in performance across the majority of test cases, with the greatest improvements seen on the largest data sets. While Shark’s in-memory performance did regress on the scan query, the benchmark does position Shark as a strong general purpose distributed SQL query engine.
Shark runs on Apache Spark, which recently graduated to a top-level Apache Software Foundation project and is a high performance, in-memory, general purpose distributed computing platform. Spark’s ability to cache results in memory makes it particularly efficient for iterative processing, as is the case with machine learning problems, where the same data set is read again on every pass. The tests conducted as part of the benchmark clearly do not play to the strength of the underlying Spark platform, but they do level the playing field between the competing on-disk and in-memory platforms. I took this as a clear indication that Spark is being positioned as a general purpose computing platform rather than a niche platform for machine learning.
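To make the caching point concrete, here is a minimal sketch in plain Python. It is not Spark code: `CountingSource`, `LazyDataset`, `fit_mean` and `run` are hypothetical names invented for illustration. The toy `cache()` method only loosely mirrors the idea behind calling `cache()` on a Spark RDD. The sketch counts how many times an iterative algorithm has to go back to the "disk" with and without caching.

```python
from typing import List, Optional


class CountingSource:
    """Hypothetical stand-in for an expensive on-disk read; counts each read."""

    def __init__(self, rows: List[float]) -> None:
        self._rows = rows
        self.reads = 0

    def load(self) -> List[float]:
        self.reads += 1
        return list(self._rows)


class LazyDataset:
    """Toy lazy dataset: reloads from the source on every action unless cached."""

    def __init__(self, source: CountingSource) -> None:
        self._source = source
        self._cache_enabled = False
        self._cached: Optional[List[float]] = None

    def cache(self) -> "LazyDataset":
        # Mark the dataset so the first materialisation is kept in memory.
        self._cache_enabled = True
        return self

    def collect(self) -> List[float]:
        if self._cached is not None:
            return self._cached
        rows = self._source.load()
        if self._cache_enabled:
            self._cached = rows
        return rows


def fit_mean(data: LazyDataset, iterations: int = 10, lr: float = 0.5) -> float:
    """Estimate the mean by gradient descent; one full data pass per iteration."""
    estimate = 0.0
    for _ in range(iterations):
        rows = data.collect()
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= lr * gradient
    return estimate


def run(cached: bool, iterations: int = 10) -> int:
    """Return how many times the source was read during training."""
    source = CountingSource([1.0, 2.0, 3.0, 4.0])
    data = LazyDataset(source)
    if cached:
        data.cache()
    fit_mean(data, iterations)
    return source.reads


if __name__ == "__main__":
    print("reads without cache:", run(cached=False))
    print("reads with cache:", run(cached=True))
```

Without caching, the source is read once per iteration; with caching it is read once in total. A single-pass scan query, like those in the benchmark, gets no such benefit, which is why the tests do not favour an in-memory platform.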
The positioning of Spark is further reinforced by Matei Zaharia, CTO of Databricks, in a Quora discussion in which he outlines the value of being able to perform ETL, run a training algorithm and run a report all on the same in-memory data using a consistent programming interface.
A number of reports have commented on how Spark’s rise will impact the future of MapReduce. Of note is Derrick Harris’ recent article on Gigaom, “As MapReduce fades, Apache Spark is now a top-level project”, in which he writes: “MapReduce was fun and pretty useful while it lasted, but it looks like Spark is set to take the reins as the primary processing framework for new Hadoop workloads”. Clearly the future of MapReduce looks a little bleak.
In the future, it is quite possible that the UC Berkeley AMPLab benchmark will be extended to include iterative tests. If this happens, an inability to pin intermediate results in memory will put an analytics framework at a disadvantage to Shark running on Spark. Two engineers at Cloudera could see how important it was to provide in-memory caching within Apache HDFS for the future adoption of Impala. In July 2013, they started implementing it. At the end of January 2014, the work was completed and Cloudera confirmed that it would be part of Hadoop 2.3.0 and CDH 5.0. Clearly this was an important and necessary addition to HDFS.
2014 is shaping up to be full of innovation within the field of data analytics, not just within the foundational computing frameworks, but also in the areas of data preparation, visualisation and machine learning. Next week’s Gigaom Structure Data conference has a great line-up of speakers and promises to be a fantastic two days. I for one am eagerly looking forward to it.