Premise: A number of Wikibon clients have asked about the state of big data benchmarks. Answer? We’ve hardly seen any standardized performance benchmarks in big data. Nor will we see any real benchmarks until the industry coalesces around common workloads.
Some big data vendors have made early attempts at performance benchmarks, but they are either woefully incomplete or misleading. Why is big data benchmarking so problematic? The closest we’ve come to a common workload definition is TPC-DS, which benchmarks SQL decision support and can be targeted at Hadoop. Spark is especially hard to benchmark meaningfully because each new point release enables new classes of complex workloads, which are neither cheap nor easy to translate into benchmarks. If you’re considering using a big data benchmark, you need to be aware that:
- Attempts at using existing, standard benchmarks have been ineffective.
- It’s too early to design new benchmarks.
- The product benchmarks that have been published typically aren’t useful.
Attempts at Using Existing Benchmarks Are Ineffective.
The Transaction Processing Performance Council (TPC) has been creating transaction processing and decision support benchmarks for decades. The benchmarks are carefully curated by the TPC; official TPC results cannot be published without the TPC’s permission. The most widely used decision support benchmark is the SQL-based TPC-DS 2.x, which modernizes the original TPC-DS benchmark for data volumes of 1-100TB.
Most attempts at benchmarking big data systems have used TPC-DS 2.x running against SQL on Hadoop. A few SQL-on-Hadoop vendors have published TPC-DS benchmark results, but the products under test are still so immature that none we know of can actually run all 99 queries in the TPC-DS test suite unmodified.
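To make the gap concrete, here is a minimal sketch of what running a TPC-DS-style query against SQL on Hadoop looks like through Spark SQL. The query is illustrative only, not one of the official 99 queries; it borrows table and column names from the TPC-DS schema (store_sales, date_dim, item) and assumes those tables are already registered in a Hive metastore.

```python
# Minimal sketch: a simplified, TPC-DS-style aggregation issued through Spark SQL.
# NOT an official TPC-DS query; table/column names follow the TPC-DS schema,
# but the query itself is illustrative only.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tpcds-style-query")
         .enableHiveSupport()   # assumes the TPC-DS tables live in a Hive metastore
         .getOrCreate())

result = spark.sql("""
    SELECT d.d_year, i.i_category, SUM(ss.ss_net_paid) AS total_paid
    FROM store_sales ss
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    JOIN item i     ON ss.ss_item_sk      = i.i_item_sk
    GROUP BY d.d_year, i.i_category
    ORDER BY d.d_year, total_paid DESC
""")
result.show(20)
```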
It’s Too Early to Design New Benchmarks.
It’s worth reiterating that industry-standard benchmarks depend on widespread use of common workloads. For the most part, a “common big data workload” remains an alien concept. Yahoo! designed YCSB (the Yahoo! Cloud Serving Benchmark) around 2010 as a benchmark for scale-out key-value NoSQL databases, but it seems to be losing favor because those databases are deployed in widely varying usage scenarios.
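For context, YCSB’s core workloads boil down to a mix of reads and updates against randomly chosen keys. The sketch below imitates that structure in plain Python against an in-memory dict; it is not YCSB itself, and the record count, value size, and uniform key distribution are assumptions (YCSB’s defaults differ, e.g. a zipfian request distribution).

```python
# Minimal sketch of a YCSB-style key-value workload mix (not YCSB itself).
# YCSB's Workload A is roughly a 50/50 read/update mix; here a plain dict stands
# in for the NoSQL store, and the get/put calls are placeholders for a real driver.
import random
import time

RECORD_COUNT = 100_000
OPERATION_COUNT = 500_000
READ_PROPORTION = 0.5

store = {f"user{i}": {"field0": "x" * 100} for i in range(RECORD_COUNT)}

start = time.perf_counter()
for _ in range(OPERATION_COUNT):
    key = f"user{random.randrange(RECORD_COUNT)}"   # uniform; YCSB defaults to zipfian
    if random.random() < READ_PROPORTION:
        _ = store[key]                              # read
    else:
        store[key] = {"field0": "y" * 100}          # update
elapsed = time.perf_counter() - start
print(f"{OPERATION_COUNT / elapsed:,.0f} ops/sec")
```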
Spark is especially hard to benchmark right now because the workloads for which it is appropriate are expanding rapidly with each major release and even with some minor releases. For example, the new Structured Streaming API in release 2.0 can only read from and write to files, which hardly qualifies it as an end-to-end streaming system. Two years ago Databricks participated in a simple benchmark comparing sorting speed on 100TB and 1PB of records, which showed Spark sorting dramatically faster than MapReduce while using far fewer machines. But its applicability is too narrow to be much more than cosmetic.
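For illustration, this is roughly the shape of a file-in, file-out job that Spark 2.0’s new streaming API could express out of the box; the paths and event schema are hypothetical placeholders.

```python
# Minimal sketch of a file-to-file Structured Streaming job, the kind of pipeline
# Spark 2.0's new streaming API supported at launch. Paths and schema are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("file-streaming-sketch").getOrCreate()

schema = StructType([
    StructField("event_time", LongType()),
    StructField("user_id", StringType()),
    StructField("action", StringType()),
])

# Source: new JSON files dropped into an input directory.
events = spark.readStream.schema(schema).json("/data/incoming/")

# Sink: append the parsed events to Parquet files, with a checkpoint for recovery.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/parsed/")
         .option("checkpointLocation", "/data/checkpoints/parsed/")
         .outputMode("append")
         .start())
query.awaitTermination()
```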
The biggest change coming to databases of all stripes is the convergence of transaction processing and decision support. We don’t yet have any good benchmarks for those use cases, probably because the workloads are just emerging.
Benchmarks Do Get Published, But Read the Fine Print.
Occasionally a SQL-on-Hadoop vendor publishes TPC-DS benchmarks, but the results typically are so heavily footnoted and qualified that Wikibon clients generally don’t find them projectable to their specific workloads. Of course, any benchmark data is better than no benchmark data, but in the big data world users need to be very clear on how they generalize benchmark results. For examples, see:
Cloudera Impala
Cloudera designed Impala to address interactive business intelligence and reporting workloads. Yet the benchmark write-up compares the product to Hortonworks’ Hive-on-Tez and Spark SQL. Hive-on-Tez’s sweet spot is primarily ETL and production reporting workloads, and Spark SQL is also focused primarily on batch workloads that integrate with Spark’s streaming and machine learning APIs. The comparisons aren’t all that useful.
Pivotal HAWQ
The post includes the caveat “this is not an official TPC-DS for a variety of reasons.” One of those reasons was that Pivotal was trying to find a subset of queries that each product under test, like Hive and Impala, could complete.
Hortonworks Hive on Tez
http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
This benchmark was based on production reporting jobs run by Yahoo! Japan. That type of workload is squarely within Hive’s sweet spot and well outside Impala’s original focus on interactive business intelligence. Since it isn’t based on any third-party benchmark standard, the exercise is really just a test of which product works best for one particular customer’s workload.
Spark
As noted above, Spark is especially hard to benchmark right now because the workloads for which it is appropriate keep expanding with each release. The Databricks-led sort benchmark is the most visible published result, but its applicability is so narrow it’s mostly cosmetic.
Flink, Spark, and Storm Streaming
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Yahoo did a comparison of Flink, Spark, and Storm streaming for an online ad scenario this past December, but it isn’t considered general enough to be a widely adopted benchmark.
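The core computation in that benchmark, as described in Yahoo’s post, is a windowed count of ad events per campaign after filtering and an ad-to-campaign lookup. The sketch below shows that logic in plain Python, leaving out the Kafka and Redis plumbing the real benchmark uses; the field names, 10-second window, and sample data are assumptions for illustration.

```python
# Minimal sketch of the computation at the heart of the Yahoo streaming benchmark:
# a tumbling-window count of "view" events per ad campaign. The event format,
# window size, and ad-to-campaign mapping are illustrative assumptions.
from collections import defaultdict

WINDOW_SECONDS = 10
ad_to_campaign = {"ad-1": "camp-A", "ad-2": "camp-A", "ad-3": "camp-B"}  # hypothetical

def window_counts(events):
    """events: iterable of (timestamp_seconds, ad_id, event_type) tuples."""
    counts = defaultdict(int)
    for ts, ad_id, event_type in events:
        if event_type != "view":                      # keep only view events
            continue
        window_start = int(ts) - int(ts) % WINDOW_SECONDS
        campaign = ad_to_campaign.get(ad_id, "unknown")
        counts[(campaign, window_start)] += 1
    return counts

sample = [(100.2, "ad-1", "view"), (103.9, "ad-2", "view"),
          (104.1, "ad-3", "click"), (111.0, "ad-3", "view")]
print(dict(window_counts(sample)))
# {('camp-A', 100): 2, ('camp-B', 110): 1}
```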
Action Item
Big data products and workloads remain too immature for standard benchmarks capable of providing truly comparable results. Users that need benchmarks for contracting or comparison purposes should run their own stylized workloads based on their intended usage scenarios.
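As a starting point, a stylized workload can be as simple as timing a handful of representative queries against your own data, repeating each to separate cold runs from warm ones. A minimal sketch, with hypothetical paths and queries:

```python
# Minimal sketch of a do-it-yourself "stylized workload" run: time representative
# queries against your own data, repeating each to separate cold-cache from
# warm-cache behavior. The table path, columns, and queries are placeholders.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stylized-workload").getOrCreate()
spark.read.parquet("/data/my_fact_table/").createOrReplaceTempView("fact")

queries = {
    "daily_rollup": "SELECT event_date, COUNT(*) FROM fact GROUP BY event_date",
    "top_customers": ("SELECT customer_id, SUM(amount) AS s FROM fact "
                      "GROUP BY customer_id ORDER BY s DESC LIMIT 100"),
}

for name, sql in queries.items():
    for run in (1, 2, 3):                      # run 1 is effectively the cold run
        start = time.perf_counter()
        spark.sql(sql).collect()               # force full execution
        print(f"{name} run {run}: {time.perf_counter() - start:.1f}s")
```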