One of the most exciting new features of HDP 2.6 from Hortonworks was the general availability of Apache Hive with LLAP. If you missed DataWorks Summit you’ll want to look at some of the great LLAP experiences our users shared, including Geisinger who found that Hive LLAP outperforms their traditional EDW for most of their queries, and Comcast who found Hive LLAP is faster than Presto for 75% of benchmark queries.
These great results are thanks to performance and stability improvements Hortonworks made to Hive LLAP resulting in 3x faster interactive query in HDP 2.6. This blog dives into the reasons HDP 2.6 is so much faster. We’ll also take a look at the massive step forward Hive has made in SQL compliance with HDP 2.6, enabling Hive to run all 99 TPC-DS queries with only trivial modifications to the original source queries.
Starting Off: 3x Performance Gains in HDP 2.6 with Hive LLAP
Let’s start out with a summary of runtimes between Hive LLAP on HDP 2.5 versus HDP 2.6, on an identical 9 node cluster (details at the end of the doc), using queries from the TPC-DS suite as used in previous benchmarks. Because of SQL gaps in HDP 2.5 and older versions, these queries had re-writes versus the source TPC-DS queries which are no longer needed in HDP 2.6 and are therefore referred to as “legacy queries”.
A few key callouts:
- HDP 2.6 runs this set of queries in 5155.3 seconds, compared to 14983.4 seconds for HDP 2.5, almost a 3x improvement for HDP 2.6.
- HDP 2.6 is faster than HDP 2.5 on almost all queries.
- HDP 2.6 is more than 5x faster than HDP 2.5 on 8 queries.
Now let’s dive into the reasons that HDP 2.6 is so much faster than HDP 2.5.
Hive Working Smarter, Not Harder
Hive’s core execution engine and operators are highly optimized, and designed to work end-to-end with the ORCFile columnar format and LLAP’s in-memory cache. Given this level of optimization, the majority of Hive’s performance improvements in HDP 2.6 come from more sophisticated cost-based optimization along with dynamic runtime features that minimize the overall work Hive must do to satisfy a query. You can say that with HDP 2.6, Hive is working smarter and not harder, re-using its own work or avoiding it altogether when possible.
Hive LLAP Supports the Full Suite of 99 TPC-DS Queries
Past Hive benchmarks have focused on a subset of the TPC-DS queries. In large part this was because older versions of Hive lacked SQL features used by TPC-DS queries. With HDP 2.6 Hive is able to run all 99 TPC-DS queries with only trivial modifications (defined as simple, mechanical rewrites such as changing column names/aliases, adding columns to the select list and other simple transformations.) For reference we’ve posted the full set of trivially-modified queries used in the the remainder of this blog post in the updated Hortonworks testbench repository.
Comparing Apache Hive to Apache Impala
Apache Impala is a SQL-on-Hadoop engine built specifically for interactive query. For this benchmark we compared Hive to Impala using the identical set of 99 trivially-modified TPC-DS queries, which can be found in our GitHub repository.
To summarize the results, the aggregate runtime for all queries is similar across the two engines, but Hive is able to run all 99 queries compared to only 60 in Impala.
This graph breaks down a full comparison of all runtimes, across the 60 queries which could be run in both engines.
When we drill into why Impala is only able to run 60 out of the full suite of 99 queries, this is what we find:
These missing features are not obscure SQL features. Instead they are features offered by almost all commercial SQL products and an ever-growing list of open-source SQL tools like Apache Hive. Some of these limitations can be dealt with on a query-by-query basis (such as lack of a date datatype), but such changes are impractical across an entire SQL suite or workload. Other Impala limitations require extremely difficult workarounds, such as missing support for rollups or intersections. Overall, Hive offers much better workload portability if you are considering offloading workloads from a legacy EDW solution.
Hive LLAP Performance Across All 99 Queries
For reference, this graph shows Hive performance across all 99 trivially-modified TPC-DS queries.
Hardware Configuration (Used in All Test Configurations)
9 worker nodes, each with:
- 256 GB RAM
- Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
- 2x HGST HUS726060AL4210 A7J0 disks for HDFS and YARN storage
- Cisco VIC 1227 10 Gigabit Network Connection
HDP 2.6 Software Configuration
- Ambari-managed HDP 2.6.2 stack
- Hive version = 22.214.171.124.6.2.4-4
- Tez version = 0.8.4.2.6.2.4-4
- LLAP container size: 180GB (per node)
- LLAP Heap Size: 128GB (per daemon)
- LLAP Cache Size: 32GB (per daemon)
- OS Setting: net.core.somaxconn set to 16k, ntpd and nscd running
- Data: ORCFile format, scale 10,000, with daily partitions on fact tables (load scripts available in the GitHub repository).
- Note: This HDP build will also be available in Hortonworks Data Cloud for AWS as version 1.16.5 within 7 days from this blog publication.
CDH 5.12 Software Configuration
- CDH-managed CDH 5.12 stack
- Impala version = 2.9
- Data: Parquet format, scale 10,000, with daily partitions on fact tables.