2. OLAP-Style Reporting
OLAP is a very traditional approach to reporting that evolved from the data warehouse world. It has crossed over to the data lake world with new products that offer OLAP-style modeling on Hadoop-based data with standard OLAP query interfaces (MDX and XMLA). Once an OLAP model has been defined, users can query data from within Tableau to build reports and dashboards.
The main advantage of connecting Tableau to an OLAP-style model on Hadoop is familiarity. It is a concept Tableau (and other traditional BI tools) know well. The dimensions and metrics in an OLAP model are typically well understood by business users, making it easy to navigate the structure to find specific answers.
A secondary advantage is that the OLAP model itself can encompass a high number of dimensions and metrics from the larger data sets residing on Hadoop. This provides more data for the Tableau users to work with. However, as we will see under the cons, having access to the data is only part of the problem.
The first problem with this approach is performance. OLAP on Hadoop does not materialize the cubes. Instead, it uses virtual models that, when queried, push sub-queries down to Hive to assemble the data and perform calculations. This virtualization process makes OLAP queries painfully slow.
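To make the pushdown step concrete, here is a minimal sketch of how a virtual OLAP layer might translate a dimensional request into a Hive aggregation query. Everything here (the function name, table, and columns) is illustrative, not any vendor's actual API:

```python
# Hypothetical sketch: translating an OLAP-style request (dimensions +
# metrics) into SQL that gets pushed down to Hive. Not a real product API.

def build_pushdown_sql(table, dimensions, metrics):
    """Build a Hive aggregation query from an OLAP-style request.

    Each metric is a (function, column) pair, e.g. ("SUM", "revenue").
    """
    select_cols = list(dimensions) + [
        f"{func}({col}) AS {func.lower()}_{col}" for func, col in metrics
    ]
    return (
        f"SELECT {', '.join(select_cols)} "
        f"FROM {table} "
        f"GROUP BY {', '.join(dimensions)}"
    )

sql = build_pushdown_sql(
    "sales_fact",
    dimensions=["region", "product_line"],
    metrics=[("SUM", "revenue"), ("COUNT", "order_id")],
)
print(sql)
```

Every slice a user drags into Tableau triggers a fresh round trip like this through the virtual layer to Hive, which is exactly where the latency comes from.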
These products attempt to use methods that try to predict user queries and cache results to speed up performance. But data lakes are most often used for highly diverse discovery queries, which do not lend themselves to cache hits that would help performance.
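The caching problem can be sketched in a few lines. This is an assumed, simplified model (not any vendor's implementation) of a result cache keyed on the query's shape, showing why repeated reports hit the cache while discovery queries miss:

```python
# Simplified sketch of query-result caching keyed by query shape.
# Illustrative only; real OLAP-on-Hadoop caches are more sophisticated.

cache = {}
hits = misses = 0

def cached_query(dimensions, metrics, run_query):
    """Return a cached result when the (dimensions, metrics) shape repeats."""
    global hits, misses
    key = (tuple(sorted(dimensions)), tuple(sorted(metrics)))
    if key in cache:
        hits += 1
        return cache[key]
    misses += 1
    result = run_query(dimensions, metrics)  # expensive Hive round trip
    cache[key] = result
    return result

fake_hive = lambda dims, mets: f"rows for {dims} / {mets}"

# Standard reporting: the same query repeats, so the second call hits.
cached_query(["region"], ["SUM(revenue)"], fake_hive)
cached_query(["region"], ["SUM(revenue)"], fake_hive)

# Discovery: every query has a new shape, so every call misses.
cached_query(["product_line"], ["AVG(discount)"], fake_hive)
cached_query(["region", "channel"], ["COUNT(order_id)"], fake_hive)

print(hits, misses)  # 1 hit from the repeated report, 3 misses
```

The repeated report produces the only cache hit; each exploratory query pays the full Hive round-trip cost, which is the pattern described above.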
And speaking of data discovery, the whole idea behind data exploration on data lakes is that you don’t necessarily know where to explore, so you need a more open framework to understand and identify relationships and patterns in the data. Creating a set of pre-defined dimensions and metrics restricts the exploration process, thereby limiting where and how an analyst can discover answers.
These two limitations tend to relegate OLAP on Hadoop to situations where an organization is migrating standard reporting from a data warehouse to a data lake, not performing data discovery. Running many reports across similar dimensions and metrics can boost cache hit rates and, in turn, performance.
Lastly, introducing another product into your stack adds further disruption to your governance processes. The data preparation solution will have its own governance, the data in Hive must be governed separately, and the OLAP on Hadoop platform then adds a third layer of governance, complicating the administration and control of these data sets.
3. Natively Delivered
Tableau has a native columnar format called TDE (Tableau Data Extract) which is highly efficient at storing data you want to use inside Tableau. A data engineering platform such as Datameer can export data at the end of a pipeline into TDE format so it can be natively consumed by Tableau.
The natively delivered solution shrinks your data delivery stack to two components – the data engineering platform and Tableau. The fewer products in your stack, the better for end-to-end performance, integration, governance and administration. Data sets are produced by one best-of-breed product and delivered to, or consumed by, another. This contributes to a faster overall end-to-end analytic cycle.
This approach also provides increased flexibility for creating different data sets from the same pipeline that can be used and shared across many Tableau users. Within the same pipeline, different output data sets can be crafted, exported to TDE format and consumed by the various users.
Different Tableau users solving diverse analysis problems will need data sets that are organized in varying ways. One analysis may require specific types of aggregations, another might need analytic enrichment with “smart” algorithms and a third might need to be organized into time-series events. A single platform that can accommodate different types of modeling and export to the native Tableau format can deal with the analytic diversity of a modern data-driven corporation.
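The "one pipeline, many output data sets" idea can be sketched as follows. The records and shaping functions are hypothetical stand-ins; a real platform like Datameer would express these as pipeline steps rather than Python functions:

```python
# Hedged sketch: one shared pipeline output, shaped into different
# data sets for different Tableau audiences. All names are illustrative.

from collections import defaultdict
from datetime import date

# Shared upstream pipeline output: one cleaned, joined record set.
orders = [
    {"day": date(2024, 1, 1), "region": "west", "revenue": 120.0},
    {"day": date(2024, 1, 1), "region": "east", "revenue": 80.0},
    {"day": date(2024, 1, 2), "region": "west", "revenue": 200.0},
]

def aggregate_by_region(rows):
    """Output 1: regional roll-up, e.g. for a sales dashboard."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["revenue"]
    return dict(totals)

def daily_time_series(rows):
    """Output 2: time-series shape, e.g. for trend analysis."""
    series = defaultdict(float)
    for r in rows:
        series[r["day"]] += r["revenue"]
    return sorted(series.items())

by_region = aggregate_by_region(orders)
by_day = daily_time_series(orders)
# Each shaped data set would then be exported to the native Tableau
# extract format for its own audience of Tableau users.
```

The point is that both outputs derive from the same governed pipeline; only the final shaping step differs per audience.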
The result of this combination is that the data delivered to Tableau users is that much richer and that much deeper. It leaves you better prepared for the problem you’re trying to solve and enables you to start exploring your data in a different way.
As with anything, there is always some con. Historically, Tableau has had size limitations on the amount of data it can consume via imports, limiting its ability to work with large data sets. Using the TDE format helps improve consumption volume in terms of numbers of rows. Additionally, recent releases have shown Tableau handling more data, and this could continue to increase over time.
You also lose the tail end of your governance trail when the data is exported and published to Tableau. But governance is maintained within your data engineering platform, such as Datameer, all the way to the final data set, which helps you secure and track the majority of the cycle.
Here at Datameer, we’ve seen customers use each of these three models. All work and can be effective for specific analytic and reporting scenarios.
But our biggest piece of advice to you is to ensure you have a strong data engineering platform so you have the optimal data sets and don’t have to generically run SQL queries against Hive. Choose a platform that will feed any of those consumption mechanisms to make your life easier.
That way, you can get started faster with creating your visualizations in Tableau—and make it that much easier for data discovery, and that much faster for your business teams to start utilizing the data you’ve been collecting.