Big Data Adhoc Query Architectures

In this post I try to summarize approaches that can be followed to build an ad-hoc query system on top of the Hadoop ecosystem. There seems to be a trade-off between speed and complexity, which is basically "how much you can do versus how fast you can do it".

This post can be considered a follow-up to my previous entry on this topic, which can be found here.

In almost all cases, the application of Big Data is to derive insights. As a result, the ability to query, correlate and analyze the data resident in Hadoop (HDFS) becomes the primary motivation for building an ad-hoc query system on top of it, much like a traditional database running SQL. But this system would practically require no update use cases to be supported: it is a WORM (Write Once Read Many) system.
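The WORM pattern described above can be made concrete with a minimal sketch, assuming a hypothetical append-only store and made-up field names; it is an analogy for the access pattern, not an implementation of any Hadoop component.

```python
# Minimal sketch of the WORM access pattern: records are appended once,
# never updated or deleted, and queried many times with arbitrary filters.

class WormStore:
    def __init__(self):
        self._rows = []

    def write(self, row):
        """Append-only ingest; there is deliberately no update/delete path."""
        self._rows.append(dict(row))

    def scan(self, predicate=lambda r: True):
        """Ad-hoc read: a full scan with an arbitrary filter."""
        return [r for r in self._rows if predicate(r)]

store = WormStore()
store.write({"user": "a", "bytes": 100})
store.write({"user": "b", "bytes": 250})
store.write({"user": "a", "bytes": 50})

# Equivalent of: SELECT SUM(bytes) FROM logs WHERE user = 'a'
total = sum(r["bytes"] for r in store.scan(lambda r: r["user"] == "a"))
print(total)  # 150
```

Because nothing is ever rewritten in place, the store never needs transactional machinery; every query is a read over immutable data, which is exactly what makes the architectures below feasible.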

The ideal way would be to bypass any intermediate tooling and query data directly off the FS/DFS. But to make this efficient, data needs to be schematized and stored appropriately in the first place, and that creates the biggest challenge. How data should be organized, indexed and co-located with other data are the key questions to be answered. This approach has, however, proved successful in addressing both speed and complexity: Google Dremel and its clones, such as Cloudera Impala and Apache Drill, show the most promise in the open source arena.
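The organization question above is largely about layout. A toy contrast between row-oriented and column-oriented layouts of the same table illustrates the idea behind Dremel-style columnar formats, though not their actual encoding; the table and field names are made up.

```python
# Same data, two physical layouts. With a columnar layout, a query that
# touches one field reads only that column instead of whole records.

rows = [
    {"country": "IN", "revenue": 10, "clicks": 3},
    {"country": "US", "revenue": 25, "clicks": 7},
    {"country": "IN", "revenue": 5,  "clicks": 1},
]

# Row layout: SUM(revenue) must visit every full record.
total_row_layout = sum(r["revenue"] for r in rows)

# Column layout: the same table pivoted into one list per field.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# SUM(revenue) now touches a single contiguous column.
total_col_layout = sum(columns["revenue"])

print(total_row_layout, total_col_layout)  # 40 40
```

Both layouts return the same answer; the difference is how many bytes a scan has to touch, which is why schematizing data up front matters so much for direct-off-HDFS query engines.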

The following material is a high-level outline of possible approaches to building an ad-hoc query infrastructure on Big Data using Hadoop and its ecosystem.

1. Introduction

2. Logical Architecture

3. How Current Options Fit In

4. Three Reference Architectures

5. Unified Approaches
As can be observed, Hive and further improvements on top of it may be the way forward, even if that just means reusing the QL grammar and metadata management while replacing the core execution engine with something leaner and more node-local (Impala, Tez). HBase-plus-SQL initiatives come a close second in my opinion, for the simple reason that HBase has primarily been designed for storing and querying large amounts of data efficiently, not for analytic purposes. Analytic workloads would therefore require HBase data to be organized and reorganized in multiple formats (or tables) to make even simple analyses like correlations (JOINs) possible. This in turn adds to the storage and computing costs.
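The point about reorganizing HBase data can be sketched as follows. A key-value store answers a query cheaply only when the row key matches the access pattern, so serving two patterns means keeping two organized copies; plain dicts stand in for HBase tables here, and the data is invented for illustration.

```python
# Two access patterns over the same events require two key layouts,
# i.e. two physical copies of the data.

events = [
    ("2024-01-01", "alice", "login"),
    ("2024-01-01", "bob",   "click"),
    ("2024-01-02", "alice", "click"),
]

# Copy 1: keyed by user, serving "all events of a user".
by_user = {}
for day, user, action in events:
    by_user.setdefault(user, []).append((day, action))

# Copy 2: keyed by day, serving "all events of a day".
by_day = {}
for day, user, action in events:
    by_day.setdefault(day, []).append((user, action))

# Each lookup is now a direct key access, but storage has doubled.
print(len(by_user["alice"]), len(by_day["2024-01-01"]))  # 2 2
```

This duplication is exactly the storage and computing cost the paragraph refers to: every new analytic question risks demanding yet another layout of the same data.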

Coming to the last, but probably the most important, point: storage formats. Query systems require data to be stored and organized (schematized) in special formats. But these should be as open as possible (open source, or at least open specification) so that multiple engines can query and process the data in their different ways, yet from the same physical source (HDFS is quite a universal choice in this regard). After all, keeping a specialized copy for every kind of engine adds enormously to storage costs. It may be debatable whether we need multiple engines on the same data at the same time, but such approaches make Data Lakes a real possibility. There is probably no one-size-fits-all solution, and business angles like licensing and premium charging for premium engines push us further in the direction of having multiple solutions at the same time.
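The "one open copy, many engines" argument can be sketched with two toy "engines" reading the same file because its format is open to both. JSON Lines stands in here for an open columnar format such as Parquet or ORC; the file name and record fields are made up.

```python
# Two independent "engines" share one physical copy of the data
# because both understand its (open) on-disk format.

import json
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "shared_dataset.jsonl")
with open(path, "w") as f:
    for rec in [{"item": "x", "qty": 2}, {"item": "y", "qty": 5}]:
        f.write(json.dumps(rec) + "\n")

def filter_engine(path, item):
    """Engine A: selective point lookups over the shared file."""
    with open(path) as f:
        return [r for r in (json.loads(l) for l in f) if r["item"] == item]

def aggregate_engine(path):
    """Engine B: full-scan aggregation over the very same bytes."""
    with open(path) as f:
        return sum(json.loads(l)["qty"] for l in f)

print(filter_engine(path, "x"))  # one matching record
print(aggregate_engine(path))    # 7
```

Neither engine needed its own specialized copy; that is the storage saving an open specification buys, and it is what lets a single HDFS dataset back a Data Lake serving several query engines.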
