Adhoc Query in Hadoop Ecosystem: A Study


In this post I try to examine, compare and summarize various options available in the Hadoop ecosystem to perform Adhoc queries. Even though performing queries on small scale databases (up to several hundred GBs) have many solutions in both commercial and open source spheres (via traditional SQL/OLTP/OLAP approaches) the problem becomes increasingly difficult to solve in a Big Data scenario (TBs to PBs). OLTP is practically unthinkable (and probably unnecessary) and the scale of data, analytic needs and available infrastructure present unforeseen challenges.

In the following, I attach a  study report which I prepared roughly a year back. Although old, the material can prove to be useful as a detailed comparative study of related products. Since this report was prepared, significant advancements have been made (Projects like Phoenix, Spark etc. got adopted under the Apache umbrella), new features have been added with improved stability and performance, new products came into the landscape (eg. Presto from Facebook) and some disappeared as well (eg. Panthera ASE). I have not put effort to keep this report up to date, but have followed the advancements and shifts closely. Will be happy to post/discuss them as opportunity arises.

Disclaimer: The information presented below is largely complied from freely available resources from the internet and my own inferences. I thank the original authors (too many to acknowledge them individually - I wish I could) for making their thoughts and information freely available for the greater good and advancement of human knowledge!


1. Introduction





2. General Architecture








3. Industry Landscape




4. Apache Hive














5. Cloudera Impala















6. Hortonworks Stinger Initiative

















7. Salesforce Phoenix













8. Intel Panthera ASE











9. AMPLab (UCB) Spark/Shark
















10. Summary











Comments

Popular Posts