Understanding Data Lake Trends from Major Tech Companies - Part 1 (Uber)


In this multi-part series of short pieces we summarize some of the current practices (across on-premises/private cloud and public cloud deployments) followed by major Big Data driven technology companies, based on publicly available literature. 

At the end of the series we intend to present a high-level proposal that may be considered for future consolidation and evolution of the technical offering for enterprise Big Data solutions. 


Acknowledgement: Most of the material below is digested from Uber's Engineering Blog:  https://eng.uber.com/ 



Understanding Uber Big Data Stack

Uber’s Big Data platform has evolved since 2014 and is currently at Generation 3 and beyond. The following is a summary of the key features and takeaways of each generation.

Generation 1 (2014-2015) 

  • Traditional DWH-style architecture, based on PostgreSQL and a commercial DWH (HP Vertica) 
  • Ingestion Pipelines ensured Data Availability at D-2 

Challenges: 

  • Jobs were not resilient 
  • Limited scale (up to low TBs) and difficulty handling semi-structured data (like JSON) 
  • Cost of modeled data 


Generation 2 (2015-2017) 

  • Hadoop based 
  • Spark for programmatic jobs (see the sketch after this list) 
  • Hive for analytical queries 
  • Presto (early adopters) for ad hoc / interactive query scenarios 
  • Ingestion Pipelines ensured Data Availability at D-1 
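
For illustration, below is a minimal PySpark sketch of the Generation 2 pattern: Spark ingests semi-structured JSON into a Hive-managed table, which Hive (and later Presto) can then query. The paths and table names are hypothetical.

```python
from pyspark.sql import SparkSession

# Hypothetical paths and table names, shown only to illustrate the Gen 2 split:
# Spark handles programmatic ingestion, Hive/Presto handle the queries.
spark = (SparkSession.builder
         .appName("trips-ingest")
         .enableHiveSupport()          # register tables in the Hive metastore
         .getOrCreate())

# Ingest semi-structured JSON (the kind of data that strained the Gen 1 DWH)
trips = spark.read.json("hdfs:///raw/trips/2017-01-01/")

# Persist as a partitioned Hive table; analysts query it with Hive or Presto
(trips.write
      .mode("append")
      .partitionBy("city_id")
      .saveAsTable("warehouse.trips"))
```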

Challenges: 

  • Small-file problem in HDFS (user-defined batch jobs aggravated this; see the sketch after this list) 
  • Snapshots were not easy (pipelines needed to be rerun to create more data copies) 
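
As a rough sketch of the small-file issue (illustrative numbers and paths only): a naive job writes one output file per task, so a highly parallel job over a modest dataset floods the Namenode with tiny files; coalescing before the write, which is essentially what Generation 3's “Stitcher” does after the fact, keeps file sizes healthy.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-files-demo").getOrCreate()

events = spark.read.json("hdfs:///raw/events/")   # hypothetical input

# Naive write: one output file per task, i.e. potentially thousands of small
# files, each of which the Namenode must track in memory.
events.write.mode("overwrite").parquet("hdfs:///tmp/events_many_files/")

# Reducing parallelism just before the write yields a handful of larger files.
(events.coalesce(16)
       .write.mode("overwrite")
       .parquet("hdfs:///tmp/events_few_files/"))
```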


Generation 3 (2017-2019) 

  • HDFS scalability using ViewFS and Namenode Federation (see the configuration sketch after this list) 
  • Small files, especially those generated by the system (e.g., YARN logs), were moved to a different namespace/cluster (ElasticSearch) 
  • “Stitcher” process to merge small files 
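
The following is a minimal sketch of how ViewFS with Namenode Federation might be wired up; the mount-table name, namespaces, and paths are assumptions for illustration, not Uber's actual layout. Each mount point maps a client-visible path onto a different federated namespace, spreading metadata load across Namenodes and isolating system-generated small files.

```python
from pyspark import SparkConf

# Hypothetical ViewFS mount table (core-site.xml properties expressed as a dict).
# "prod" is an assumed mount-table name; the HDFS namespaces are assumptions too.
viewfs_conf = {
    "fs.defaultFS": "viewfs://prod/",
    # User data and warehouse tables live on one federated namespace
    "fs.viewfs.mounttable.prod.link./user":      "hdfs://ns-warehouse/user",
    "fs.viewfs.mounttable.prod.link./warehouse": "hdfs://ns-warehouse/warehouse",
    # System-generated small files (e.g. YARN logs) isolated on another namespace
    "fs.viewfs.mounttable.prod.link./app-logs":  "hdfs://ns-logs/app-logs",
}

# Passing the settings to a Spark job's Hadoop configuration (sketch only);
# the same properties would normally live in core-site.xml.
conf = SparkConf()
for key, value in viewfs_conf.items():
    conf.set(f"spark.hadoop.{key}", value)
```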

Engineering best practices: 

  • Garbage Collector Tuning (CMS, G1GC) 
  • Selectively patching HDFS (decouple YARN and HDFS upgrades) if necessary 
  • Strict enforcement of HDFS and Hive quotas. Correlate the file-count quota with the storage-space quota to indirectly enforce optimal file sizes (see the quota sketch after this list) 
  • Customer education to use Hive optimally (such as not creating too many temp tables) so as not to create pressure on the Namenode and YARN 
  • Enable Erasure coding for new cluster builds 
  • Built and introduced Hudi (update, delete, snapshots); see the Hudi upsert sketch after this list 
  • Continued with the older stack composed of Spark, Hive, and Presto 
  • Spark moved to a Mesos cluster (out of YARN) 
  • Ingestion Pipelines ensured Data Availability at T-60 mins 
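
A rough sketch of the quota idea mentioned above: set the space quota and the file-count (namespace) quota together so that staying within both implies a sensible average file size. The `hdfs dfsadmin` commands are standard; the directory, sizes, and target average file size are assumptions.

```python
import subprocess

# Assumed targets: 10 TB of space and an average file size of at least 256 MB.
SPACE_QUOTA_BYTES = 10 * 1024**4          # 10 TB
TARGET_AVG_FILE_BYTES = 256 * 1024**2     # 256 MB
# Staying under both quotas keeps the average file size at roughly
# SPACE_QUOTA_BYTES / NAME_QUOTA = 256 MB or larger.
NAME_QUOTA = SPACE_QUOTA_BYTES // TARGET_AVG_FILE_BYTES

directory = "/warehouse/team_x"           # hypothetical project directory

# Standard HDFS admin commands, invoked here via subprocess for completeness.
subprocess.run(["hdfs", "dfsadmin", "-setSpaceQuota",
                str(SPACE_QUOTA_BYTES), directory], check=True)
subprocess.run(["hdfs", "dfsadmin", "-setQuota",
                str(NAME_QUOTA), directory], check=True)
```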
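
And a minimal sketch of writing a Hudi table from Spark to get upsert/delete/snapshot semantics on top of HDFS. The table name, keys, and paths are hypothetical; the option names follow the standard Hudi Spark datasource, and the hudi-spark bundle must be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

# `updates` holds only the changed rows (hypothetical staging path).
updates = spark.read.parquet("hdfs:///staging/trip_updates/")

hudi_options = {
    "hoodie.table.name": "trips_hudi",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.partitionpath.field": "city_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# Hudi merges the changed rows into the existing table, so the whole ingestion
# pipeline no longer has to be re-run just to refresh the data.
(updates.write
        .format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("hdfs:///warehouse/trips_hudi/"))
```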


Generation 4/beyond (2019-) 

  • Containerized (Docker) architecture for Hadoop, enabling immutable and repeatable deployments of daemon processes (Namenode, Datanode, etc.). More than 21,000 Hadoop nodes have moved to containers since 2020 
  • Observer Namenode (from Hadoop 3.3.x, a separate NN for read-only operations). Note: reads served by the Observer may lag the Active Namenode unless the client explicitly syncs state (see the client configuration sketch after this list) 
  • Focus shifted largely to operational reliability (configurations, access management, upgrades, etc.)  
  • Declarative cluster operations (rather than an action-oriented / imperative model) using Cadence (built in-house, later open sourced) 
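
Below is a sketch of the client-side settings that route read traffic to an Observer Namenode; the nameservice and host names are placeholders. The ObserverReadProxyProvider directs read RPCs to the Observer and write RPCs to the Active Namenode; clients that need strict read-after-write consistency can call msync() before reading.

```python
# Client-side HDFS settings for Observer reads (nameservice/hosts are placeholders).
observer_read_conf = {
    "dfs.nameservices": "prodns",
    "dfs.ha.namenodes.prodns": "nn1,nn2,nn3",
    "dfs.namenode.rpc-address.prodns.nn1": "nn1.example.com:8020",  # active
    "dfs.namenode.rpc-address.prodns.nn2": "nn2.example.com:8020",  # standby
    "dfs.namenode.rpc-address.prodns.nn3": "nn3.example.com:8020",  # observer
    # Route read RPCs to the Observer and write RPCs to the Active Namenode.
    "dfs.client.failover.proxy.provider.prodns":
        "org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider",
}

# As with the ViewFS sketch, these can be passed to a job via the
# spark.hadoop.* prefix or placed in the client's hdfs-site.xml.
```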

Challenges: 

  • Operational logic that is not implemented carefully puts unforeseen pressure on the clusters (e.g., directory walks for HDFS usage accounting resulted in considerable disk I/O and impacted running jobs) 
  • Region-wise backups of HDFS are needed to avoid block-loss situations 


Key Takeaways 

  • Continued to use Hive (MR engine) for long-running queries. Deployed Hudi as a default abstraction on top of Hive to enable advanced features (like ACID semantics, snapshots, etc.) 
  • Use Presto for ad hoc / interactive queries (see the sketch after this list) 
  • Use Spark and Spark SQL for programmatic access to data 
  • Move Spark executors to a separate cluster on Mesos (and later to Kubernetes) 
  • HDFS is a major bottleneck to performance and scale; take all means necessary to tune and re-design it, and reduce pressure by carefully managing small files 
  • Containerize Hadoop and move away from bare metal deployments for operability, maintenance, and portability reasons 
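
For completeness, a small sketch of an ad hoc query issued through Presto using the presto-python-client package; the host, catalog, table, and columns are placeholders.

```python
import prestodb  # pip install presto-python-client

# Connection details are placeholders for an interactive / ad hoc query path.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="warehouse",
)

cur = conn.cursor()
cur.execute("""
    SELECT city_id, count(*) AS trips
    FROM trips
    WHERE dt = DATE '2019-01-01'
    GROUP BY city_id
    ORDER BY trips DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```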
