Understanding Data Lake Trends from Major Tech Companies - Part 1 (Uber)
In this multi-part series of short pieces, we summarize some of the current practices (across on-premises/private cloud and public cloud deployments) followed by major Big Data-driven technology companies, based on publicly available literature.
At the end of the series, we intend to present a high-level proposal that may be considered for the future consolidation and evolution of a technical offering for enterprise Big Data solutions.
Acknowledgement: Most of the material below is digested from Uber's Engineering Blog: https://eng.uber.com/
Understanding Uber Big Data Stack
Uber’s Big Data platform has evolved since 2014 and is currently in Generation 3+. The following is a summary of the key features and takeaways.
Generation 1 (2014-2015)
- Traditional DWH-style architecture, based on PostgreSQL and a commercial DWH (HP Vertica)
- Ingestion Pipelines ensured Data Availability at D-2
Challenges:
- Jobs were not resilient
- Limited scale (up to low TBs) and difficulty handling semi-structured data (like JSON)
- Cost of modeled data
Generation 2 (2015-2017)
- Hadoop based
- Spark for programmatic jobs
- Hive for analytical queries
- Presto (early adopters) for ad hoc / interactive query scenarios
- Ingestion Pipelines ensured Data Availability at D-1
Challenges:
- Small file problem in HDFS (user defined batch jobs aggravated this)
- Snapshots were not easy (pipelines needed to be rerun to create more data copies)
Generation 3 (2017-2019)
- HDFS Scalability using ViewFS and Namenode Federation
- Small files, especially system-generated ones (e.g. YARN logs), moved to a different namespace / cluster (Elasticsearch)
- “Stitcher” to merge small files
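As an illustration of the small-file merging idea above, here is a minimal sketch of how a merge planner could group small files into batches near a target size. This is not Uber's implementation; the function name `plan_merges`, the greedy strategy, and the 128-byte toy target are assumptions made for the example.

```python
from typing import List, Tuple


def plan_merges(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily group small files so each group stays at or below target_bytes.

    `files` is a list of (path, size_in_bytes) pairs. Files already at or
    above the target are left alone. Hypothetical helper for illustration.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    def flush() -> None:
        # A group with a single file would not reduce the file count.
        if len(current) > 1:
            groups.append(list(current))

    # Largest first, so big files open a group and small ones fill the gaps.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if size >= target_bytes:
            continue  # already large enough
        if current and current_size + size > target_bytes:
            flush()
            current, current_size = [], 0
        current.append(path)
        current_size += size
    flush()
    return groups


if __name__ == "__main__":
    sample = [("a", 100), ("b", 60), ("c", 40), ("d", 20), ("e", 300)]
    print(plan_merges(sample, target_bytes=128))
```

Each returned group would then be rewritten as one larger file, which reduces the number of objects the Namenode has to track.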
Engineering best practices:
- Garbage Collector Tuning (CMS, G1GC)
- Selectively patching HDFS (decouple YARN and HDFS upgrades) if necessary
- Strict enforcement of HDFS and Hive quotas. Correlate file-count quotas with storage space quotas to indirectly enforce optimal file sizes
- Customer education to use Hive optimally (such as not creating too many temp tables) so as not to put pressure on the Namenode and YARN
- Enable Erasure coding for new cluster builds
- Built and Introduced Hudi (Update, Delete, Snapshots)
- Continue with the older stack composed of Spark, Hive and Presto
- Spark moved to Mesos cluster (out of YARN)
- Ingestion Pipelines ensured Data Availability at T-60 mins
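The quota correlation listed among the best practices above can be made concrete with a little arithmetic. HDFS supports a name quota (a count of files and directories) and a space quota (bytes, counted including replicas). A directory that fills its space quota without exceeding its name quota must hold files of at least a certain average size. The sketch below shows that relationship; the function names and the sample numbers are assumptions for illustration, not values from Uber.

```python
def implied_min_avg_file_size(space_quota_bytes: int, name_quota: int,
                              replication: int = 3) -> float:
    """Average logical file size (bytes) needed to fill the space quota
    while staying within the name quota. Assumes every file uses the same
    replication factor and ignores directories counted by the name quota."""
    return space_quota_bytes / (replication * name_quota)


def name_quota_for_target(space_quota_bytes: int, target_avg_file_bytes: int,
                          replication: int = 3) -> int:
    """Name quota that pairs with a space quota so that filling the space
    quota implies files of roughly the target average size."""
    return space_quota_bytes // (replication * target_avg_file_bytes)


if __name__ == "__main__":
    TIB = 2 ** 40
    MIB = 2 ** 20
    space_quota = 300 * TIB  # hypothetical quota for one team directory

    avg = implied_min_avg_file_size(space_quota, name_quota=1_000_000)
    print(f"implied average file size: {avg / MIB:.2f} MiB")

    quota = name_quota_for_target(space_quota, target_avg_file_bytes=128 * MIB)
    print(f"name quota for a 128 MiB average: {quota}")
```

In practice the two quotas would be applied with `hdfs dfsadmin -setQuota` and `hdfs dfsadmin -setSpaceQuota`; a tenant producing many small files then hits the name quota long before the space quota.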
Generation 4/beyond (2019-)
- Containerized (Docker) Architecture for Hadoop deployment for immutable and repeatable deployments of daemon processes (Namenode, Datanode, etc.). More than 21,000 Hadoop nodes have moved to containers since 2020
- Observer Namenode (from Hadoop 3.3.x, a separate NN for read-only operations). Note: reads served by an Observer Namenode may be stale relative to the Active Namenode
- Focus shifted largely to operations reliability (configurations, access management, upgrades etc.)
- Declarative cluster operations (rather than an action-oriented / imperative model) using Cadence (built in-house, later open sourced)
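To show what "declarative" means in the last point, the sketch below compares a desired cluster state with the observed state and derives the actions needed to converge. It does not use the Cadence API; the `reconcile` function, the node names, and the version labels are all hypothetical.

```python
from typing import Dict, List


def reconcile(desired: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return the actions needed to move `actual` towards `desired`.

    Both arguments map a node name to the software version it runs.
    The operator only edits `desired`; actions are derived, not scripted.
    """
    actions: List[str] = []
    for node, version in sorted(desired.items()):
        if node not in actual:
            actions.append(f"start {node} at {version}")
        elif actual[node] != version:
            actions.append(f"upgrade {node} from {actual[node]} to {version}")
    # Nodes that exist but are no longer declared get removed.
    for node in sorted(actual.keys() - desired.keys()):
        actions.append(f"decommission {node}")
    return actions


if __name__ == "__main__":
    desired_state = {"dn1": "v2", "dn2": "v2"}
    actual_state = {"dn1": "v1", "dn3": "v1"}
    for action in reconcile(desired_state, actual_state):
        print(action)
```

Running the same comparison once the cluster has converged yields no actions, which makes the loop safe to repeat. A workflow engine would typically execute the derived actions with retries and record their progress.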
Challenges:
- Incorrectly implemented operational logic put unforeseen pressure on the clusters (e.g., directory walks for HDFS usage accounting caused considerable disk I/O and impacted running jobs)
- Region-wise backups of HDFS are needed to avoid block loss situations
Key Takeaways
- Continue to use Hive (MR engine) for long-running queries. Deploy Hudi as the default abstraction on top of Hive to enable advanced features (like ACID, snapshots, etc.)
- Use Presto for ad hoc interactive queries
- Use Spark and Spark SQL for programmatic access to data
- Move Spark executors to a separate cluster on Mesos (and later to Kubernetes)
- HDFS is a major bottleneck to performance and scale; take all means necessary to tune and redesign it, and reduce pressure by carefully managing small files
- Containerize Hadoop and move away from bare metal deployments for operability, maintenance, and portability reasons