Building an Enterprise Data Warehouse for Big Data

During my many interactions with players in the Big Data field, both those trying to use it themselves and those trying to provide solutions for others, some debates seem never-ending. One such topic is whether traditional EDW (Enterprise Data Warehouse) architectures are still valid in the era of Big Data, not just because of the sheer volume of data but also because the various applications tend to change in implementation to suit an environment built with Big Data and Cloud technologies.

In this post I try to summarize my views and opinions.

Introduction
Big Data technologies have been enjoying ever-increasing popularity for building large-scale, data-intensive applications. It has now become possible to perform tasks ranging from deep analytics to searches over large amounts of data at scale, tasks that were inefficient or prohibitively costly with traditional database techniques.

There are several key steps involved in preparing data so that it is suitable for analysis and processing by Big Data technologies. The remarkable difference from traditional data lies in the four V's often used to characterize Big Data: Volume, Variety, Velocity and Veracity. As with any data processing system, the first step is ingesting data and normalizing it into a Data Model suitable for consumption. Beyond that, certain newer concepts come into play which we don't see in traditional Data Warehouse designs or the Data Warehouse Bus Architecture.


The Unified Data Model
Processing any kind of data with standard techniques, be it query or machine analysis, requires imposing some structure upon it. This becomes complex when the source data is diverse and large. In plain terms it becomes an exercise in deciding what data to keep and what to discard, and with lots of data spanning a large number of dimensions this is a monumental task. The practice is domain specific (e.g. Telco, Finance, Retail/e-Commerce), where individual data sciences come into play; with careful study of the source data, de-duplication and normalization, a suitable Unified Data Model can be arrived at.


To elaborate further, the Unified Data Model is a concept describing a common global Data Model that can be used by the various components of the Big Data solution. It is built on top of the raw data, specific to the needs of the vertical domain, and spans Discovery, Ingestion, Storage, Analytics and Query. A common data model often eliminates the need for multiple data formats or separate storage for the different applications in the same domain.


The Data Models would typically be multi-dimensional Star Schemas or 1NF tables in their simplest forms. A deeply relational form like 3NF is rarely chosen, because as the total data grows, maintaining deep relations hurts performance. Note, however, that while the Unified Data Model tries to impose a common structure for the entire domain, it does not eliminate the need for application-specific Data Models. To draw a quick example, a Unified Data Model for Telco might contain information like Customer Information, Call Details, Service Subscription Information and so on, while a Campaign application (say, pushing a recommendation for a particular tariff plan to subscribers based on data analysis) would need additional Data Models specific to Campaign Details.
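To make the star-schema idea concrete, here is a minimal, purely illustrative sketch of such a Telco Unified Data Model expressed as PySpark schemas: one fact table for call detail records plus a customer dimension. All table and column names are hypothetical.

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   LongType, DoubleType, TimestampType)

    # Dimension: one row per subscriber (hypothetical columns)
    customer_dim = StructType([
        StructField("customer_id", StringType(), False),
        StructField("name", StringType(), True),
        StructField("segment", StringType(), True),
        StructField("subscription_plan", StringType(), True),
    ])

    # Fact: one row per call detail record, keyed to the dimensions
    call_detail_fact = StructType([
        StructField("call_id", StringType(), False),
        StructField("customer_id", StringType(), False),  # FK -> customer_dim
        StructField("plan_id", StringType(), True),       # FK -> a tariff plan dimension
        StructField("call_start", TimestampType(), True),
        StructField("duration_sec", LongType(), True),
        StructField("charge", DoubleType(), True),
    ])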


ETL
Big Data typically requires bulk processing, generally composed of Extract-Transform-Load (ETL) workflows. Even though it is called ETL, the phases may not happen in exactly that order. Depending on the intent it may be a simple EL flow, or even an ELT flow where data is first ingested quickly and then transformed. Quick ingestion is a requirement during the early stages of the data pipeline because the speed at which data can be written to storage (even distributed storage like HDFS or HBase) is often slower than the rate at which it is received (due to issues like multiple sources and multiple formats).

Once data is brought inside the system, it may be subjected to further scrubbing and processing. As data moves through multiple stages, it tends to become smaller, cleaner and more valuable (more valuable in the sense that it can now be used to drive real-world actions with less effort, using conventional decision-making tools). At a high level the ETL phases can be described as follows, with a small sketch after the list:


Extract: Source data from its immediate origin. This primarily involves tools like Connectors and Adapters to facilitate the movement.

Transform: The various processing performed on the data once it has been ingested. Typically involves validation, structuring, normalization, de-duplication, re-formatting and so on. 'Transform' usually processes data either as it arrives (is extracted) or after it has been stored, and large-scale parallel computing frameworks such as Hadoop MapReduce are commonly employed. The processing may also include tasks that extract insights (by running Data Mining algorithms).

Load: Functionally the mirror of the Extract phase; pushes the processed data into the next storage. The data can then be consumed by a further Transform or another Extract phase.
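Here is the small sketch referred to above: an ELT-style flow in PySpark that lands raw events quickly and only then cleans them. The paths, column names and JSON source format are assumptions for illustration.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

    # Extract + Load: pull raw events and persist them untouched, so ingestion keeps up
    raw = spark.read.json("hdfs:///landing/events/2024-01-01/")          # hypothetical path
    raw.write.mode("append").parquet("hdfs:///staging/events/")

    # Transform (later, in bulk): validate, de-duplicate and re-format the staged copy
    staged = spark.read.parquet("hdfs:///staging/events/")
    cleaned = (staged
               .dropDuplicates(["event_id"])                             # de-duplication
               .filter(F.col("customer_id").isNotNull())                 # validation
               .withColumn("event_time", F.to_timestamp("event_time")))  # re-formatting
    cleaned.write.mode("overwrite").parquet("hdfs:///curated/events/")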


The Big Data EDW Architecture
Enterprise Data Warehouse architecture is a fairly stable concept and deals with the way a Data Warehouse can be built using off-the-shelf technologies. The key purpose of the Data Warehouse is to holistically manage all the data needed by an enterprise and to facilitate business applications based on it. To a large extent the Data Warehouse does not modify existing data; rather, it provides various tools by which the data can be processed to generate new data suitable for applications.

Applying Enterprise Data Warehouse concepts to Big Data (characterized by the V’s) is not a significantly complex activity, and the general principles apply. Yet careful analysis needs to be done to come up with a suitable architecture.


Staging: The Staging Layer primarily deals with storage. Data from multiple sources is extracted and loaded into the Staging Layer (implemented using HDFS, for example). Staging is primarily an 'EL' activity, and very little 'T' may actually be involved.
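A minimal sketch of this staging step, assuming the raw extract is just a local CSV file and using the standard hdfs dfs commands (the paths are invented for illustration):

    import subprocess

    # 'EL' only: land the raw extract in the staging area without transforming it
    src = "/data/exports/cdr_2024-01-01.csv"          # hypothetical local extract
    dst = "hdfs:///staging/cdr/2024-01-01/"
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", dst], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", src, dst], check=True)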

Main DW: Represents the main part of the Data Warehouse and can also be called the 'Core DW'. It houses the raw data transformed and structured into the Unified Data Model; it is pretty much everything that is left of the original data after it has moved from Staging, only now structured and organized for easier consumption by platforms and applications. What is not there is probably useless, at least for most use cases. The Main DW can almost universally be populated using ETL workflows.
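As a sketch of how the Main DW might be populated, the PySpark job below conforms the staged call records to the Unified Data Model and publishes them into the Core DW as a partitioned Hive table. The database, table and column names are assumptions carried over from the earlier sketches.

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder.appName("load-main-dw")
             .enableHiveSupport().getOrCreate())

    # Read the staged raw extract (assumes the staging layout from the sketch above)
    staged = spark.read.csv("hdfs:///staging/cdr/", header=True, inferSchema=True)

    # Conform to the Unified Data Model: de-duplicate, fix types, keep the UDM columns
    cdr = (staged
           .dropDuplicates(["call_id"])
           .withColumn("call_start", F.to_timestamp("call_start"))
           .withColumn("call_date", F.to_date("call_start"))
           .select("call_id", "customer_id", "plan_id",
                   "call_start", "duration_sec", "charge", "call_date"))

    # Publish into the Core/Main DW as a partitioned Hive table (core_dw must exist)
    (cdr.write.mode("append")
        .format("parquet")
        .partitionBy("call_date")
        .saveAsTable("core_dw.call_detail_fact"))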

Summary DW: This is something missing in traditional EDW architectures, or at least not mentioned explicitly, and it may therefore appear non-standard. Yes and no. To understand, let's first look at its function: Big Data typically prohibits applications from using the raw data as-is. The ultimate purpose of data is to extract insights (and insights are usually smaller, post-computation data), and the Summary Layer represents the part of the DW and the Unified Data Model that is populated with this derived data. It is very much like the pre-computed data or pre-computed views suggested in the Lambda Architecture.

As we can see, it is something new and useful in the context of Big Data processing. Since it may be accessed, analyzed or correlated more frequently than the Core DW, we can even think of storing it separately on faster, more efficient storage (an MPP DB, say, rather than Hadoop). If we follow a Star Schema in our EDW Unified Data Model, we may quickly notice that some interesting Dimensions can be served from this summary information (e.g. if the Fact represents customer call records, a Customer Loyalty Score may be summary data derived using Machine Learning and treated as one of the Dimensions, potentially allowing one to analyze whether a highly loyal customer is avoiding long-distance calls and might be interested in a special tariff plan!). Dimensions apart, the summary data can easily serve the creation and visualization of reports, or may be used directly to deliver a service more efficiently. The Summary DW may be populated using multiple combinations of ETL workloads, often combined with Data Mining.
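Continuing the Telco illustration, a summary table such as a per-customer loyalty indicator could be derived from the Core DW facts. The sketch below substitutes a simple rule-based score for whatever Machine Learning model would really produce it; the table names and thresholds are assumptions.

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder.appName("build-summary-dw")
             .enableHiveSupport().getOrCreate())

    calls = spark.table("core_dw.call_detail_fact")

    # Pre-compute a small, derived summary from the large fact table
    loyalty = (calls.groupBy("customer_id")
               .agg(F.count("call_id").alias("total_calls"),
                    F.sum("charge").alias("total_spend"))
               .withColumn("loyalty_score",                 # stand-in for an ML model
                           F.when(F.col("total_spend") > 1000, "high")
                            .when(F.col("total_spend") > 100, "medium")
                            .otherwise("low")))

    loyalty.write.mode("overwrite").saveAsTable("summary_dw.customer_loyalty")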

Data Mart: The Data Mart is pretty much in the conventional sense, similar to standard EDW architectures. While the Main and Summary DWs serve the domain generically, the Data Mart serves specific applications. It accesses parts of the Main DW and Summary DW to build its own model, and it is the most frequently accessed layer and practically the most useful to the final application. Because it is typically smaller, and because of the frequent access patterns and more complex analytic workloads, the Data Mart would almost exclusively live on traditional OLAP database systems (e.g. MPP DBs).
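A campaign-style Data Mart might, for example, join a slice of the call facts with the loyalty summary and push the result into an MPP or relational store over JDBC. The sketch below is illustrative only; the JDBC URL, credentials and table names are placeholders, and a suitable JDBC driver is assumed to be available.

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder.appName("build-campaign-mart")
             .enableHiveSupport().getOrCreate())

    calls   = spark.table("core_dw.call_detail_fact")
    loyalty = spark.table("summary_dw.customer_loyalty")

    # Application-specific model: long-call usage per customer and loyalty band
    mart = (calls.filter(F.col("duration_sec") > 600)      # stand-in for 'long distance'
                 .join(loyalty, "customer_id")
                 .groupBy("customer_id", "loyalty_score")
                 .agg(F.count("call_id").alias("long_calls")))

    # Push into the MPP / relational store backing the mart (placeholders throughout)
    (mart.write.mode("overwrite")
         .format("jdbc")
         .option("url", "jdbc:postgresql://mpp-host:5432/marts")
         .option("dbtable", "campaign.long_call_usage")
         .option("user", "etl").option("password", "***")
         .save())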

Viable Implementation Options: Big Data EDWs may be implemented using a variety of technologies. Hadoop and MPP DBs (open source or proprietary, Hadoop-based or not) score high. Hadoop is suitable for meeting all the large-scale storage, ETL and Data Mining requirements needed to build the Main DW and Summary DW. Once built, their storage can be either in Hadoop (e.g. HDFS/Hive) or in an MPP DB (e.g. Greenplum). The choice will mainly be decided by scalability and workload requirements (SQL support, performance and so on).


The Data Mart, encountering more frequent near-real-time query and analytic workloads, is more conveniently placed in an MPP DB or, if the scale permits, even on relational database servers. Typically the marts will source data from the Main and Summary DWs and populate their own schema.

It is, however, possible to store the Main and Summary DWs on different kinds of systems (e.g. Main DW on Hadoop and Summary DW on an MPP DB); the choice to do so will depend largely on how the Data Marts and the applications higher up will benefit.
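Since the Main and Summary DWs may legitimately live on different systems, one option is to hide the storage choice behind a small writer utility and drive it from configuration. The sketch below is only an assumption about how such a helper might look, not a prescribed pattern.

    def write_layer(df, table, target="hive", jdbc_url=None):
        """Persist a DW layer either on Hadoop (Hive/Parquet) or on an MPP DB (JDBC)."""
        if target == "hive":
            df.write.mode("overwrite").format("parquet").saveAsTable(table)
        elif target == "jdbc":
            (df.write.mode("overwrite").format("jdbc")
               .option("url", jdbc_url)
               .option("dbtable", table)
               .save())
        else:
            raise ValueError("unknown target: " + target)

    # e.g. keep the bulky Core DW on Hadoop and the smaller Summary DW on an MPP DB:
    # write_layer(cdr, "core_dw.call_detail_fact", target="hive")
    # write_layer(loyalty, "summary_dw.customer_loyalty", target="jdbc",
    #             jdbc_url="jdbc:postgresql://mpp-host:5432/dw")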


Data Flow
When we go about building the DW and Data Marts, it is very important to design the data flows so that they achieve the maximum work with minimum overhead. This means populating parts of the model with as much parallelism as possible and with maximum consistency.


Redundancy cannot, and probably should not, be avoided altogether, since eliminating it compromises performance in many ways (both during loading and during analysis). But as data changes, the summaries may need to be re-computed for those incremental additions, and hence parts of the DW (especially the Data Marts) may fall out of sync with reality. This situation can be countered as long as the Marts themselves are also rebuilt (incrementally) from the changes happening below. Typically there will be a single ETL flow that updates the DW and, as parts get rebuilt, triggers 'child' ETLs to sync the Marts, as sketched below.
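A plain-Python sketch of that parent/child relationship is shown below; the step functions are invented placeholders, and in practice a workflow scheduler (Oozie, Airflow and the like) would own this orchestration.

    # Placeholder steps; in a real pipeline these would be Spark jobs or SQL scripts
    def load_core_dw(partition):           print("Core DW updated for", partition)
    def rebuild_summaries(partition):      print("Summaries re-computed for", partition)
    def refresh_campaign_mart(partition):  print("Campaign mart synced for", partition)
    def refresh_reporting_mart(partition): print("Reporting mart synced for", partition)

    def refresh_warehouse(partition):
        """Parent ETL: update the Core and Summary DWs for one incremental slice,
        then trigger the 'child' refreshes so the Marts do not fall out of sync."""
        load_core_dw(partition)
        rebuild_summaries(partition)
        for child in (refresh_campaign_mart, refresh_reporting_mart):
            child(partition)

    refresh_warehouse("2024-01-01")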

Summary

So far we have seen various aspects of how to build Enterprise Data Warehouses for Big Data. The overall architecture may not be remarkably different from traditional approaches, but the sheer volume and complexity of Big Data require us to address the topic from new angles. The ultimate goal is to identify the distinct layers, their implementation options and the connecting interfaces so that large-scale data analysis becomes fast, flexible and reasonably accurate.
