Building an Enterprise Data Warehouse for Big Data
During my many interactions with players in the Big Data field, both those trying to use it themselves and those trying to provide solutions for others, some debates seem never-ending. One such topic is whether traditional EDW (Enterprise Data Warehouse) architectures are still valid in the era of Big Data. The question arises not just because of the sheer volume of data, but also because applications tend to change in implementation when built on Big Data and Cloud technologies.
In this post I try to summarize my views and opinions.
Big Data technologies have been enjoying ever-increasing popularity for building large-scale, data-intensive applications. It has now become possible to perform tasks ranging from deep analytics to searches over large amounts of data at scale, tasks that were inefficient or prohibitively costly using traditional database techniques.
Several key steps are involved in preparing data to make it suitable for analysis and processing by Big Data technologies. The remarkable difference from traditional data lies in the four V's often used to characterize Big Data: Volume, Variety, Velocity and Veracity. As with any data processing system, the first step involves ingesting data and normalizing it into a data model suitable for consumption. Beyond that, certain newer concepts come into play which we do not see in traditional Data Warehouse designs or Data Warehouse Bus architectures.
The Unified Data Model
Processing any kind of data using standard techniques, be it query or machine analysis, requires imposing a certain structure upon it. This becomes complex when the source data is diverse and large. In plain terms it becomes an exercise in deciding what data to keep and what to discard, and with lots of data spread across a large number of dimensions this becomes a monumental task. The practice is domain specific (e.g. Telco, Finance, Retail/e-Commerce), where the individual data sciences come into play; with careful study of the source data, de-duplication and normalization, a suitable Unified Data Model can be arrived at.
To elaborate further, the Unified Data Model is a concept describing the common global data model that can be used by the various components of the Big Data solution. It is built on top of the raw data, specific to the needs of the vertical domain, and spans Discovery, Ingestion, Storage, Analytics and Query. A common data model often eliminates the need for multiple data formats or storage systems across applications in the same domain.
The data models would typically be multi-dimensional star schemas or 1NF tables in their simplest forms. A deep relational form like 3NF is rarely chosen, however, because as the total volume of data grows, maintaining deep relations hurts performance. It may also be noted that while the Unified Data Model tries to impose a common structure for the entire domain, it does not eliminate the need for application-specific data models. To draw a quick example, a Unified Data Model for Telco might contain Customer Information, Call Details, Service Subscription Information and so on, while a campaign application (say, one pushing recommendations for a particular tariff plan to subscribers based on data analysis) would need additional data models specific to the campaign details.
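To make the example concrete, below is a minimal sketch of what a small slice of such a Telco Unified Data Model could look like, expressed as plain Python dataclasses. The entity and field names are hypothetical and only illustrate the split between a fact (call records) and its shared dimensions (customer, subscription).

```python
from dataclasses import dataclass
from datetime import datetime

# Dimension: customer master data, shared by every application in the domain.
@dataclass
class CustomerDim:
    customer_id: str
    name: str
    segment: str            # e.g. "prepaid" / "postpaid"
    region: str

# Dimension: service subscription details.
@dataclass
class SubscriptionDim:
    subscription_id: str
    customer_id: str
    tariff_plan: str
    activated_on: datetime

# Fact: one row per call detail record, keyed to the dimensions above.
@dataclass
class CallFact:
    call_id: str
    customer_id: str
    subscription_id: str
    started_at: datetime
    duration_seconds: int
    call_type: str          # e.g. "local" / "long_distance"
```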
ETL
Big Data typically requires bulk processing. The processing is generally composed of Extract-Transform-Load, or ETL, workflows. Even though it is called ETL, the phases may not happen in exactly that order. Depending on the intent it may be a simple EL flow or even an ELT flow, where data is first ingested quickly and then transformed. Quick ingestion is a requirement during the early stages of the data pipeline because the speed at which data can normally be written to storage (even a distributed store like HDFS or HBase) is often slower than the rate at which it is received (due to issues like multiple sources and multiple formats).
Once data is brought into the system, it may be subjected to further scrubbing and processing. As data passes through multiple stages, it tends to become smaller, cleaner and more valuable (by valuable we mean the data can now be used for taking real-world actions with less effort, utilizing conventional decision-making tools). At a high level the ETL phases can be explained as follows:
Extract: Sources data from its immediate origin. It primarily involves tools like connectors and adapters to facilitate the movement.
Transform: The various processing that can be performed on the data once it has been ingested. Typically involves validation, structuring, normalization, de-duplication, re-formatting and so on. ‘Transform’ usually processes data as it arrives (is extracted) or after it has been stored, and large-scale parallel computing frameworks such as Hadoop MapReduce are commonly employed. The processing may include tasks that extract insights (by running data mining algorithms) as well.
Load: Functionally the mirror of the Extract phase; it pushes the processed data into the next storage. The data can then be consumed by a further Transform or another Extract phase.
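The sketch below walks through these phases in the ELT order discussed above, with plain Python standing in for a distributed engine such as MapReduce; the file layout and record fields (call_id, customer_id, duration) are assumptions made purely for illustration.

```python
import csv
import json
from pathlib import Path

def extract(source_files):
    """Extract: read raw records from the immediate origin (here, local JSON-lines files)."""
    for path in source_files:
        with open(path) as src:
            for line in src:
                yield json.loads(line)

def load(records, staging_dir):
    """Load: append records to the staging area untouched, so ingestion stays fast."""
    staging_path = Path(staging_dir)
    staging_path.mkdir(parents=True, exist_ok=True)
    raw_file = staging_path / "raw_calls.jsonl"
    with open(raw_file, "a") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    return raw_file

def transform(raw_file, curated_file):
    """Transform: validate, normalize and de-duplicate the staged data into a curated table."""
    seen = set()
    with open(raw_file) as src, open(curated_file, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["call_id", "customer_id", "duration_seconds"])
        for line in src:
            record = json.loads(line)
            call_id = record.get("call_id")
            if not call_id or call_id in seen or not record.get("customer_id"):
                continue  # drop invalid rows and duplicates
            seen.add(call_id)
            writer.writerow([call_id, record["customer_id"], int(record.get("duration", 0))])

# ELT ordering: extract and load first for speed, transform afterwards.
# staged = load(extract(["feed_a.jsonl", "feed_b.jsonl"]), "staging")
# transform(staged, "curated_calls.csv")
```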
The Big Data EDW Architecture
Enterprise Data Architecture is a fairly stable concept and deals with the way a Data Warehouse can be built using off-the-shelf technologies. The key purpose of the Data Warehouse is to holistically manage all the data needed by an enterprise and facilitate business applications based on it. To a large extent the Data Warehouse does not modify existing data but provides various tools by which the data can be processed to generate new data suitable for applications.
Applying Enterprise Data Warehouse concepts to Big Data (characterized by the V's) is not a significantly complex activity, and the general principles apply. Yet careful analysis is needed to arrive at a suitable architecture.
Staging: The Staging Layer primarily deals with storage. Data from multiple sources is extracted and loaded into the Staging Layer (implemented using HDFS, for example). Staging is primarily an ‘EL’ activity, and very little ‘T’ may actually be involved.
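A staging step of this kind might look like the following sketch, where a local, date-partitioned directory stands in for HDFS; the path layout is an assumption, and the point is simply that the data lands as-is with no transformation.

```python
import shutil
from datetime import date
from pathlib import Path

def stage(source_file, staging_root="staging"):
    """'EL' only: copy the raw feed into a date-partitioned staging path, untouched."""
    partition = Path(staging_root) / f"dt={date.today():%Y-%m-%d}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / Path(source_file).name
    shutil.copy(source_file, target)   # no parsing, no validation: keep ingestion fast
    return target

# stage("feeds/cdr_20240101.csv")  ->  staging/dt=<today>/cdr_20240101.csv
```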
Main DW: Represents the main part of the Data Warehouse, also called the ‘Core DW’. It houses the raw data transformed and structured into the Unified Data Model, and is pretty much everything that is left of the original data after it has moved out of Staging, only now structured and organized for easier consumption by platforms and applications. What is not there is probably useless, at least for most use cases. The Main DW can almost universally be populated using ETL workflows.
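As a rough illustration, the transform step of such a workflow might map each staged raw record into Unified Data Model rows along these lines; the source field names (id, msisdn and so on) are hypothetical and would in practice come from the domain-specific study of the source data.

```python
def to_unified_model(raw_record):
    """Shape one staged raw record into Unified Data Model rows (a fact plus a dimension).

    Field names are illustrative only; a real mapping is derived from the domain study."""
    call_fact = {
        "call_id": raw_record["id"],
        "customer_id": raw_record["msisdn"],
        "started_at": raw_record["start_time"],
        "duration_seconds": int(raw_record.get("duration", 0)),
        "call_type": raw_record.get("type", "local"),
    }
    customer_dim = {
        "customer_id": raw_record["msisdn"],
        "segment": raw_record.get("segment", "unknown"),
        "region": raw_record.get("region", "unknown"),
    }
    return call_fact, customer_dim
```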
Summary DW: This is something missing from traditional EDW architectures, or at least not mentioned explicitly, and it may therefore appear non-standard. Yes and no. To understand it, let us first consider its function: Big Data typically prohibits applications from using the raw data as is. The purpose of data is ultimately to extract insights (and insights are usually smaller, post-computation data), and the Summary Layer represents the part of the DW and the Unified Data Model that is populated using this ‘derived’ data. This is very much like the pre-computed data or pre-computed views suggested in the Lambda Architecture.
As we can see, it is something new and useful in the context of Big Data processing. Since it may be accessed, analyzed or correlated more frequently than the Core DW, we can even think of storing it separately on faster, more efficient storage (an MPP DB, perhaps, rather than Hadoop). If we follow a star schema in our EDW Unified Data Model, we may quickly find that some interesting dimensions can be served from this summary information (e.g. if the fact represents customer call records, the Customer Loyalty Score may be summary data derived using machine learning and treated as one of the dimensions, potentially allowing one to analyze whether a highly loyal customer is avoiding long-distance calls and may be interested in a special tariff plan!). Dimensions apart, the summary data can easily serve the creation and visualization of reports, or may be used directly for delivering a service with better efficiency. The Summary DW may be populated using various combinations of ETL workloads, often combined with data mining.
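A sketch of how such derived data might be produced is given below. A simple usage-based heuristic stands in for the machine-learning step, and the score formula and field names are purely illustrative; the output rows are what would be loaded into the Summary DW and exposed as a dimension.

```python
from collections import defaultdict

def compute_loyalty_scores(call_facts):
    """Derive a per-customer loyalty score summary from the Core DW call facts."""
    usage = defaultdict(lambda: {"calls": 0, "seconds": 0})
    for fact in call_facts:
        usage[fact["customer_id"]]["calls"] += 1
        usage[fact["customer_id"]]["seconds"] += fact["duration_seconds"]

    summary_rows = []
    for customer_id, stats in usage.items():
        # Toy scoring rule capped at 100; a real system would use a trained model.
        score = min(100, stats["calls"] * 2 + stats["seconds"] // 600)
        summary_rows.append({"customer_id": customer_id, "loyalty_score": score})
    return summary_rows   # loaded into the Summary DW, usable as a dimension
```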
Data Mart: The Data Mart here is pretty much in the conventional sense, similar to standard EDW architectures. While the Main and Summary DWs serve the domain generically, the Data Mart serves specific applications. It accesses parts of the Main DW and Summary DW to build its own model, and since it is the most frequently accessed and practically the most useful to the final application, it is typically smaller. Because of the frequent access patterns and the more complex analytic workloads, the Data Mart would almost exclusively sit on traditional OLAP database systems (e.g. MPP DBs).
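The following sketch shows how a campaign-specific mart might be assembled by joining Core DW facts with Summary DW scores; the selection rule, threshold and field names are assumptions chosen only to mirror the tariff-plan example above.

```python
def build_campaign_mart(call_facts, loyalty_summary, min_score=70):
    """Join Core DW facts with Summary DW scores into a campaign-specific mart table."""
    scores = {row["customer_id"]: row["loyalty_score"] for row in loyalty_summary}

    # Count long-distance calls per customer from the fact table.
    long_distance_calls = {}
    for fact in call_facts:
        if fact.get("call_type") == "long_distance":
            cid = fact["customer_id"]
            long_distance_calls[cid] = long_distance_calls.get(cid, 0) + 1

    # Loyal customers who avoid long-distance calls become campaign candidates.
    mart_rows = []
    for customer_id, score in scores.items():
        if score >= min_score and long_distance_calls.get(customer_id, 0) == 0:
            mart_rows.append({"customer_id": customer_id,
                              "loyalty_score": score,
                              "campaign": "long_distance_tariff"})
    return mart_rows
```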
Viable Implementation Options: Big Data EDWs may be implemented using a variety of technologies. Hadoop and MPP DBs (open source or proprietary, Hadoop-based or not) score high. Hadoop is suitable for meeting all the large-scale storage, ETL and data mining requirements needed to build the Main DW and Summary DW. Once built, their storage can be either in Hadoop (e.g. HDFS/Hive) or in an MPP DB (e.g. Greenplum). The choice will mainly be decided by scalability and workload requirements (SQL support, performance and so on).
The Data Mart, encountering more frequent near-real-time query and analytic workloads, is more conveniently placed in MPP DBs or, if the scale permits, even on relational database servers. Typically the marts will source data from the Main and Summary DWs and populate their own schemas.
It is also possible to store the Main and Summary DWs on different kinds of systems (e.g. Main DW on Hadoop and Summary DW on an MPP DB); the choice to do so will depend heavily on how the Data Marts and the applications higher up will benefit.
Data Flow
When we go about building the DW and Data Marts, it is very important to design the data flows in such a way that they achieve the maximum work with minimum overhead. This means populating parts of the model with as much parallelism as possible and with maximum consistency.
Redundancy cannot, and probably should not, be avoided altogether, since avoiding it compromises performance in many ways (both during loading and during analysis). But as data changes, those incremental additions mean the summaries may need to be re-computed, and hence parts of the DW (especially the Data Marts) may fall out of sync with reality. It is possible to counter this situation as long as the marts themselves are also rebuilt (incrementally) along with the changes happening below. Typically there will be a single ETL flow which updates the DW and, as parts get rebuilt, triggers the ‘child’ ETLs to sync the marts.
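In a much-simplified, in-memory form, that parent/child pattern might be orchestrated along these lines; in practice a workflow scheduler would drive the same sequence, and the helper names here are hypothetical.

```python
def refresh_warehouse(new_batch, core_dw, summary_dw, marts):
    """Parent ETL: apply an incremental batch, then trigger the 'child' mart syncs.

    core_dw and summary_dw are plain lists standing in for warehouse tables;
    each entry of marts is a callable that rebuilds one mart from the refreshed layers."""
    core_dw.extend(new_batch)                   # incremental load into the Core DW
    summary_dw[:] = summarize(core_dw)          # recompute the affected summaries
    for sync_mart in marts:                     # child ETLs keep the marts in sync
        sync_mart(core_dw, summary_dw)

def summarize(core_dw):
    """Placeholder summary step: count calls per customer."""
    counts = {}
    for row in core_dw:
        counts[row["customer_id"]] = counts.get(row["customer_id"], 0) + 1
    return [{"customer_id": cid, "call_count": n} for cid, n in counts.items()]
```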
Summary
So far we have seen various aspects of how to build Enterprise Data Warehouses for Big Data. The overall architecture may not be remarkably different from traditional approaches, but the sheer volume and complexity of Big Data requires us to address the topic from new angles. The ultimate goal is to identify the distinct layers, their implementation options and the connecting interfaces so that large-scale data analysis becomes fast, flexible and reasonably accurate.