A Metadata Management Architecture for Open Data

May 06, 2016

A friend and colleague recently raised an interesting aspect of metadata management in the open data world. A large proportion of his data was sourced from the internet (most notably social media and news), and often from external agencies such as research organizations and government departments.

While data acquisition is challenging in itself, given the myriad integration methods ranging from API calls to secure file transfers, the bigger challenge lies in managing the metadata. Efficient management and governance of metadata provides a holistic view of data lineage, the means to consume the data for analysis and, finally, traceability and compliance.

Metadata management thus becomes an important capability for overseeing changes while delivering trusted, secure data in a complex data integration environment, especially when the data is unstructured. A good metadata management architecture plays a central role in holistic data governance.

His problem was to define a logical and physical architecture for such metadata. I decided to pitch in.

With my limited experience in metadata management architectures, and after a couple of discussion sessions, the following logical architecture appeared to cover everything we needed.

The key points worth noting are:

• Unified Metadata Model

Avoid reconciliation of different metadata information at application level
Allow applications to treat different data sources and different metadata using single API
Hide complexities of individual data sources and underlying data model

• Metadata Repository

Stores the metadata for various data sources (catalog and data sets), translated into the internal Unified Metadata Model standard
Performs the necessary translations from Source Metadata (original form) to Target Metadata (exposed form)
Serves as a cache for performance and availability
Needs to be scalable, reliable and highly available

• Data Source Adapter

Abstraction for connecting to different Data Sources; exposes unified APIs to the Repository
Handles the authentication mechanisms applicable for different data sources

• Metadata Query Interface

Interface to query the metadata associated with a Data Asset (e.g. a social media post, a census report, etc.)
The query interface also authenticates the client using the Metadata Authentication module
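To make the division of responsibilities concrete, here is a minimal Python sketch of how the Repository, a Data Source Adapter and a translator could fit together. All class and method names (`DataSourceAdapter`, `fetch_source_metadata`, `MetadataRepository`, `InMemoryAdapter`) are illustrative assumptions rather than part of the design above, and client authentication is left out for brevity.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Tuple


class DataSourceAdapter(ABC):
    """Hypothetical adapter interface: one implementation per data source."""

    @abstractmethod
    def fetch_source_metadata(self, asset_id: str) -> dict:
        """Return the metadata for asset_id in the source's original form."""


class MetadataRepository:
    """Keeps translated metadata and fills gaps through the adapters."""

    def __init__(self) -> None:
        self._adapters: Dict[str, DataSourceAdapter] = {}
        self._translators: Dict[str, Callable[[dict], dict]] = {}
        self._cache: Dict[Tuple[str, str], dict] = {}

    def register(self, source: str, adapter: DataSourceAdapter,
                 translator: Callable[[dict], dict]) -> None:
        self._adapters[source] = adapter
        self._translators[source] = translator

    def get(self, source: str, asset_id: str) -> dict:
        # Serve from the cache when possible; otherwise fetch and translate.
        key = (source, asset_id)
        if key not in self._cache:
            raw = self._adapters[source].fetch_source_metadata(asset_id)
            self._cache[key] = self._translators[source](raw)
        return self._cache[key]


class InMemoryAdapter(DataSourceAdapter):
    """Stand-in for a real source; counts calls so caching is observable."""

    def __init__(self, records: Dict[str, dict]) -> None:
        self.records = records
        self.calls = 0

    def fetch_source_metadata(self, asset_id: str) -> dict:
        self.calls += 1
        return self.records[asset_id]


repo = MetadataRepository()
adapter = InMemoryAdapter({"42": {"headline": "Sample census report"}})
repo.register("news", adapter, lambda raw: {"title": raw["headline"]})
first = repo.get("news", "42")
second = repo.get("news", "42")  # served from the cache, no second fetch
```

The application only ever calls `repo.get(...)`, which is the sense in which a single API hides the individual data sources.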

A high-level flow would look something like the following:

The flow needed to adapt to situations where the data sources retrospectively modify the metadata, which then has to be reconciled in real time to avoid exposing stale metadata models to the consumer.
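One simple way to detect stale entries is to compare the `modified` timestamp held in the repository with the one the source currently reports. The sketch below assumes that the source exposes such a timestamp cheaply and that timestamps are ISO 8601 strings; the function names `is_stale` and `reconcile` are hypothetical.

```python
from datetime import datetime
from typing import Callable, Dict


def is_stale(cached: dict, source_modified: str) -> bool:
    """True when the source reports a later 'modified' time than the cache."""
    return (datetime.fromisoformat(source_modified)
            > datetime.fromisoformat(cached["modified"]))


def reconcile(cache: Dict[str, dict], asset_id: str, source_modified: str,
              fetch_unified: Callable[[str], dict]) -> dict:
    """Refresh the cached entry only if it is missing or out of date."""
    cached = cache.get(asset_id)
    if cached is None or is_stale(cached, source_modified):
        cache[asset_id] = fetch_unified(asset_id)
    return cache[asset_id]


# Illustrative values only.
cache = {"a1": {"title": "Old title", "modified": "2016-05-01T10:00:00+00:00"}}
latest = {"title": "New title", "modified": "2016-05-06T09:30:00+00:00"}
result = reconcile(cache, "a1", latest["modified"], lambda _id: latest)
```

Sources that do not expose a modification time would need a different strategy, such as periodic re-fetching or content hashing.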

The next step was to define a physical architecture, with an actual choice of technologies and software components for the solution. We came up with something like the following:

We decided to give MongoDB a try to act as the Metadata Repository. A few points worth mentioning about this architecture:

MongoDB is suitable for large-scale document storage and retrieval based on field values
A typical metadata document can be stored in MongoDB, which also provides fast querying and caching
Multiple shards can be deployed for scalability, with each shard run as a replica set for availability
The Data Source Adapters are specific to the target Data Sources and handle authentication (e.g. using OAuth2) and querying of the target data sources
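As a rough illustration of how metadata documents might be written to and read from MongoDB, the sketch below builds the filter and update documents for an upsert and passes them to a PyMongo-style collection. The extra `source` field and the helper names are assumptions made for this example; `update_one(..., upsert=True)` and `find(...)` are standard PyMongo `Collection` methods, and the `collection` argument is expected to be such an object.

```python
from typing import Any, Dict, Tuple


def build_upsert(source: str, doc: Dict[str, Any]) -> Tuple[dict, dict]:
    """Build the filter and update documents for one metadata record.

    The identifier is only unique within its data source, so the key
    combines it with an assumed 'source' field.
    """
    key = {"source": source, "dataAssetIdentifier": doc["dataAssetIdentifier"]}
    update = {"$set": {**doc, "source": source}}
    return key, update


def upsert_metadata(collection: Any, source: str, doc: Dict[str, Any]) -> None:
    """Insert the record, or overwrite its fields if it already exists."""
    key, update = build_upsert(source, doc)
    collection.update_one(key, update, upsert=True)


def find_by_keyword(collection: Any, keyword: str) -> list:
    """Return all records whose 'keyword' array contains the given tag."""
    return list(collection.find({"keyword": keyword}))


# Illustrative values only.
key, update = build_upsert(
    "census", {"dataAssetIdentifier": "c-1", "title": "Sample census report"}
)
```

Using an upsert keeps a re-translated record from a source as a single document instead of accumulating duplicates.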

One challenge remained: how to define the Metadata Model and how to make integrating new data sources easy. We came up with the following approach:

We planned to take the majority of the inputs from the Project Open Data Metadata Schema v1.1, as it serves as a comprehensive model for representing metadata about information available from any source (databases, social media, XML feeds, etc.)
The Metadata Model is stored as JSON/BSON documents, which provide an easy-to-use, machine- and human-readable mechanism for storing metadata and also support a dynamic schema
When adopting a new data source, the following components need to be developed and plugged in:

Metadata Translator (Translates the Source Metadata according to the Target Unified Metadata Model)
Data Source Adapter (Helps to forward the Metadata queries to the original data sources)
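A Metadata Translator can be as simple as a field mapping from the source's names to the unified ones. The sketch below assumes a hypothetical social media post format (`id`, `text`, `author`, `created_at`, `hashtags`, `lang`); neither the source field names nor the `translate` function come from a real API.

```python
from typing import Any, Dict, Optional

# Hypothetical source field -> unified field mapping for one data source.
SOCIAL_POST_FIELD_MAP = {
    "id": "dataAssetIdentifier",
    "text": "description",
    "author": "publisher",
    "created_at": "publishDate",
    "hashtags": "keyword",
    "lang": "language",
}


def translate(source_metadata: Dict[str, Any], field_map: Dict[str, str],
              defaults: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Rename mapped fields, drop unmapped ones, and apply source defaults."""
    unified = dict(defaults or {})
    for source_field, target_field in field_map.items():
        if source_field in source_metadata:
            unified[target_field] = source_metadata[source_field]
    return unified


# Illustrative post; 'retweets' has no unified counterpart and is dropped.
post = {
    "id": "p-1001",
    "text": "Road closure announced for the weekend",
    "author": "City Transport Desk",
    "created_at": "2016-05-06",
    "hashtags": ["traffic", "roads"],
    "lang": "en",
    "retweets": 3,
}
unified = translate(post, SOCIAL_POST_FIELD_MAP,
                    defaults={"mediaType": "application/json"})
```

With this shape, supporting a new source mostly means writing a new mapping, plus custom code only where a field needs real conversion (date formats, for example).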

A sample of our Metadata Schema is presented below:

Metadata Field Name	Label	Definition
category	Category	Main thematic category of the data asset.
title	Title	Human-readable name of the data asset. Should be in plain English and include sufficient detail to facilitate search and discovery.
description	Description	Human-readable description (e.g., an abstract) with sufficient detail to enable a user to quickly understand whether the data asset is of interest.
keyword	Tags	Tags (or keywords) help users discover your data asset; please include terms that would be used by technical and non-technical users.
modified	Last Update	Most recent date on which the data asset was changed, updated or modified.
publisher	Publisher	The publishing entity and optionally their parent organization(s).
contactPointName	Contact Name	Contact person’s name for the data asset.
contactPointEmail	Contact Email	Contact person’s email for the data asset.
dataAssetIdentifier	Unique Identifier	A unique identifier for the data asset as maintained with a data source or agency. This identifier is unique in the context of the data source and is a key input when querying the metadata
license	License	Name of the license applicable for consumption of the data
rights	Access Rights	This may include the public access rights indicators such as ‘Public Unrestricted’, ‘Public Restricted’, ‘Private’, ‘Classified’ etc.
geography	Geography	Indicates the geographical applicability of the data asset
temporal	Temporal	Indicates the temporal applicability of the data asset (such as validity period)
accrualPeriodicity	Frequency	The frequency with which the data asset is published.
conformsTo	Data Standard	URI used to identify a standardized specification the data asset conforms to.
format	Format	A human-readable description of the file format of a distribution.
mediaType	Media Type	The machine-readable file format of the data asset. Specified as the MIME Type
describedBy	URL	URL to the data dictionary for the data asset
describedByType	Data Dictionary Type	The machine-readable file format of the data asset’s Data Dictionary (describedBy). Specified as the MIME Type
isPartOf	Collection	The collection of data assets (specified by Data Asset ID) of which this data asset is a subset.
publishDate	Publish Date	Date of formal publication of the data asset
language	Language	The language of the data asset, specified as ISO 639-1 codes
downloadUrl	Download URL	The fully qualified URL where the data asset can be downloaded from. Once invoked the data asset may be available in the MIME format of the data asset
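For illustration, a metadata document following this schema might look like the one built below. Every value is made up, and the `unknown_fields` helper is an assumption added here to show how a document could be checked against the field names in the table before being stored.

```python
import json

# Field names taken from the schema table above.
SCHEMA_FIELDS = {
    "category", "title", "description", "keyword", "modified", "publisher",
    "contactPointName", "contactPointEmail", "dataAssetIdentifier", "license",
    "rights", "geography", "temporal", "accrualPeriodicity", "conformsTo",
    "format", "mediaType", "describedBy", "describedByType", "isPartOf",
    "publishDate", "language", "downloadUrl",
}


def unknown_fields(document: dict) -> list:
    """Return the field names in the document that the schema does not define."""
    return sorted(set(document) - SCHEMA_FIELDS)


# Illustrative document; all values are invented for this example.
sample = {
    "category": "Demographics",
    "title": "Sample census report",
    "description": "Population counts by district for a sample region.",
    "keyword": ["census", "population"],
    "modified": "2016-05-01",
    "publisher": "Example Statistics Office",
    "dataAssetIdentifier": "census-sample-001",
    "rights": "Public Unrestricted",
    "format": "CSV",
    "mediaType": "text/csv",
    "publishDate": "2016-04-15",
    "language": "en",
    "downloadUrl": "https://example.org/data/census-sample.csv",
}

encoded = json.dumps(sample, indent=2)  # the JSON form that would be stored
problems = unknown_fields(sample)
```

Because the schema is dynamic, fields a source cannot supply are simply omitted rather than stored as empty values.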

The design is fairly comprehensive and ready for implementation. However, a large portion of the work on the adapters, metadata translators and upstream query APIs (including authentication) remains to be done, and is probably a topic for discussion some other time.
