A Metadata Management Architecture for Open Data

A friend and colleague of mine raised an interesting aspect of metadata management in the open data world. A large proportion of his data was sourced from the internet (most notably social media and news) and often from external agencies such as research organizations and government departments.

While data acquisition is challenging in itself, given the myriad integration methods ranging from API calls to secure file transfers, a bigger challenge lies in managing the metadata. Efficient management and governance of metadata provides a holistic view of data lineage, the means to consume the data for analysis, and finally traceability and compliance.

Metadata management thus becomes an important capability for overseeing changes while delivering trusted, secure data in a complex data integration environment, especially when the data points are unstructured. A good metadata management architecture therefore plays a central role in holistic data governance.

His problem was to define a logical and physical architecture for such metadata. I decided to pitch in.

Drawing on my limited experience with metadata management architectures and a couple of discussion sessions, we arrived at the following logical architecture, which seemed to cover everything we needed.
  
The key points worth noting are:
Unified Metadata Model
  1. Avoid reconciliation of different metadata information at application level
  2. Allow applications to treat different data sources and different metadata using a single API
  3. Hide complexities of individual data sources and underlying data model
Metadata Repository
  1. Store the metadata for various data sources (catalog and data sets), translated into the internal Unified Metadata Model standard
  2. Perform the necessary translations from Source Metadata (original form) to Target Metadata (exposed form)
  3. Serve as a cache for performance and availability
  4. Be scalable, reliable and highly available
Data Source Adapter
  1. Provide an abstraction for connecting to different Data Sources, exposing unified APIs to the Repository
  2. Handle the authentication mechanisms applicable to the different data sources
Metadata Query Interface
  1. Interface to query the metadata associated with a Data Asset (e.g., a social media post or a census report)
  2. The query interface also authenticates the client using the Metadata Authentication module
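
To make the division of responsibilities concrete, here is a minimal Python sketch of how the four pieces could fit together. It is only an illustration under assumptions: the class names, the token-based check standing in for the Metadata Authentication module, and the in-memory cache are all hypothetical and not part of the actual design.

```python
from abc import ABC, abstractmethod
from typing import Dict


class DataSourceAdapter(ABC):
    """Hides how one data source is reached and authenticated."""

    @abstractmethod
    def fetch_source_metadata(self, asset_id: str) -> dict:
        """Return the metadata for one data asset in its original (source) form."""


class StaticAdapter(DataSourceAdapter):
    """Toy adapter backed by a dict; a real one would call the source's API."""

    def __init__(self, records: Dict[str, dict]):
        self._records = records
        self.calls = 0  # counts round trips to the "source"

    def fetch_source_metadata(self, asset_id: str) -> dict:
        self.calls += 1
        return self._records[asset_id]


class MetadataRepository:
    """Translates source metadata into the unified model and caches the result."""

    def __init__(self, adapters, translators):
        self._adapters = adapters        # source name -> DataSourceAdapter
        self._translators = translators  # source name -> callable(dict) -> dict
        self._cache = {}                 # (source, asset_id) -> unified metadata

    def get(self, source: str, asset_id: str) -> dict:
        key = (source, asset_id)
        if key not in self._cache:
            raw = self._adapters[source].fetch_source_metadata(asset_id)
            self._cache[key] = self._translators[source](raw)
        return self._cache[key]


class MetadataQueryInterface:
    """Single entry point for consumers; authenticates before answering."""

    def __init__(self, repository: MetadataRepository, valid_tokens):
        self._repository = repository
        self._valid_tokens = set(valid_tokens)  # assumption: simple token check

    def query(self, token: str, source: str, asset_id: str) -> dict:
        if token not in self._valid_tokens:
            raise PermissionError("client is not authenticated")
        return self._repository.get(source, asset_id)


if __name__ == "__main__":
    adapter = StaticAdapter({"r1": {"name": "Sample census report", "id": "r1"}})
    repo = MetadataRepository(
        adapters={"census": adapter},
        translators={
            "census": lambda raw: {
                "title": raw["name"],
                "dataAssetIdentifier": raw["id"],
            }
        },
    )
    api = MetadataQueryInterface(repo, valid_tokens={"demo-token"})
    print(api.query("demo-token", "census", "r1"))
```

The consumer only ever sees `MetadataQueryInterface.query`, which is the point of the unified model: the source-specific work stays inside the adapter and the translator.
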
A high-level flow would look something like the following:
The flow needed to adapt to situations where the data sources modify metadata retrospectively; such changes had to be reconciled in real time so that stale metadata was never exposed to the consumer.
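
One simple way to picture the reconciliation step is a freshness check on read. The sketch below is an assumption about how it might work, not the flow we actually drew: it supposes the source offers a cheap way to ask for an asset's last-modified timestamp (`fetch_modified`, hypothetical) separately from a full metadata fetch (`fetch_metadata`, also hypothetical), and that timestamps are ISO 8601 strings.

```python
from datetime import datetime
from typing import Callable, Dict


def is_stale(cached: dict, source_modified: str) -> bool:
    """True when the source reports a later 'modified' time than the cached copy."""
    return datetime.fromisoformat(source_modified) > datetime.fromisoformat(
        cached["modified"]
    )


def reconcile(
    cache: Dict[str, dict],
    asset_id: str,
    fetch_modified: Callable[[str], str],
    fetch_metadata: Callable[[str], dict],
) -> dict:
    """Return up-to-date metadata, refreshing the cache only when needed."""
    cached = cache.get(asset_id)
    if cached is None or is_stale(cached, fetch_modified(asset_id)):
        cache[asset_id] = fetch_metadata(asset_id)
    return cache[asset_id]


if __name__ == "__main__":
    # A dict stands in for the remote data source.
    source = {"r1": {"title": "Report", "modified": "2024-01-01T09:00:00"}}
    cache: Dict[str, dict] = {}

    def fetch_modified(asset_id: str) -> str:
        return source[asset_id]["modified"]

    def fetch_metadata(asset_id: str) -> dict:
        return dict(source[asset_id])

    print(reconcile(cache, "r1", fetch_modified, fetch_metadata))
    # The source edits the asset after the fact; the next read picks it up.
    source["r1"] = {"title": "Report (revised)", "modified": "2024-01-05T10:00:00"}
    print(reconcile(cache, "r1", fetch_modified, fetch_metadata))
```

A check on every read costs one round trip to the source; if the source can push change notifications, the same comparison could be driven by those instead.
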

The next step was to define a physical architecture with actual choices of technologies and software components, and we came up with something like the following:


We decided to give MongoDB a try to act as the Metadata Repository. A few points worth mentioning about this architecture:
  1. MongoDB is suitable for large scale document storage and retrieval based on field values
  2. A typical metadata document can be stored as a single MongoDB document, and MongoDB provides fast queries and in-memory caching of frequently accessed data
  3. Multiple shards can be deployed for scalability, and running each shard as a replica set provides availability
  4. The Data Source Adapters are specific to the target Data Sources and handle authentication (e.g., using OAuth2) and querying of the target data sources
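
As a rough sketch of how the repository might read and write metadata documents, the helpers below use only `replace_one(filter, replacement, upsert=True)` and `find_one(filter)`, which are methods of a PyMongo `Collection`. Everything else is an assumption for illustration: the extra `source` field, the compound key of `source` plus `dataAssetIdentifier`, and the `InMemoryCollection` stand-in that lets the example run without a MongoDB server.

```python
class InMemoryCollection:
    """Stand-in that mimics the two PyMongo Collection methods used below."""

    def __init__(self):
        self._docs = []

    @staticmethod
    def _matches(doc: dict, flt: dict) -> bool:
        # Only simple equality filters are supported in this stand-in.
        return all(doc.get(key) == value for key, value in flt.items())

    def replace_one(self, flt: dict, replacement: dict, upsert: bool = False) -> None:
        for index, doc in enumerate(self._docs):
            if self._matches(doc, flt):
                self._docs[index] = dict(replacement)
                return
        if upsert:
            self._docs.append(dict(replacement))

    def find_one(self, flt: dict):
        for doc in self._docs:
            if self._matches(doc, flt):
                return dict(doc)
        return None


def save_metadata(collection, source: str, doc: dict) -> None:
    """Insert or replace the unified metadata document for one data asset."""
    key = {"source": source, "dataAssetIdentifier": doc["dataAssetIdentifier"]}
    collection.replace_one(key, {**doc, **key}, upsert=True)


def find_metadata(collection, source: str, asset_id: str):
    """Look up one asset's metadata; returns None when it is not stored."""
    return collection.find_one({"source": source, "dataAssetIdentifier": asset_id})


if __name__ == "__main__":
    # With a real deployment this would be a PyMongo collection, for example
    # MongoClient(...)["metadata"]["assets"] (database and collection names
    # here are hypothetical).
    assets = InMemoryCollection()
    save_metadata(assets, "census", {"dataAssetIdentifier": "r1", "title": "Draft"})
    save_metadata(assets, "census", {"dataAssetIdentifier": "r1", "title": "Final"})
    print(find_metadata(assets, "census", "r1"))
```

Keying on the source together with the identifier reflects the fact that an identifier is only unique within its own data source.
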
One challenge remained: how to define the Metadata Model and how to make it easy to integrate new data sources. We came up with the following approach:
  1. We planned to take the majority of the inputs from the Project Open Data Metadata Schema v1.1, as it serves as a comprehensive model for representing metadata about information available from any source (databases, social media, XML feeds, etc.)
  2. The Metadata Model is stored as JSON/BSON documents, which provide an easy-to-use format that is both machine- and human-readable and supports a dynamic schema
  3. When adapting to a new data source, the following components need to be developed and plugged in:
  • Metadata Translator (Translates the Source Metadata according to the Target Unified Metadata Model)
  • Data Source Adapter (Helps to forward the Metadata queries to the original data sources)
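
To illustrate what "plugging in" a Metadata Translator could look like, here is a small sketch that registers one translator per source and maps a source record onto unified field names. The registry, the `social_media` source name, and the shape of the raw post (`id`, `text`, `hashtags`, `edited_at`, `author`) are all hypothetical; a real source would have its own field names.

```python
from typing import Callable, Dict

# Registry of translators, keyed by data source name.
TRANSLATORS: Dict[str, Callable[[dict], dict]] = {}


def register_translator(source: str):
    """Decorator that plugs a translator in for one data source."""

    def decorator(func: Callable[[dict], dict]) -> Callable[[dict], dict]:
        TRANSLATORS[source] = func
        return func

    return decorator


@register_translator("social_media")
def translate_post(raw: dict) -> dict:
    """Map a (hypothetical) social media post onto the unified model."""
    return {
        "dataAssetIdentifier": raw["id"],
        "title": raw["text"][:60],  # short title derived from the post text
        "description": raw["text"],
        "keyword": list(raw.get("hashtags", [])),
        "modified": raw["edited_at"],
        "publisher": raw["author"],
    }


def translate(source: str, raw: dict) -> dict:
    """Translate source metadata using whichever translator is registered."""
    try:
        translator = TRANSLATORS[source]
    except KeyError:
        raise LookupError(f"no translator registered for source {source!r}") from None
    return translator(raw)


if __name__ == "__main__":
    post = {
        "id": "p-1",
        "text": "Flood warning issued for the river basin",
        "hashtags": ["flood", "weather"],
        "edited_at": "2024-01-05T10:00:00",
        "author": "Example News",
    }
    print(translate("social_media", post))
```

Supporting a new source then means writing one more decorated function, with no change to the code that calls `translate`.
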
A sample of our Metadata Schema is presented below:

| Metadata Field Name | Label | Definition |
|---|---|---|
| category | Category | Main thematic category of the data asset. |
| title | Title | Human-readable name of the data asset. Should be in plain English and include sufficient detail to facilitate search and discovery. |
| description | Description | Human-readable description (e.g., an abstract) with sufficient detail to enable a user to quickly understand whether the data asset is of interest. |
| keyword | Tags | Tags (or keywords) help users discover your data asset; please include terms that would be used by technical and non-technical users. |
| modified | Last Update | Most recent date on which the data asset was changed, updated or modified. |
| publisher | Publisher | The publishing entity and optionally their parent organization(s). |
| contactPointName | Contact Name | Contact person’s name for the data asset. |
| contactPointEmail | Contact Email | Contact person’s email for the data asset. |
| dataAssetIdentifier | Unique Identifier | A unique identifier for the data asset as maintained with a data source or agency. This identifier is unique in the context of the data source and is a key input when querying the metadata. |
| license | License | Name of the license applicable for consumption of the data. |
| rights | Access Rights | This may include public access rights indicators such as ‘Public Unrestricted’, ‘Public Restricted’, ‘Private’, ‘Classified’, etc. |
| geography | Geography | Indicates the geographical applicability of the data asset. |
| temporal | Temporal | Indicates the temporal applicability of the data asset (such as validity period). |
| accrualPeriodicity | Frequency | The frequency with which the data asset is published. |
| conformsTo | Data Standard | URI used to identify a standardized specification the data asset conforms to. |
| format | Format | A human-readable description of the file format of a distribution. |
| mediaType | Media Type | The machine-readable file format of the data asset, specified as the MIME type. |
| describedBy | URL | URL to the data dictionary for the data asset. |
| describedByType | Data Dictionary Type | The machine-readable file format of the data asset’s Data Dictionary (describedBy), specified as the MIME type. |
| isPartOf | Collection | The collection of data assets (specified by Data Asset ID) of which this data asset is a subset. |
| publishDate | Publish Date | Date of formal publication of the data asset. |
| language | Language | The language of the data asset, specified as ISO 639-1 codes. |
| downloadUrl | Download URL | The fully qualified URL from which the data asset can be downloaded. Once invoked, the data asset may be available in the MIME format of the data asset. |
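
For a sense of what one stored document might look like, the sketch below builds a metadata document from a subset of the fields above and checks it for missing values. All values are made up for illustration, and the choice of which fields count as required (`REQUIRED_FIELDS`) is purely an assumption; the schema above does not mark any field as mandatory.

```python
import json

# Assumption: these fields are treated as mandatory for this illustration only.
REQUIRED_FIELDS = (
    "title",
    "description",
    "modified",
    "publisher",
    "dataAssetIdentifier",
)

# Hypothetical example document using a subset of the schema fields.
SAMPLE_DOCUMENT = {
    "category": "Demographics",
    "title": "Sample census report",
    "description": "Illustrative population counts by district.",
    "keyword": ["census", "population"],
    "modified": "2024-01-05",
    "publisher": "Example Statistics Office",
    "contactPointName": "Jane Doe",
    "contactPointEmail": "jane.doe@example.org",
    "dataAssetIdentifier": "census-sample-001",
    "rights": "Public Unrestricted",
    "mediaType": "text/csv",
    "language": "en",
    "downloadUrl": "https://example.org/data/census-sample-001.csv",
}


def missing_fields(doc: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required field names that are absent or empty in doc."""
    return [name for name in required if not doc.get(name)]


if __name__ == "__main__":
    print(json.dumps(SAMPLE_DOCUMENT, indent=2))
    print("missing:", missing_fields(SAMPLE_DOCUMENT))
```

Because the documents are schemaless JSON/BSON, a check like this would have to live in the translator or the repository layer rather than in the database.
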


The design is fairly comprehensive and ready for implementation. However, a large portion of the work on the adapters, metadata translators and upstream query APIs (including authentication) remains to be done, and is probably a topic for another time.