A Metadata Management Architecture for Open Data
A friend and colleague of mine raised an interesting problem in metadata management for open data. A large proportion of his data was sourced from the internet (most notably social media and news), and much of the rest came from external agencies such as research organizations and government departments.
While data acquisition is challenging in itself, given the myriad integration methods ranging from API calls to secure file transfers, the bigger challenge lies in managing the metadata. Efficient management and governance of metadata provides a holistic view of data lineage, the means to consume the data for analysis, and finally traceability and compliance.
Metadata management thus becomes an important capability for overseeing changes while delivering trusted, secure data in a complex data integration environment, especially when the data is unstructured. A good metadata management architecture therefore plays a central role in holistic data governance.
His problem was to define a logical and physical architecture for such metadata. I decided to pitch in.
With my limited experience in metadata management architectures, and after a couple of discussion sessions, the following logical architecture seemed to cover pretty much everything and to serve our purpose.
The key points worth noting are:
• Unified Metadata Model
- Avoid reconciling different metadata information at the application level
- Allow applications to treat different data sources and different metadata through a single API
- Hide the complexities of individual data sources and their underlying data models
• Metadata Repository
- Store the metadata for various data sources (catalog and data sets), translated into the internal Unified Metadata Model standard
- Perform the necessary translations from Source Metadata (original form) to Target Metadata (exposed form)
- Serve as a cache for performance and availability considerations
- Must be scalable, reliable and highly available
• Data Source Adapter
- Abstraction for connecting to different Data Sources; exposes unified APIs to the Repository
- Handles the authentication mechanisms applicable for different data sources
• Metadata Query Interface
- Interface to query the metadata associated with a Data Asset (e.g. a social media post or a census report)
- The query interface also authenticates the client using the Metadata Authentication module
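To make the four components above a little more concrete, here is a minimal sketch of how they might fit together. It is only an illustration under stated assumptions: the class names, the in-memory "news" source, its field names (`id`, `headline`, `outlet`) and the token check standing in for the Metadata Authentication module are all hypothetical, not part of the actual design.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Tuple


class DataSourceAdapter(ABC):
    """Connects to one data source and returns its metadata in source form."""

    @abstractmethod
    def fetch_source_metadata(self, asset_id: str) -> dict:
        ...


class InMemoryNewsAdapter(DataSourceAdapter):
    """Hypothetical adapter backed by a dict instead of a real API."""

    def __init__(self, records: Dict[str, dict]) -> None:
        self._records = records
        self.calls = 0  # counts round trips to the "source"

    def fetch_source_metadata(self, asset_id: str) -> dict:
        self.calls += 1
        return self._records[asset_id]


def translate_news(source: dict) -> dict:
    """Hypothetical translator: source field names -> unified field names."""
    return {
        "dataAssetIdentifier": source["id"],
        "title": source["headline"],
        "publisher": source["outlet"],
    }


class MetadataRepository:
    """Stores translated metadata and serves repeat queries from its cache."""

    def __init__(self) -> None:
        self._adapters: Dict[str, DataSourceAdapter] = {}
        self._translators: Dict[str, Callable[[dict], dict]] = {}
        self._cache: Dict[Tuple[str, str], dict] = {}

    def register(self, source_name: str, adapter: DataSourceAdapter,
                 translator: Callable[[dict], dict]) -> None:
        self._adapters[source_name] = adapter
        self._translators[source_name] = translator

    def get(self, source_name: str, asset_id: str) -> dict:
        key = (source_name, asset_id)
        if key not in self._cache:
            raw = self._adapters[source_name].fetch_source_metadata(asset_id)
            self._cache[key] = self._translators[source_name](raw)
        return self._cache[key]


class MetadataQueryInterface:
    """Single entry point for clients; authenticates before querying."""

    def __init__(self, repository: MetadataRepository,
                 valid_tokens: Iterable[str]) -> None:
        self._repository = repository
        self._valid_tokens = set(valid_tokens)  # stand-in for real authentication

    def query(self, token: str, source_name: str, asset_id: str) -> dict:
        if token not in self._valid_tokens:
            raise PermissionError("client is not authenticated")
        return self._repository.get(source_name, asset_id)


adapter = InMemoryNewsAdapter(
    {"n-1": {"id": "n-1", "headline": "Census released", "outlet": "Example News"}}
)
repository = MetadataRepository()
repository.register("news", adapter, translate_news)
api = MetadataQueryInterface(repository, valid_tokens={"demo-token"})
result = api.query("demo-token", "news", "n-1")
```

The client only ever sees unified field names such as `title`, and a second query for the same asset is answered from the repository cache without another call to the adapter.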
A high-level flow would look something like the following:
The flow needed to adapt to situations where a data source retrospectively modifies its metadata; such changes had to be reconciled in real time so that stale metadata models were never exposed to the consumer.
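One simple way to approach this reconciliation is to compare the cached `modified` timestamp with the one currently reported by the source, and to refetch only when the source copy is newer. The sketch below assumes a pull-based check at query time and ISO 8601 timestamps; the helper names and the two callables (`fetch_modified`, `fetch_full`) are hypothetical and would be supplied by a data source adapter.

```python
from datetime import datetime, timezone
from typing import Callable


def parse_modified(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; naive values are treated as UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def reconcile(cached: dict,
              fetch_modified: Callable[[], str],
              fetch_full: Callable[[], dict]) -> dict:
    """Return fresh metadata, refetching only if the source copy is newer.

    fetch_modified() returns the source's current `modified` timestamp.
    fetch_full() returns the full, already translated metadata document.
    """
    source_modified = parse_modified(fetch_modified())
    if source_modified > parse_modified(cached["modified"]):
        return fetch_full()
    return cached


cached_doc = {"title": "Old title", "modified": "2015-03-01T10:00:00+00:00"}
fresh_doc = {"title": "New title", "modified": "2015-03-02T08:30:00+00:00"}

current = reconcile(cached_doc, lambda: fresh_doc["modified"], lambda: fresh_doc)
```

This keeps the common case cheap (one lightweight timestamp lookup), at the cost of one extra call to the source per query. A source that can push change notifications would avoid that call, but that is a different design from the one sketched here.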
The next step was to define a physical architecture, with actual choices of technologies and software components, and we came up with something like the following:
We decided to try MongoDB as the Metadata Repository. A few points worth mentioning about this architecture:
- MongoDB is suitable for large scale document storage and retrieval based on field values
- A typical metadata document can be stored in MongoDB, which also provides fast querying and caching
- Multiple shards can be deployed for scalability, and running each shard as a replica set takes care of availability
- The Data Source Adapters are specific to the target Data Sources and handle authentication (e.g. using OAuth2) and querying of the target data sources
- We planned to take the majority of the inputs from the Project Open Data Metadata Schema v1.1, as it serves as a comprehensive model for representing metadata about information available from any source (databases, social media, XML feeds etc.)
- The Metadata Model is stored as JSON/BSON documents, which provide an easy-to-use, machine- and human-readable way to store metadata and also support a dynamic schema
- While adapting to a new data source, the following components need to be developed and plugged in:
  - Metadata Translator (translates the Source Metadata into the Target Unified Metadata Model)
  - Data Source Adapter (forwards the metadata queries to the original data sources)
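As an illustration of what a Metadata Translator for a new source might look like, here is a minimal sketch that renames source fields to unified field names and serializes the result as JSON. The source field names (`post_id`, `hashtags`, `likes` and so on) and the mapping itself are made up for the example; only the target names come from the unified model.

```python
import json

# Hypothetical mapping from one social media source's field names
# to field names in the unified metadata model.
SOCIAL_POST_FIELD_MAP = {
    "post_id": "dataAssetIdentifier",
    "text": "description",
    "author": "publisher",
    "hashtags": "keyword",
    "edited_at": "modified",
    "lang": "language",
}


def translate(source_metadata: dict, field_map: dict) -> dict:
    """Rename mapped fields; fields without a mapping are dropped."""
    return {
        field_map[name]: value
        for name, value in source_metadata.items()
        if name in field_map
    }


source = {
    "post_id": "p-42",
    "text": "New census figures are out",
    "author": "Example Agency",
    "hashtags": ["census", "opendata"],
    "edited_at": "2015-03-02T08:30:00+00:00",
    "lang": "en",
    "likes": 7,  # no unified equivalent, so it is not carried over
}

document = translate(source, SOCIAL_POST_FIELD_MAP)
print(json.dumps(document, indent=2))
```

A document shaped like this could then be written to the repository. With the pymongo driver, for instance, an upsert keyed on the identifier would be along the lines of `collection.replace_one({"dataAssetIdentifier": document["dataAssetIdentifier"]}, document, upsert=True)`, where `collection` is whatever collection holds the metadata.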
| Metadata Field Name | Label | Definition |
|---|---|---|
| category | Category | Main thematic category of the data asset. |
| title | Title | Human-readable name of the data asset. Should be in plain English and include sufficient detail to facilitate search and discovery. |
| description | Description | Human-readable description (e.g., an abstract) with sufficient detail to enable a user to quickly understand whether the data asset is of interest. |
| keyword | Tags | Tags (or keywords) help users discover your data asset; please include terms that would be used by technical and non-technical users. |
| modified | Last Update | Most recent date on which the data asset was changed, updated or modified. |
| publisher | Publisher | The publishing entity and optionally their parent organization(s). |
| contactPointName | Contact Name | Contact person's name for the data asset. |
| contactPointEmail | Contact Email | Contact person's email for the data asset. |
| dataAssetIdentifier | Unique Identifier | A unique identifier for the data asset as maintained with a data source or agency. This identifier is unique in the context of the data source and is a key input when querying the metadata. |
| license | License | Name of the license applicable for consumption of the data. |
| rights | Access Rights | This may include public access rights indicators such as 'Public Unrestricted', 'Public Restricted', 'Private', 'Classified' etc. |
| geography | Geography | Indicates the geographical applicability of the data asset. |
| temporal | Temporal | Indicates the temporal applicability of the data asset (such as validity period). |
| accrualPeriodicity | Frequency | The frequency with which the data asset is published. |
| conformsTo | Data Standard | URI used to identify a standardized specification the data asset conforms to. |
| format | Format | A human-readable description of the file format of a distribution. |
| mediaType | Media Type | The machine-readable file format of the data asset. Specified as the MIME type. |
| describedBy | URL | URL to the data dictionary for the data asset. |
| describedByType | Data Dictionary Type | The machine-readable file format of the data asset's Data Dictionary (describedBy). Specified as the MIME type. |
| isPartOf | Collection | The collection of data assets (specified by Data Asset ID) of which this data asset is a subset. |
| publishDate | Publish Date | Date of formal publication of the data asset. |
| language | Language | The language of the data asset, specified as ISO 639-1 codes. |
| downloadUrl | Download URL | The fully qualified URL from which the data asset can be downloaded. Once the URL is invoked, the data asset may be available in the MIME format of the data asset. |
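Since MongoDB does not enforce a schema by default, a light check on each document before it is stored can catch misspelled field names early. The sketch below uses the field names from the model above. Treating `dataAssetIdentifier` as mandatory is my assumption (it is the key used when querying), and the language check only tests the two-letter shape of an ISO 639-1 code, not whether the code actually exists.

```python
import re

# Field names of the unified metadata model described above.
UNIFIED_FIELDS = {
    "category", "title", "description", "keyword", "modified", "publisher",
    "contactPointName", "contactPointEmail", "dataAssetIdentifier", "license",
    "rights", "geography", "temporal", "accrualPeriodicity", "conformsTo",
    "format", "mediaType", "describedBy", "describedByType", "isPartOf",
    "publishDate", "language", "downloadUrl",
}


def check_document(document: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name in sorted(set(document) - UNIFIED_FIELDS):
        problems.append(f"unknown field: {name}")
    # Assumption: the identifier is required because queries are keyed on it.
    if not document.get("dataAssetIdentifier"):
        problems.append("missing dataAssetIdentifier")
    language = document.get("language")
    if language is not None:
        looks_valid = isinstance(language, str) and re.fullmatch(r"[a-z]{2}", language)
        if not looks_valid:
            problems.append(f"language is not a two-letter code: {language}")
    return problems


good = {
    "dataAssetIdentifier": "census-2011",
    "title": "Census report",
    "language": "en",
    "mediaType": "text/csv",
}
bad = {"titel": "Typo in the field name", "language": "english"}

good_problems = check_document(good)
bad_problems = check_document(bad)
```

Here `good_problems` comes back empty, while `bad_problems` reports the misspelled `titel` field, the missing identifier and the malformed language code.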
The design is fairly comprehensive and ready for implementation. However, a large portion of the work on the adapters, metadata translators and upstream query APIs (including authentication) remains to be done, and is probably a topic for discussion some other time.