Solving Data Sufficiency Issues in Machine Learning Projects (Part 3)
The AutoML Conundrum: Introducing the Semantic Intelligence Engine
Welcome to Part 3 of this series on how we can employ semantic techniques to speed up machine learning projects while ensuring better outcomes, because the data we start with is the closest possible match to the business requirement.
You can find Part 1 and Part 2 here and here respectively.
In this part, let us examine the question: how can we solve Data Sufficiency issues in AutoML pipelines? We plan to introduce a hypothetical component called the "Semantic Intelligence Engine".
To begin with, what is Data Sufficiency in the context of AutoML? It comes down to two points:
- Having sufficient data to proceed with an ML task
- Having enough knowledge to fill in the gaps in data if required
Together, these enable an ecosystem for cross-context (or even cross-domain) application of data assets in the long run.
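To make the first point concrete, here is a minimal sketch of a data sufficiency check that could run before data is handed to an AutoML platform. All names and thresholds below are illustrative assumptions, not part of any specific platform's API:

```python
def check_sufficiency(records: list[dict], required_fields: list[str],
                      min_rows: int = 1000, max_missing_ratio: float = 0.2) -> dict:
    """Report whether a dataset is sufficient for an ML task.

    Thresholds are illustrative; real values depend on the task at hand.
    """
    n = len(records)
    # Fields the business problem needs but no record provides at all.
    missing_fields = [f for f in required_fields
                      if not any(f in r for r in records)]
    sparse_fields = []
    for f in required_fields:
        if f in missing_fields:
            continue
        missing = sum(1 for r in records if r.get(f) is None)
        if n and missing / n > max_missing_ratio:
            sparse_fields.append(f)  # present, but too many gaps to trust
    return {
        "enough_rows": n >= min_rows,
        "missing_fields": missing_fields,  # gaps to fill from other sources
        "sparse_fields": sparse_fields,
        "sufficient": n >= min_rows and not missing_fields and not sparse_fields,
    }
```

A report like this separates the two sufficiency concerns above: `enough_rows` covers having sufficient data to proceed, while `missing_fields` and `sparse_fields` identify the gaps that knowledge from elsewhere would have to fill.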
In this post we present the idea of a "Semantic Intelligence Engine" (SIE™) which sits external to an AutoML / traditional machine learning platform.
The SIE™ aims to make an existing ML platform easier to use by simplifying data preparation and data reconciliation tasks, and it uses semantic data processing (which in turn may include AI, NLP, ontologies, and so on) to accomplish these objectives.
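As a rough sketch of the idea, the SIE could expose a small facade that callers use before and after the ML platform runs, without touching the platform internals. Every name here is hypothetical; a real engine would back the glossary with an ontology rather than a plain dictionary:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticIntelligenceEngine:
    """Hypothetical SIE facade: sits outside the ML platform and
    translates between business terms and physical dataset columns."""
    # business term -> known physical column names across data sources
    glossary: dict = field(default_factory=dict)

    def discover(self, available_columns: list[str],
                 business_terms: list[str]) -> dict:
        """Map each business term to a physical column, if one is known."""
        return {
            term: next((c for c in available_columns
                        if c in self.glossary.get(term, [])), None)
            for term in business_terms
        }

    def reconcile(self, record: dict, mapping: dict) -> dict:
        """Rename a record's fields to the business vocabulary a
        pre-trained model was built against."""
        return {term: record.get(col) for term, col in mapping.items() if col}
```

The point of keeping the engine behind a facade like this is precisely the non-intrusiveness discussed later: the ML platform never changes; only the data flowing into it does.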
Let's see how this can be achieved.
First, what does a typical AutoML pipeline look like? Consider the following picture:
As we can see, the core ML platform modules are quite cohesive, and very little intrusion is allowed or expected. This raises an issue: how do we ensure that the data going into the platform is the most relevant to the business problem we want to solve? In other words:
- Data integration and modeling are often out of scope for the AutoML platform, because those decisions need to be taken based solely on data availability and business intent
- AutoML platforms are good at model selection and hyper-parameter tuning, but not very advanced in terms of feature selection and data preparation
Challenge 1: How can we reduce data discovery, and hence data preparation, time-frames? (a.k.a. the Sufficient Data issue)
- Feature engineering is a repetitive, semi-automatic process: it needs human assistance and changes depending on the business data available
- Reusing pre-trained models is difficult
Challenge 2: How can we reconcile data in new business contexts for pre-trained models? (a.k.a. the Filling the Data Gaps issue)
Let's see how we can address the first challenge:
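One way to shorten data discovery is to let the engine suggest which physical columns may satisfy each business term. The sketch below uses simple string similarity as a stand-in for richer ontology-based matching; the business terms and cutoff are illustrative assumptions:

```python
from difflib import get_close_matches

# Business terms the problem statement asks for (illustrative).
BUSINESS_TERMS = ["customer age", "annual income", "churn flag"]

def discover_columns(dataset_columns: list[str], cutoff: float = 0.6) -> dict:
    """Suggest a physical column for each business term, or None
    when nothing in the dataset is close enough."""
    # Normalize physical names so "customer_age" can match "customer age".
    normalized = {c: c.replace("_", " ").lower() for c in dataset_columns}
    suggestions = {}
    for term in BUSINESS_TERMS:
        matches = get_close_matches(term, list(normalized.values()),
                                    n=1, cutoff=cutoff)
        suggestions[term] = next(
            (col for col, norm in normalized.items() if norm in matches), None)
    return suggestions
```

A `None` entry is itself useful output: it tells us early, before any model training, that a data gap exists for that business term.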
And for making use of pre-trained models in newer, and possibly different, contexts, we take the following approach:
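The gap-filling step for a pre-trained model can be sketched as a semantic mapping layer that renames and converts incoming fields into the vocabulary and units the model was trained on. The schema, field names, and conversions below are all illustrative assumptions (including the fixed exchange rate):

```python
# The feature schema a pre-trained model expects, mapped to the fields
# a new business context actually provides (illustrative).
MODEL_SCHEMA = {
    "age_years": ("age_months", lambda v: v / 12),    # new source stores months
    "income_usd": ("income_eur", lambda v: v * 1.1),  # assumed fixed FX rate
}

def to_model_features(record: dict) -> dict:
    """Translate a record from the new business context into the
    feature space of the pre-trained model."""
    features = {}
    for model_feature, (source_field, convert) in MODEL_SCHEMA.items():
        if source_field in record:
            features[model_feature] = convert(record[source_field])
    return features
```

Because the mapping lives outside the model, the same pre-trained model can be pointed at a new context by swapping in a different schema, which is exactly the reusability goal discussed next.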
We can now visualize the whole AutoML pipeline, slightly modified, with our "Semantic Intelligence Engine" introduced in the right place:
So what do we finally achieve? In summary:
- Faster Time to Market: Explore, discover and assimilate data sources based on semantic modeling and subsequently develop machine learning / AI applications faster
- Reusability: AutoML platforms / data scientists can offer reusable pre-trained models without being tied to specific features
- Non-Intrusive: Abstract and automate data-to-knowledge-to-data transformations using domain-specific Semantic Intelligence Units, without making direct changes to the AutoML platform
In closing, please go through the earlier parts of this series to see how, from a technical architecture perspective, the Data Sufficiency issue can be resolved using semantic technologies. The same approach may be applied to build the Semantic Intelligence Engine outlined in this post and to augment AutoML pipelines.