Solving Data Sufficiency Issues in Machine Learning Projects (Part 3)
The AutoML Conundrum: Introducing the Semantic Intelligence Engine
Welcome to Part 3 of this series on how we can employ semantic techniques to speed up machine learning projects while ensuring better outcomes, because the data we start with is the closest possible match to the business requirement.
You can find Part 1 and Part 2 here and here respectively.
In this part, let us examine the question: how can we solve Data Sufficiency issues in AutoML pipelines? We plan to introduce a hypothetical component called the "Semantic Intelligence Engine".
To begin with, what is Data Sufficiency in the context of AutoML? It comes down to two points:
- Having sufficient data to proceed with an ML task
- Having enough knowledge to fill in the gaps in data if required
Together, these enable an ecosystem for cross-context (or even cross-domain) application of data assets in the long run.
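To make the first point concrete, here is a minimal sketch of a data sufficiency check that could run before data is handed to an AutoML platform. All names and thresholds below are illustrative assumptions, not part of any specific platform's API:

```python
def check_sufficiency(records: list[dict], required_fields: list[str],
                      min_rows: int = 1000, max_missing_ratio: float = 0.2) -> dict:
    """Report whether a dataset is sufficient for an ML task.

    Thresholds are illustrative; real values depend on the task at hand.
    """
    n = len(records)
    # Fields the business problem needs but no record provides at all.
    missing_fields = [f for f in required_fields
                      if not any(f in r for r in records)]
    sparse_fields = []
    for f in required_fields:
        if f in missing_fields:
            continue
        missing = sum(1 for r in records if r.get(f) is None)
        if n and missing / n > max_missing_ratio:
            sparse_fields.append(f)  # present, but too many gaps to trust
    return {
        "enough_rows": n >= min_rows,
        "missing_fields": missing_fields,  # gaps to fill from other sources
        "sparse_fields": sparse_fields,
        "sufficient": n >= min_rows and not missing_fields and not sparse_fields,
    }
```

A report like this separates the two sufficiency concerns above: `enough_rows` covers having sufficient data to proceed, while `missing_fields` and `sparse_fields` identify the gaps that knowledge from elsewhere would have to fill.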
In this post we present the idea of a "Semantic Intelligence Engine" (SIE™) which sits external to an AutoML / traditional machine learning platform.
The SIE™ aims to make an existing ML platform easier to use by simplifying data preparation and data reconciliation tasks, and it uses semantic data processing (which in turn may include AI, NLP, ontologies, and so on) to accomplish these objectives.
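As a rough sketch of the idea, the SIE could expose a small facade that callers use before and after the ML platform runs, without touching the platform internals. Every name here is hypothetical; a real engine would back the glossary with an ontology rather than a plain dictionary:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticIntelligenceEngine:
    """Hypothetical SIE facade: sits outside the ML platform and
    translates between business terms and physical dataset columns."""
    # business term -> known physical column names across data sources
    glossary: dict = field(default_factory=dict)

    def discover(self, available_columns: list[str],
                 business_terms: list[str]) -> dict:
        """Map each business term to a physical column, if one is known."""
        return {
            term: next((c for c in available_columns
                        if c in self.glossary.get(term, [])), None)
            for term in business_terms
        }

    def reconcile(self, record: dict, mapping: dict) -> dict:
        """Rename a record's fields to the business vocabulary a
        pre-trained model was built against."""
        return {term: record.get(col) for term, col in mapping.items() if col}
```

The point of keeping the engine behind a facade like this is precisely the non-intrusiveness discussed later: the ML platform never changes; only the data flowing into it does.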
Let's see how this can be achieved.
First, what does a typical AutoML pipeline look like? Consider the following picture:
As we can see, the core ML platform modules are quite cohesive, and very little intrusion is allowed or expected. This raises an issue: how do we ensure that the data going into the platform is the most relevant to the business problem we want to solve? In other words:
- Data integration and modeling are often out of scope for the AutoML platform, because those decisions need to be taken based solely on data availability and business intent
- AutoML platforms are good at model selection and hyper-parameter tuning, but not very advanced in terms of feature selection and data preparation
Challenge 1: How can we reduce data discovery, and hence data preparation, time-frames? (a.k.a. the Sufficient Data issue)
- Feature engineering is a repetitive, semi-automatic process: it needs human assistance and changes depending on the business data available
- Reusing pre-trained models is difficult
Challenge 2: How can we reconcile data in new business contexts for pre-trained models? (a.k.a. the Filling the Data Gaps issue)
Let's see how we can address the first challenge:
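One way to shorten data discovery is to let the engine suggest which physical columns may satisfy each business term. The sketch below uses simple string similarity as a stand-in for richer ontology-based matching; the business terms and cutoff are illustrative assumptions:

```python
from difflib import get_close_matches

# Business terms the problem statement asks for (illustrative).
BUSINESS_TERMS = ["customer age", "annual income", "churn flag"]

def discover_columns(dataset_columns: list[str], cutoff: float = 0.6) -> dict:
    """Suggest a physical column for each business term, or None
    when nothing in the dataset is close enough."""
    # Normalize physical names so "customer_age" can match "customer age".
    normalized = {c: c.replace("_", " ").lower() for c in dataset_columns}
    suggestions = {}
    for term in BUSINESS_TERMS:
        matches = get_close_matches(term, list(normalized.values()),
                                    n=1, cutoff=cutoff)
        suggestions[term] = next(
            (col for col, norm in normalized.items() if norm in matches), None)
    return suggestions
```

A `None` entry is itself useful output: it tells us early, before any model training, that a data gap exists for that business term.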
And for making use of pre-trained models in newer, and possibly different, contexts, we take the following approach:
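The gap-filling step for a pre-trained model can be sketched as a semantic mapping layer that renames and converts incoming fields into the vocabulary and units the model was trained on. The schema, field names, and conversions below are all illustrative assumptions (including the fixed exchange rate):

```python
# The feature schema a pre-trained model expects, mapped to the fields
# a new business context actually provides (illustrative).
MODEL_SCHEMA = {
    "age_years": ("age_months", lambda v: v / 12),    # new source stores months
    "income_usd": ("income_eur", lambda v: v * 1.1),  # assumed fixed FX rate
}

def to_model_features(record: dict) -> dict:
    """Translate a record from the new business context into the
    feature space of the pre-trained model."""
    features = {}
    for model_feature, (source_field, convert) in MODEL_SCHEMA.items():
        if source_field in record:
            features[model_feature] = convert(record[source_field])
    return features
```

Because the mapping lives outside the model, the same pre-trained model can be pointed at a new context by swapping in a different schema, which is exactly the reusability goal discussed next.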
We can now visualize the whole AutoML pipeline, slightly modified, with our "Semantic Intelligence Engine" introduced in the right place:
So what do we finally achieve? In summary:
- Faster Time to Market: Explore, discover and assimilate data sources based on semantic modeling and subsequently develop machine learning / AI applications faster
- Reusability: AutoML platforms / data scientists can offer reusable pre-trained models without being tied to specific features
- Non-Intrusive: Abstract and automate data-to-knowledge-to-data transformations using domain-specific Semantic Intelligence Units, without making direct changes to the AutoML platform
In closing, please go through the earlier parts of this series to see how, from a technical architecture perspective, the Data Sufficiency issue can be resolved using semantic technologies. The same approach may be applied to build the Semantic Intelligence Engine outlined in this post and to augment AutoML pipelines.