Solving Data Sufficiency Issues in Machine Learning Projects (Part 1)

Many ML projects today are unpredictable because of a lack of clarity about the data.
Whether the available data is sufficient for the desired use case is often not known until several iterations in, and by then it is too late. Many ML projects fail at a late stage for this reason.

Assessing data sufficiency early can prevent ML projects from failing after a lot of effort has already been put in.
It also helps avoid attempting unrealistic use cases before data availability checks are in place.

I would like to summarize the whole ML process as follows; this is essentially the CRISP-DM process.

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Development
5. Model Evaluation
6. Model Deployment

The key problem is that there is a disconnect between step 3 and step 4. In many cases, step 4 is carried out separately, and we try to retrofit a model chosen from prior knowledge or experience onto the data prepared in step 3. This is when we discover that there is a mismatch between what we have designed for and what we actually have.

So we plan to introduce an intermediate step, Step 3.X (let's call it Data Reconciliation), which acts as a bridge between step 3 and step 4.

3.X: Data Reconciliation
The purpose of this step is to minimize the deviation between the modeling data attributes and the data that actually exists.

Inputs: The data models from both the data preparation and the analytical model development stages (model development is a more elaborate step that involves analytical data/attribute modeling, algorithm selection, data sampling, training, validation, and so on).

Output: Possible sources (data assets such as tables, attributes, files, data fields, etc.) from the prepared data that fit into the analytical data model.

Advantage: You know whether you have sufficient actual data to proceed with your model. If you don't, you learn which data assets are most likely to fill the gap, and whether you need to remodel your analytical model's data attributes (features) so that they fit the data reality better. This saves a lot of effort and increases the likelihood that the machine learning project succeeds.
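As a minimal sketch of what a reconciliation check could look like, the snippet below compares the features an analytical model requires against the attributes actually present in the prepared data, using simple string similarity as a stand-in for real semantic matching. The feature and attribute names, and the 0.6 threshold, are illustrative assumptions, not part of any real system.

```python
from difflib import SequenceMatcher

def reconcile(required_features, available_attributes, threshold=0.6):
    """For each feature the model needs, find the closest available
    attribute. Features with no sufficiently similar attribute are
    reported as gaps that need new data or feature remodeling."""
    matched, gaps = {}, []
    for feature in required_features:
        best, score = None, 0.0
        for attr in available_attributes:
            s = SequenceMatcher(None, feature.lower(), attr.lower()).ratio()
            if s > score:
                best, score = attr, s
        if score >= threshold:
            matched[feature] = best
        else:
            gaps.append(feature)
    return matched, gaps

# Illustrative example: a churn model's features vs. prepared-data columns.
matched, gaps = reconcile(
    ["monthly_usage", "customer_tenure", "complaint_count"],
    ["monthly_usage_minutes", "customer_tenure", "billing_amount"],
)
```

Here `matched` pairs each covered feature with its best candidate attribute, while `gaps` (in this example, `complaint_count`) flags features the prepared data cannot supply, which is exactly the early warning the reconciliation step is meant to give.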

These are the scenarios we can try to solve using this step:

Scenario 1 we try to solve: How do we port an existing designed, trained, and working model to a new domain with minimum effort? For example, we have a churn prediction model for telecom that we want to apply to a similar business domain, say a VoIP calling business (Skype, Viber, etc.). The standard method is to start the whole process from the beginning, using previous experience only as background learning.

Instead, we plug the existing model and the data sources available from the new domain into the tool, which tries to identify data similarities and proposes the changes in data that should be explored to make the model fit. The scope can extend to intelligently identifying what transformations may be needed to derive a data point required by the model.
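A toy version of this proposal step could look as follows: for each feature the trained telecom model expects, the tool either finds the column directly in the VoIP domain's schema, finds it via a curated synonym table, or flags it as unmapped (a candidate for a transformation or for remodeling). The synonym table, feature names, and column names are all hypothetical examples.

```python
def propose_mappings(model_features, target_columns, synonyms):
    """Propose a target-domain column for each feature of an existing
    model: a direct name match, a known synonym, or no candidate."""
    proposals = {}
    targets = set(target_columns)
    for feature in model_features:
        if feature in targets:
            proposals[feature] = ("direct", feature)
        else:
            candidates = synonyms.get(feature, set()) & targets
            if candidates:
                proposals[feature] = ("synonym", sorted(candidates)[0])
            else:
                proposals[feature] = ("unmapped", None)
    return proposals

# Hypothetical telecom -> VoIP porting example.
proposals = propose_mappings(
    ["call_minutes", "monthly_bill", "tenure_months", "network_type"],
    ["talk_time", "subscription_fee", "tenure_months"],
    {"call_minutes": {"talk_time", "voip_minutes"},
     "monthly_bill": {"subscription_fee", "invoice_total"}},
)
```

In this sketch, `network_type` comes back unmapped, signaling a telecom-specific feature that has no VoIP counterpart and may need to be dropped or re-derived; a real tool would replace the hand-written synonym table with learned or ontology-based similarity.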

Scenario 2 we try to solve: Let's say a data scientist is building a telecom churn prediction model and starts by exploring data sources such as usage CDRs, CRM logs, billing/payment/charging logs, etc., and then arrives at a model. How does he ensure that his data exploration is complete? That no data source has been overlooked which could help design the model better?

To solve the second scenario, the user may only need to provide some 'desired' data descriptions (in the form of a model, plus enough information to build an ontology from it, such as relationships, attributes, intrinsic types, etc.), and the tool helps discover those data points in the source systems by means of ontology-mapping techniques.
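A crude first approximation of such discovery is to score every candidate source system by how many of the desired attributes it appears to contain, so that sources with partial coverage the data scientist has not yet explored get surfaced. The snippet below uses plain substring overlap as a placeholder for real ontology mapping; the catalog contents and attribute names are invented for illustration.

```python
def rank_sources(desired_attributes, source_catalogs):
    """Score each source system by the fraction of desired attributes
    whose names it appears to contain (substring match stands in for
    genuine semantic/ontology matching). Returns sources ranked by
    coverage, highest first."""
    scores = {}
    for source, columns in source_catalogs.items():
        hits = sum(
            any(d in c or c in d for c in columns)
            for d in desired_attributes
        )
        scores[source] = hits / len(desired_attributes)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical desired attributes vs. catalogs of three source systems.
ranking = rank_sources(
    ["dropped_calls", "payment_delay", "plan_type"],
    {"usage_cdr": ["call_start", "dropped_calls", "cell_id"],
     "billing":   ["invoice_id", "payment_delay_days", "plan_type"],
     "crm":       ["ticket_id", "agent_notes"]},
)
```

If the data scientist had only explored `usage_cdr`, the ranking would point him at `billing`, which covers more of the desired attributes, which is the "no source overlooked" guarantee this scenario is after.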

In a traditional CRISP-DM model, our tool fits between the Data Preparation and Model Design phases (the step 3.X that I mentioned earlier). In Scenario 1 it is closer to the Model Design phase; in Scenario 2 it is closer to the Data Preparation phase.

How might we go about solving these problems?

A reasonable method appears to be semantic matching using ontology-based techniques.

The more common form is a structural representation of data (location, encoding/formats, table name, file name, etc.). An ontology, by contrast, deals with the semantic representation of information: what an information element means rather than how it is stored or represented. There is no single definitive way to visualize an ontology, but the de facto method is a graph whose nodes are classes (types/entities/kinds, etc.) and whose edges are relationships between these classes.

In our scope too, ontologies can be treated as a graph of relationships between data elements. To successfully construct and analyze an ontology, however, we need to structure it as much as possible so that it becomes machine friendly; this is where languages such as RDF and OWL come in. The idea is that RDF/OWL express ontological representations in a standard, normalized way so that meta-information retrieval and analysis become easy. That way we can compare, for example, two information/data sources even if they are structurally dissimilar. This method relies on information-retrieval concepts such as natural-language search, relationship matching, and attribute similarity, and may be the key to solving our problem.

In part-2 of this post, I shall try to elaborate a solution approach and techniques that could be applied to reduce data sufficiency issues.