Why data preparation is crucial in AI workflows

Article by: Majeed Ahmad

Since data flows throughout the entire AI workflow, the initial data preparation step is crucial because it ensures that the most useful data goes into a model.

For design engineers, an artificial intelligence (AI) workflow encompasses four steps: data preparation, modeling, simulation and testing, and deployment. While all steps are important, many engineers overemphasize the modeling stage, presuming that it plays the largest role in producing accurate insights.

However, since data flows throughout the entire AI workflow, the initial data preparation step is crucial. It ensures that the most useful data is entered into a model.

Figure 1 Data is the driving force in the development of an AI workflow. Source: MathWorks

So, what’s data preparation? It’s the first step in building an AI system, and it allows designers to understand how to solve an engineering problem. “If you understand data at the beginning, you’ll understand it at the end,” says David Willingham, deep learning product manager at MathWorks.

“Data is the driving force in the success of an AI model, so work on data first, and the rest will come,” Willingham added. Sharing some customer anecdotes, he emphasized that engineers need to determine whether the given data will actually help solve the problem. Willingham cited the example of an engineer who wanted to use AI to forecast the operational efficiency of a manufacturing plant. When asked about the available data, the engineer said, “I have monthly historical data.”

Data preparation is right at the beginning, and that’s why engineers must spend more time understanding the input data, which will inevitably benefit the output later in the AI workflow. “Don’t spend all your time on tuning a model,” advised Willingham. “If you understand the input data, performing basic analysis in many cases will transform data into something meaningful.”

How much data

How much data do you need? What’s the sweet spot? How much data is too much data? When you are working in a manufacturing plant, for instance, sensors are everywhere. There is a proliferation of data because sensors are now cheap. Data can be overwhelming in such situations.

Here, instead of asking, “Is this enough, or do I need more data?” engineers should really be asking, “Is this the right data?” Moreover, instead of manually sifting through rows and rows of data, engineers can use tools and techniques that automate or semi-automate the search for useful patches within a large dataset, honing the data down to a smaller, more relevant set.
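One way to picture a semi-automated search for useful patches is to screen a long sensor trace window by window and keep only the windows where something is actually happening. The sketch below is an illustration under stated assumptions, not a tool mentioned in the article: the function name `find_active_windows`, the variance criterion, and the sample signal are all hypothetical.

```python
from statistics import pvariance


def find_active_windows(signal, window_size, min_variance):
    """Return (start, end) index pairs of non-overlapping windows whose
    variance exceeds min_variance. Trailing samples that do not fill a
    whole window are ignored."""
    windows = []
    for start in range(0, len(signal) - window_size + 1, window_size):
        chunk = signal[start:start + window_size]
        if pvariance(chunk) > min_variance:
            windows.append((start, start + window_size))
    return windows


# Hypothetical sensor trace: flat, then a short burst of activity, then flat.
signal = [1.0] * 8 + [1.0, 3.0, 0.5, 2.5] + [1.0] * 8

# The threshold is an assumption; in practice an engineer would pick it
# from knowledge of the sensor and the process.
active = find_active_windows(signal, window_size=4, min_variance=0.1)
print(active)
```

Variance is only one possible screening criterion. The point is that a simple, explainable rule can shrink a large dataset to the segments worth a closer look.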

Figure 2 Tools built around app-based workflows allow engineers to explore data and extract and even rank features from the automated data. Source: MathWorks

Another way is to not fully automate the feature engineering process. Sometimes, engineers need to inject insights and semi-automate the feature engineering process. It’s a good technique when you have a lot of data.
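A rough sketch of what semi-automated feature engineering can look like: the engineer injects one derived feature based on domain insight, and a simple score then ranks it alongside the raw signals. Everything here is assumed for illustration, including the feature names, the sample values, and the choice of absolute Pearson correlation as the ranking score; it is not the ranking method of any particular tool.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_features(features, target):
    """Rank features by absolute correlation with the target, best first."""
    scores = {name: abs(pearson(values, target))
              for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw sensor channels.
features = {
    "temperature": [20.0, 21.0, 19.5, 20.5, 21.5, 20.0],
    "pressure": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "flow": [2.0, 1.0, 4.0, 2.0, 5.0, 3.0],
}

# Engineer-injected insight (assumed): the quantity of interest depends on
# the product of pressure and flow, so add that as a candidate feature.
features["pressure_x_flow"] = [
    p * f for p, f in zip(features["pressure"], features["flow"])
]

# Hypothetical target, constructed here so that it follows that product.
target = [2.0 * v for v in features["pressure_x_flow"]]

ranking = rank_features(features, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The automated part is the scoring and sorting; the manual part is deciding which derived features are worth proposing in the first place.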

On the flip side, what happens when you don’t have enough data? It’s another common challenge. Here, it’s worth mentioning that engineers need to understand how they can build a business case for the value of data. It costs money to go out in the field and collect data or to install more sensors to generate it. So, when engineers want more data, they have to link it back to the return on investment (ROI).

Physical vs. synthetic data

Besides getting more physical data and building a business case, how can tooling and software help? One approach is to generate synthetic data that closely matches physical data. It’s a common way for engineers to supplement real data in order to build a useful AI model.

There are different ways to generate synthetic data. One of them is to use a realistic digital twin to produce the data needed to build an AI model.

Within the tooling provided by MathWorks, digital twins are commonly created using model-based design (MBD), in which engineers model all the components of a physical system. For example, in an autonomous vehicle, that covers the engine, transmission, automatic cruise control, etc. Next, by employing model-based design and creating digital twins, engineers can feed in synthetic simulation data and see whether an AI model can be built from it.
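To make the idea of simulation-generated training data concrete, here is a deliberately tiny stand-in for a digital twin: a first-order thermal model that is run under a healthy and a degraded setting to produce labeled, noisy sensor traces. This is a toy sketch, not MathWorks tooling or a real model-based design workflow; the function names, parameters, and fault scenario are all assumptions.

```python
import random


def simulate_temperature(steps, heat_input, cooling_rate,
                         ambient=25.0, noise_std=0.2, seed=0):
    """Simulate a noisy temperature trace from a first-order thermal model.

    Each step applies: temp += heat_input - cooling_rate * (temp - ambient),
    which is an explicit Euler update with a time step of 1.
    """
    rng = random.Random(seed)  # seeded so every run is reproducible
    temp = ambient
    readings = []
    for _ in range(steps):
        temp += heat_input - cooling_rate * (temp - ambient)
        readings.append(temp + rng.gauss(0.0, noise_std))  # sensor noise
    return readings


def make_dataset(runs_per_class, steps=50):
    """Build a labeled synthetic dataset covering two assumed scenarios."""
    dataset = []
    for i in range(runs_per_class):
        healthy = simulate_temperature(steps, heat_input=2.0,
                                       cooling_rate=0.2, seed=i)
        dataset.append((healthy, "healthy"))
        # Hypothetical fault: cooling is much less effective.
        degraded = simulate_temperature(steps, heat_input=2.0,
                                        cooling_rate=0.05, seed=1000 + i)
        dataset.append((degraded, "degraded_cooling"))
    return dataset


dataset = make_dataset(runs_per_class=5)
print(len(dataset), "labeled traces")
```

A real digital twin would model far more components and be validated against field measurements, but the workflow has the same shape: vary the simulated conditions, record the outputs, and label each run with the scenario that produced it.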

Figure 3 Companies like Atlas Copco use digital twins to get the data for the predictive maintenance model. Source: MathWorks

Model-based design also helps in the later testing stage: engineers can take a model trained on synthetic data, put it back into the system it was designed for, and test it. Take the case of Atlas Copco, which builds compressors for manufacturing plants all over the world.

The company has employed digital twins to get the data for its predictive maintenance models and then built simulation models for its pump equipment to create the necessary data representing all field scenarios. UT Austin is another case study; it uses data pre-processing features to automatically transform brain signals into images that can be used in deep-learning models.

These case studies show that best practices and tools can support engineers in preparing data before putting it into an AI model. Ultimately, the dataset fed into an AI model shapes how the model learns, analyzes, and arrives at a decision.

This article was originally published on EDN.

Majeed Ahmad, Editor-in-Chief of EDN and Planet Analog, has covered the electronics design industry for more than two decades.
