Data preparation for AI: Why is human intervention still necessary?


AI and machine learning (ML) projects rely heavily on data. Data is critical for training, testing, validating, and supporting the ML algorithms at the heart of AI systems, even more so than application code. The availability of nearly limitless cloud computing, big data to train machine learning models, and the evolution of deep learning algorithms have all contributed to AI’s resurgence in popularity. The latter two factors are data-dependent: the more data you feed AI algorithms, the better they perform and the more significant the machine learning results.

However, having a large amount of data isn’t enough. AI systems cannot function without high-quality data. Many machine learning project failures are caused not by the algorithms, the code, or even the chosen technology vendor, but by data quality. The data must be clean, accurate, complete, and well-labeled for machine learning models to be properly trained and to deliver the expected accurate results. Data preparation is therefore a crucial step in machine learning.

As a result, most of the time spent on AI projects goes to data collection, cleaning, preparation, and labeling. Enterprises are discovering that these data preparation steps require more investment than data science, model training, and operationalization, which has driven a significant increase in demand for data preparation and labeling tools and services.

Steps in data preparation

According to a recent report by AI research and advisory firm Cognilytica, over 80% of the time spent on AI projects goes to data preparation, cleaning, and labeling. The report found that the steps involved in collecting, aggregating, filtering, cleaning, deduplicating, selecting, enhancing, and labeling data far outnumber the steps involved in data science, model building, and deployment.

A new set of data preparation tools has emerged on the market, designed to manage large data sets and optimized for machine learning projects. The market for AI-based data preparation tools is currently valued at over $500 million, according to the report, and is expected to more than double to $1.2 billion by 2023.

Most enterprise data isn’t ready for machine learning applications, and it takes a lot of work to get it ready. Data preparation tools must be able to:

- standardize formats across different data sources
- remove or replace invalid and duplicate data
- confirm that data is accurate and up to date
- enhance and augment data as needed
- reduce data noise
- anonymize data
- normalize data
- support proper data sampling, especially when working with large volumes of data
- allow for feature engineering and exclusion
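To make a few of these tasks concrete, here is a minimal sketch in plain Python of deduplication, date-format normalization, and dropping invalid rows. The record fields and date formats are hypothetical, and a real pipeline would use dedicated tooling rather than hand-rolled code:

```python
from datetime import datetime

# Hypothetical raw records pulled from two sources with inconsistent formats.
raw_records = [
    {"name": "Alice Smith", "signup": "2021-03-01", "country": "US"},
    {"name": "alice smith", "signup": "03/01/2021", "country": "us"},  # duplicate, other format
    {"name": "Bob Jones",   "signup": "2021-04-15", "country": "UK"},
    {"name": "Carol Lee",   "signup": "not a date", "country": "UK"},  # invalid date
]

def normalize_date(value):
    """Standardize dates from the two known formats to ISO 8601; None if invalid."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

def prepare(records):
    cleaned, seen = [], set()
    for rec in records:
        date = normalize_date(rec["signup"])
        if date is None:                   # drop rows with invalid dates
            continue
        key = (rec["name"].lower(), date)  # dedupe on normalized fields
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": rec["name"].title(),
                        "signup": date,
                        "country": rec["country"].upper()})
    return cleaned

clean = prepare(raw_records)
# clean now holds two unique, valid, consistently formatted rows
```

The point of the sketch is that every step is mechanical but source-specific: someone has to know which date formats occur and which fields define a duplicate.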

According to Cognilytica’s report, the AI-relevant data preparation tools provide iterative and interactive ways for people to quickly view the impact of data preparation on data. The key features of these tools include the ability to quickly spot data anomalies, identify and remove duplicates, resolve data conflicts, normalize data formats, create pipelines for extracting and collating data from different sources, enhance data with additional features required for models, and anonymize data as needed for specific applications.

Extract, transform, and load (ETL) tools were once used by businesses to move data in and out of data warehouses to facilitate reporting, analytics, business intelligence, and other operations. Moving data in and out of warehouses with ETL is becoming less popular in the new cloud-based, big data-oriented environment. Instead, companies are attempting to work with data wherever it currently resides; some refer to this as “drinking from the data lake.” Rather than ETL, companies are looking for tools that can pull data from a data source on demand and transform it after it has been extracted and loaded, a pattern that is more ELT than ETL. Many data preparation tools on the market, such as Melissa Data, Trifacta, and Paxata, work on the assumption that data is scattered across the organization in various formats.
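The ELT pattern can be sketched in a few lines: data lands in the “lake” raw, in whatever shape the source emits, and transformation is deferred until the data is actually read. The record shapes here are hypothetical:

```python
import json

# Hypothetical raw "lake": data is loaded as-is, untransformed.
data_lake = [
    json.dumps({"order_id": 1, "amount": "19.99", "currency": "usd"}),
    json.dumps({"order_id": 2, "amount": "5.00",  "currency": "eur"}),
]

def transform_on_read(raw_line):
    """ELT: transformation happens after extract and load, at the point of use."""
    rec = json.loads(raw_line)
    rec["amount"] = float(rec["amount"])      # parse numbers only when needed
    rec["currency"] = rec["currency"].upper() # normalize codes only when needed
    return rec

orders = [transform_on_read(line) for line in data_lake]
```

The design trade-off is that the lake stays cheap and schema-free, while each consumer pays the transformation cost at read time.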

Why do you need human touch in data preparation?

Supervised machine learning algorithms must be trained on data that has been labeled with whatever information the model requires. Image recognition models, for example, must be trained on precise, well-labeled data that accurately represents what the system will recognize. If you want to identify cats, you’ll need a lot of labeled images of cats to build a cat-recognition model.

It may surprise some, especially those who don’t work with machine learning models daily, how much of this data labeling work is done by humans. The majority of AI projects are supervised machine learning projects, and the most common data labeling workloads involve object and image recognition, audio analysis, autonomous vehicles, and text and image annotation. Human-powered data labeling is, in fact, an essential component of any machine learning model that must be trained on data that does not come pre-labeled. It is one of AI’s little secrets: humans are still required to manually label data and perform quality control.
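A common way this human labeling is quality-controlled is to have several annotators label each item and consolidate their answers by majority vote, flagging disagreements for review. The sketch below assumes three hypothetical annotators per image:

```python
from collections import Counter

# Hypothetical annotations: three human labelers per image.
annotations = {
    "img_001.jpg": ["cat", "cat", "dog"],
    "img_002.jpg": ["cat", "cat", "cat"],
    "img_003.jpg": ["dog", "cat", "bird"],  # no agreement: flag for review
}

def consolidate(labels, min_agreement=2):
    """Majority vote; return (label, needs_review) for quality control."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count < min_agreement

results = {img: consolidate(labels) for img, labels in annotations.items()}
```

Items flagged as needing review go back to a human adjudicator, which is exactly the labor cost the article describes.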

Many companies rely on internal labor or hire general labor for this labeling work. Companies spent over $750 million on internal labeling efforts in 2018, and this number is expected to reach over $2 billion by 2023.

In recent years, a new class of vendors has emerged to provide third-party labeling. Figure Eight, iMerit, and CloudFactory, for example, offer dedicated data labeling labor pools, offloading this work to remote workers who can operate at higher scale and lower cost. According to the report, the market for third-party data labeling services was worth $150 million in 2018 and is expected to reach $1 billion by 2023.

Even when companies use third-party data labeling services, they must spend twice as much on support for these efforts as on the actual labeling labor. This part of a machine learning project is so expensive because there is no way to completely remove the human from the equation, and this is where AI encounters a chicken-and-egg dilemma. To train machine learning algorithms, you need a lot of clean, accurate, well-labeled data, but getting that data requires humans to clean and manually label it. If machines could do this work, you wouldn’t need humans; yet humans are required before machines become capable of doing it.

Role of AI in data preparation

Fortunately, as AI models become more intelligent and better trained, they can assist with some of these data preparation activities. The report points out that most tools on the market are incorporating AI into their systems to help with data preparation, automate repetitive tasks, and assist humans with preparation work. Data prep and data labeling companies are also increasingly using machine learning to provide autonomous quality control and autonomous labeling.

Some of these businesses employ artificial intelligence to help detect anomalies, patterns, matches, and other aspects of data cleansing. Others use inference to identify data types and flag values that don’t match a column’s structure, which helps surface potential data quality or formatting issues and suggests how to clean the data. According to the report, by 2021 all of the market’s leading data prep tools will have AI at their core.
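The type-inference idea can be illustrated with a toy example: infer a column’s dominant type by majority, then flag the values that don’t fit. This is a simplified sketch of the general technique, not how any particular vendor’s tool works:

```python
def kind(value):
    """Crude type check: can the value be parsed as a number?"""
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"

def infer_column_type(values):
    """Infer the column's dominant type by majority vote over its values."""
    kinds = [kind(v) for v in values]
    return max(set(kinds), key=kinds.count)

def flag_mismatches(values):
    """Return the inferred type and the indices of values that don't fit it."""
    expected = infer_column_type(values)
    return expected, [i for i, v in enumerate(values) if kind(v) != expected]

# A hypothetical column where one value breaks the pattern.
column = ["12.5", "7.0", "N/A", "3.3"]
expected, bad_rows = flag_mismatches(column)
# expected is "number"; row 2 ("N/A") is flagged for cleaning
```

Production tools apply the same principle with far richer type systems (dates, currencies, identifiers) and learned, rather than hard-coded, parsers.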

Similarly, data labeling efforts are increasingly being augmented by AI and machine learning capabilities, according to Cognilytica’s report. Pre-trained models, transfer learning, and AI-enhanced labeling tools will reduce the amount of human labor required to create new models. As a result, AI efforts will accelerate, and efficiency on the more human-intensive side of AI will increase.

Because data is at the heart of AI and machine learning, companies will increasingly need good, clean, well-labeled data. Pre-trained neural networks will be available to organizations in the not-too-distant future; until then, businesses must invest in software that prepares data for machine learning.