Data preparation is the process of manipulating and organizing raw data before analysis. It is typically iterative, gradually turning raw, unstructured, messy data into a more structured and useful form. The preparation process consists of several major activities (or tasks), including data profiling, cleansing, integration, and transformation.
In most cases, the preparation process consists of dozens of transformations that must be repeated several times. Despite technological advances for working with data, each of those transformations may involve considerable handcrafted work and can consume a significant amount of time and effort.
Moreover, data often have syntactic and semantic issues that can be addressed only through careful automated or manual preparation and cleaning steps. However, current technology is still far from enabling a fully automatic transformation of data from its raw form to a shape and quality readily consumed by downstream applications.
Commercial (and academic) tools provide good user support and tooling for various preparation needs. Nevertheless, data preparation remains a largely manual task to be performed by data experts or by domain experts with data engineering skills. This post explores some of the most prominent challenges associated with data preparation tools.
1. Dataset pre-processing
Interestingly, most current data preparation tools require a pre-prepared or cleaned dataset as their input. Without such input, the tools are likely to misinterpret the data and load it improperly. Most tools also make the following broad assumptions:
- Single table file (no multi-table files)
- Specific file encoding
- No preambles, comments, footnotes, etc.
- No intermediate headers
- Specific line-ending symbol
- Homogeneous delimiters
- Homogeneous escape symbols
- Same number of fields per row
Some assumptions pose interesting problems, which need to be addressed separately, such as detecting tables in complex spreadsheets or converting HTML tables to relations.
2. User expertise
Another challenge in data preparation is that tool usability demands a combination of domain knowledge and IT knowledge. Most tools require the user to be an expert in the dataset domain and to have prior knowledge and understanding of the datasets and the data preparation goal. Moreover, beyond simple predicates, most tools rely on regular expressions to match, split, or delete data, and a typical domain expert cannot be expected to formulate the often intricate regular expressions required.
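As a hedged illustration of how intricate such expressions get, consider a single column that mixes three date notations (the sample values are hypothetical). Matching all three already requires an alternation with named groups per layout:

```python
import re

# Hypothetical raw values mixing several date notations in one column.
values = ["2021-03-05", "05/03/2021", "5 Mar 2021"]

# The kind of expression a tool may expect the user to write:
# three alternative date layouts, anchored, with named groups.
pattern = re.compile(
    r"^(?:(?P<y1>\d{4})-(?P<m1>\d{2})-(?P<d1>\d{2})"          # 2021-03-05
    r"|(?P<d2>\d{1,2})/(?P<m2>\d{1,2})/(?P<y2>\d{4})"         # 05/03/2021
    r"|(?P<d3>\d{1,2}) (?P<mon>[A-Za-z]{3}) (?P<y3>\d{4}))$"  # 5 Mar 2021
)

for v in values:
    print(v, "->", "matched" if pattern.match(v) else "no match")
```

Writing, debugging, and maintaining such patterns is routine for an engineer but a real barrier for a domain expert.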
3. Lack of intelligent solutions
Most tools offer useful data preparation functions. However, most tools and preparators lack intelligent solutions for more automated data preparation tasks. For example, deduplication removes duplicate records from a source; most tools deduplicate only on exact-match conditions, whereas a more sophisticated version would deduplicate based on similarity measures. Another problem for many tools is column heterogeneity, i.e., columns that contain data in multiple formats. Users need to manually filter those different groups and prepare them separately. Automatic homogenization would be helpful but may also pose a challenge.
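A minimal sketch of what similarity-based deduplication could look like, using only the standard library's `difflib.SequenceMatcher` (the records and the 0.65 threshold are hypothetical; production systems would use proper record-linkage techniques):

```python
from difflib import SequenceMatcher

# Hypothetical records: three spellings of the same company.
records = ["ACME Corp.", "ACME Corporation", "Globex Inc", "ACME Corp"]

def similar(a: str, b: str, threshold: float = 0.65) -> bool:
    """Treat two strings as duplicates if their similarity ratio
    (0.0–1.0) meets the threshold, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

deduped: list[str] = []
for rec in records:
    # Keep a record only if it is not similar to any record kept so far.
    if not any(similar(rec, kept) for kept in deduped):
        deduped.append(rec)

print(deduped)  # → ['ACME Corp.', 'Globex Inc']
```

An exact-match deduplicator would have kept all four records; the similarity version collapses the three "ACME" variants. The threshold itself is a tuning problem, which is part of why such features remain rare.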
4. Preparation pipelining
Data preparation is not a one-step process. Rather, it involves many subsequent steps organized in a preparation pipeline to gradually transform a dataset toward the desired output. Creating and managing pipelines yields many system-level challenges, such as preparation suggestions, pipeline adaptation, and pipeline optimization, which must be addressed accordingly.
5. Data source diversity
Data preparation becomes an even bigger issue when data is collected from various sources. Data from many sources is riddled with variations, such as subtleties, abbreviations, and misspellings in human language. In addition, each data type needs to be linked somehow with the other types. All of this makes preparing the data a more complex task.
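A toy sketch of homogenizing variant spellings across sources with a hand-built synonym map (the map and sample values are hypothetical; real systems would need fuzzy matching on top of this):

```python
# Hypothetical synonym map for street-address tokens from different sources.
SYNONYMS = {
    "st.": "street",
    "st": "street",
    "ave": "avenue",
    "ave.": "avenue",
}

def normalize_token(token: str) -> str:
    """Lowercase a token and replace it with its canonical form, if any."""
    return SYNONYMS.get(token.lower(), token.lower())

def normalize(value: str) -> str:
    """Normalize every whitespace-separated token in a value."""
    return " ".join(normalize_token(t) for t in value.split())

print(normalize("Main St."))     # → 'main street'
print(normalize("main STREET"))  # → 'main street'
```

The map only covers known variants; misspellings like "Stret" slip through, which is why multi-source integration remains hard even with such normalization in place.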
6. Data quality
Another significant barrier is insufficient data quality. Poor data quality has far-reaching negative consequences, including less effective decision-making, reduced ability to make and execute strategy, customer dissatisfaction, lower performance, and increased operational cost. Poor-quality data constitutes a significant cost factor for many companies, since time and other resources are spent detecting and correcting errors.
7. Integration with BI and analytics tools
Another common barrier is that data preparation is poorly integrated with BI and analytics tools. Users of self-service BI and visual analytics tools are often frustrated as they try to move beyond exploratory projects and deepen their interaction with more data. Performance sometimes slows down because the activity exceeds the tool's technical limits for manipulating, filtering, blending, and enriching the data. The problems can also stem from the organization's data preparation processes, which may not be set up to handle numerous ad hoc questions.