Raw data accumulated from various sources, in the form of logs, sensor output, government records, medical research data, climate data, geospatial data, and so on, are often messy. They may lack a standardized format or structure and serve no specific target use case or pattern. They may even contain invalid characters, use inconsistent encodings, lack necessary columns, include unwanted rows, or have missing values. Such data rarely fit data analytics tools or data management systems directly.
Hence, we need a data preparation process: a set of preprocessing operations performed in the early stages of a data processing pipeline, i.e., data transformations at the structural and syntactic levels.
Data preparation transforms unstructured, frequently messy, raw data into a more useful, structured form ready for further analysis. Major activities (or tasks) that make up the preparation process include data profiling, cleaning, integration, and transformation.
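As a small illustration, the following sketch performs such structural and syntactic transformations on messy raw input: decoding inconsistently encoded bytes and padding out missing fields so every record has the same shape. The field layout, encodings, and field names here are illustrative assumptions, not part of any standard or tool.

```python
# Minimal sketch of structural/syntactic preparation (assumed format:
# semicolon-separated lines with timestamp, sensor, value fields).

def decode_line(raw: bytes) -> str:
    """Decode bytes, falling back when the input is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never fails
        return raw.decode("latin-1")

def parse_record(line: str) -> dict:
    """Split a line into fields, representing missing fields as None."""
    parts = [p.strip() or None for p in line.split(";")]
    parts += [None] * (3 - len(parts))   # pad short rows to expected width
    timestamp, sensor, value = parts[:3]
    return {"timestamp": timestamp, "sensor": sensor, "value": value}

raw_lines = [
    b"2024-01-05;temp-1;21.3",
    b"2024-01-05;temp-2;",     # missing value
    b"2024-01-06",             # missing columns
    b"caf\xe9;temp-3;20.0",    # not valid UTF-8
]
records = [parse_record(decode_line(b)) for b in raw_lines]
```

A real pipeline would add schema validation and type conversion on top of this, but the pattern is the same: normalize structure first, content later.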
Data preparation is integral to advanced data analysis and management in data science and any data-driven application. With the ever-increasing amount of raw data, the need for data preparation has become more apparent. Preparing data yields many advantages, such as prompt error detection, improved analytics, improved data quality, enhanced scalability, accelerated data usage, and easier data collaboration.
To understand data processing in the data-to-application life-cycle, it is important to identify the phases required to produce valid data for the consuming application. Data are typically created in raw format, possibly to be stored in data lakes. Before these raw data are sent to applications, it is crucial to enhance their structure and, if needed, their content.
Several steps are typically performed, in various orders and iterations, to make data readable and machine-understandable, such as data exploration, data collection, data profiling, data preparation, data integration, and data cleaning. These steps are applied to the original raw data before they are sent to the main application for further processing.
Data scientists spend approximately 80% of their time preparing data and only about 20% on actual model implementation and deployment. However, sophisticated data preparation techniques can reduce the time spent on preparation significantly, giving data scientists more time for model implementation and deployment.
Data preparation steps
In most cases, the preparation process consists of dozens of transformations that must be repeated several times. Despite technological advances in working with data, each of those transformations may involve much handcrafted work and can consume a significant amount of time and effort. Thus, working with huge and diverse data remains a challenge. It is widely agreed that data wrangling/preparation is the most tedious and time-consuming aspect of data analysis. It has become a bottleneck, or “iceberg”, for performing advanced data analysis, particularly on big data.
As we discussed, data preparation is not a single-step process. Rather, it usually comprises many individual preparation steps:
- Data discovery involves analyzing and collecting data from different sources, for instance, to match patterns, find missing data, and locate outliers.
- Data validation comprises rules and constraints to inspect the data, for instance, for correctness, completeness, and other data quality constraints.
- Data structuring encompasses tasks for creating, representing, and structuring information. Examples include updating schemas, detecting and changing encodings, and transforming data by example.
- Data enrichment adds value or supplementary information to existing data from separate sources. Typically, it involves augmenting existing data with new or derived data values using data lookups, primary key generation, and inserting metadata.
- Data filtering generates a subset of the data under consideration, facilitating manual inspection and removing irregular data rows or values. Examples include extracting text parts and keeping or deleting filtered rows.
- Data cleaning refers to the removal, addition, or replacement of less accurate or inaccurate data values with more suitable, accurate, or representative values. Typical examples are deduplication, filling in missing values, and removing whitespace.
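Taken together, the steps above can be sketched as small functions applied to a list of row dictionaries. This is a minimal illustration under assumed field names and validation rules, not the interface of any particular tool:

```python
# Illustrative sample with the kinds of problems each step addresses.
rows = [
    {"id": "1", "city": " berlin ", "temp": "21.3"},
    {"id": "1", "city": " berlin ", "temp": "21.3"},   # duplicate row
    {"id": "2", "city": "PARIS",    "temp": ""},       # missing value
    {"id": "x", "city": "Rome",     "temp": "999"},    # invalid id, outlier
]

def validate(row):
    """Data validation: enforce a correctness constraint (numeric id)."""
    return row["id"].isdigit()

def structure(row):
    """Data structuring: normalize types and representations."""
    temp = float(row["temp"]) if row["temp"] else None
    return {"id": int(row["id"]), "city": row["city"], "temp": temp}

def enrich(row):
    """Data enrichment: add a derived value via a lookup table."""
    countries = {"berlin": "DE", "paris": "FR", "rome": "IT"}
    row["country"] = countries.get(row["city"].strip().lower())
    return row

def keep(row):
    """Data filtering: drop irregular rows (implausible readings)."""
    return row["temp"] is None or -60 <= row["temp"] <= 60

def clean(row):
    """Data cleaning: trim whitespace and fill missing values."""
    row["city"] = row["city"].strip().title()
    if row["temp"] is None:
        row["temp"] = 0.0   # placeholder imputation
    return row

prepared, seen = [], set()
for row in rows:
    if not validate(row):
        continue
    row = enrich(structure(row))
    if not keep(row):
        continue
    row = clean(row)
    if row["id"] not in seen:     # cleaning: deduplicate on primary key
        seen.add(row["id"])
        prepared.append(row)
```

In practice the steps rarely run in a fixed order; profiling the data often sends you back to an earlier step, which is why the text describes "various orders and iterations".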
Data preparation tools are vital to any data preparation process. They usually provide implementations of various preparators and a front end to sequentially apply preparations or to specify data preparation pipelines. These tools’ flexibility, robustness, and intelligence contribute significantly to data analysis and management tasks.
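To make the idea of sequentially applied preparators concrete, here is a hedged sketch of such a front end: each preparator is a function from dataset to dataset, and the pipeline applies them in the order specified. The `Pipeline` class and the two example preparators are hypothetical, not the API of any real product.

```python
from typing import Callable, List

Dataset = List[dict]
Preparator = Callable[[Dataset], Dataset]

class Pipeline:
    """Applies registered preparators to a dataset, one after another."""

    def __init__(self) -> None:
        self.steps: List[Preparator] = []

    def add(self, step: Preparator) -> "Pipeline":
        self.steps.append(step)
        return self                      # return self for fluent chaining

    def run(self, data: Dataset) -> Dataset:
        for step in self.steps:
            data = step(data)            # each step consumes the previous output
        return data

# Two example preparators (hypothetical):
def drop_empty(data: Dataset) -> Dataset:
    """Remove rows that contain any empty or missing value."""
    return [r for r in data if all(v not in (None, "") for v in r.values())]

def lowercase_keys(data: Dataset) -> Dataset:
    """Normalize column names to lowercase."""
    return [{k.lower(): v for k, v in r.items()} for r in data]

result = Pipeline().add(drop_empty).add(lowercase_keys).run(
    [{"Name": "Ada", "Role": "eng"}, {"Name": "", "Role": "ops"}]
)
```

Modeling each preparator as a pure dataset-to-dataset function is what lets tools reorder, preview, and replay preparation pipelines.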