Data preparation is expensive and time-consuming, especially without automated and mature data preparation tools. Traditionally, data scientists write specific preparation scripts to accomplish project-specific goals.
Recently, the market has answered some of the general needs of data preparation with commercial tools that reduce the burden on data scientists.
These data preparation tools are vital to any data preparation process and usually provide implementations of various preparators and a frontend to sequentially apply preparations or specify data preparation pipelines.
These tools’ flexibility, robustness, and intelligence contribute significantly to data analysis and management tasks. To better understand data preparation tools and their capabilities, we have shortlisted the top 7 data preparation tools.
1. Altair Monarch Data Preparation
Altair Monarch Data Preparation, called Datawatch until the company’s merger with Altair, provides common data preparators for structured data and can also extract tables embedded in PDF and text files into tabular data. Tables produced by Altair’s table extractor feature can be used on their own, or they can be merged with other tables or files using a variety of join and union operations.
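To make the join and union operations mentioned above concrete, here is a minimal sketch in plain Python. It is not Altair Monarch's API; the tables, column names, and the helper functions `inner_join` and `union` are hypothetical and only illustrate what such operations do to extracted tables.

```python
# Hypothetical tables: one as it might look after extraction from a PDF report,
# one loaded from another file. Column names are illustrative assumptions.
extracted = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 2, "total": 75.5},
]
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 3, "name": "Grace"},
]


def inner_join(left, right, key):
    """Combine rows from left and right that share the same value for key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def union(first, second):
    """Stack two tables with the same columns, dropping exact duplicate rows."""
    seen, result = set(), []
    for row in first + second:
        signature = tuple(sorted(row.items()))
        if signature not in seen:
            seen.add(signature)
            result.append(row)
    return result


joined = inner_join(extracted, customers, "customer_id")
more_rows = [{"customer_id": 2, "total": 75.5}, {"customer_id": 4, "total": 10.0}]
combined = union(extracted, more_rows)
```

Only customer 1 appears in both tables, so the join yields a single enriched row, while the union keeps three distinct rows because the duplicate for customer 2 is dropped.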
2. Paxata Self-Service Data Preparation
Paxata Self-Service Data Preparation offers many features to organize and prepare structured data and deals efficiently with semi-structured data. In addition to common data preparation features, Paxata offers so-called data filtergrams, which allow various visual interactions to perform filter operations on data, such as text filtergrams, numeric filtergrams, Boolean filtergrams, and source filtergrams. The tool emphasizes user experience and is designed to support non-experts.
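The filter operations behind such visual controls can be thought of as per-column predicates that are combined. The sketch below is an assumption about the general idea, not Paxata's implementation; the data and the function names `text_filter`, `numeric_filter`, `boolean_filter`, and `apply_filters` are hypothetical.

```python
rows = [
    {"city": "Berlin", "revenue": 1200, "active": True},
    {"city": "Bern", "revenue": 300, "active": False},
    {"city": "Paris", "revenue": 900, "active": True},
]


def text_filter(column, prefix):
    """Keep rows whose text value starts with prefix (case-insensitive)."""
    return lambda row: str(row[column]).lower().startswith(prefix.lower())


def numeric_filter(column, low, high):
    """Keep rows whose numeric value lies in the closed range [low, high]."""
    return lambda row: low <= row[column] <= high


def boolean_filter(column, expected):
    """Keep rows whose Boolean value equals expected."""
    return lambda row: row[column] is expected


def apply_filters(table, *predicates):
    """Return the rows that satisfy every predicate."""
    return [row for row in table if all(p(row) for p in predicates)]


selected = apply_filters(
    rows,
    text_filter("city", "ber"),
    numeric_filter("revenue", 500, 2000),
    boolean_filter("active", True),
)
```

Here only the Berlin row passes all three filters; a visual tool would let the user set the same conditions by clicking on a chart instead of writing predicates.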
3. SAP Agile Data Preparation
SAP Agile Data Preparation runs on top of SAP’s HANA database system. It offers many common data preparators with specific system features, such as Schedule Snapshot, which allows the user to take periodic snapshots and retrieve data from a remote source on demand. It offers interactive suggestions to help users navigate and prepare data efficiently. Multi-user access allows the preparation of data in collaboration.
4. SAS Data Preparation
SAS Data Preparation is part of SAS Viya System Management, which runs its operations with distributed in-memory processing. In addition to common features, SAS offers code-based transformations for users to write and share custom code to transform data, supporting the re-usability of preparation pipelines.
5. Tableau Prep
Tableau Prep implements a workflow approach to organize and prepare messy data. With its interactive interface and workspace plans, users have the freedom to perform multiple operations simultaneously. Tableau Prep comprises two parts: Tableau Prep Builder, which is used to develop so-called flows, manage data, and apply operations on data, and Tableau Prep Conductor, which is used to share, schedule, and monitor the flows.
6. Talend Data Preparation
Talend Data Preparation offers many specific data preparation functionalities tailored to the task. For instance, for data cleaning, different functions exist for cleaning numeric data values, strings, and date inputs. One of its main features is the “selective sampling” of data for insights and operations that can later be deployed on the entire dataset. Talend also addresses system-level challenges: for example, one of its intelligent system features is pipeline automation, which saves and reuses data preparation tasks or steps.
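The combination of type-specific cleaning functions, working on a sample, and reusing saved steps can be sketched as follows. This is an illustration under stated assumptions, not Talend's API: the `recipe` structure, the cleaning functions, the `DD/MM/YYYY` input date format, and the sample data are all hypothetical.

```python
import random
from datetime import datetime


def clean_string(value):
    """Collapse repeated whitespace and normalize capitalization."""
    return " ".join(value.split()).title()


def clean_number(value):
    """Remove thousands separators and surrounding spaces, then parse."""
    return float(value.replace(",", "").strip())


def clean_date(value):
    """Parse an assumed DD/MM/YYYY date and return it in ISO format."""
    return datetime.strptime(value.strip(), "%d/%m/%Y").date().isoformat()


# A saved "recipe": an ordered list of (column, function) steps to reuse.
recipe = [("name", clean_string), ("amount", clean_number), ("joined", clean_date)]


def run_recipe(table, steps):
    """Apply every step to a copy of each row and return the cleaned table."""
    cleaned = []
    for row in table:
        new_row = dict(row)
        for column, func in steps:
            new_row[column] = func(new_row[column])
        cleaned.append(new_row)
    return cleaned


dataset = [
    {"name": "  ada   lovelace ", "amount": "1,200.50", "joined": "10/12/2020"},
    {"name": "grace HOPPER", "amount": " 300 ", "joined": " 01/02/2021"},
    {"name": "alan  turing", "amount": "42", "joined": "23/06/2019 "},
]

random.seed(0)
sample = random.sample(dataset, 2)     # develop and inspect the recipe on a sample
preview = run_recipe(sample, recipe)
full = run_recipe(dataset, recipe)     # then deploy the same steps on all rows
```

Keeping the steps as data rather than as one-off code is what makes the pipeline reusable: the same `recipe` runs unchanged on the sample and on the full dataset.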
7. Trifacta Wrangler
Trifacta Wrangler uses multiple data preparation functions and intelligently predicts patterns to provide suggestions that help users transform data. Apart from common preparation tasks, it offers additional features, such as primary key generation, transforming data by example, and permitted character checks. Wrangler uses regular expressions for most of its pattern-based features. What sets Wrangler's preparators apart is their degree of sophistication. For example, its outlier-location preparator identifies the outliers and plots a histogram of the entire column. The tool was spun out of the Wrangler project.
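Three of the features named above (permitted character checks, primary key generation, and outlier location with a histogram) can be sketched in plain Python. This is not Trifacta's implementation: the function names, the default regular expression, and the choice of the interquartile-range rule for outliers are assumptions made for illustration.

```python
import re
import statistics


def permitted_characters(values, pattern=r"[A-Za-z0-9_-]+"):
    """Return the values containing characters outside the permitted set."""
    allowed = re.compile(pattern)
    return [v for v in values if not allowed.fullmatch(v)]


def add_primary_key(rows, key="id"):
    """Prepend a sequential integer key to every row."""
    return [{key: i, **row} for i, row in enumerate(rows, start=1)]


def locate_outliers(numbers, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(numbers, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [x for x in numbers if x < low or x > high]


def histogram(numbers, bins=5):
    """Count the values of the whole column in equal-width bins."""
    low, high = min(numbers), max(numbers)
    width = (high - low) / bins or 1
    counts = [0] * bins
    for x in numbers:
        index = min(int((x - low) / width), bins - 1)
        counts[index] += 1
    return counts


violations = permitted_characters(["order_01", "order 02", "order#03"])
keyed = add_primary_key([{"name": "a"}, {"name": "b"}])
column = [10, 12, 11, 13, 12, 11, 95]
outliers = locate_outliers(column)
counts = histogram(column)
```

With this data, the two order codes containing a space or `#` are reported as violations, the value 95 is flagged as an outlier, and the histogram shows six values in the first bin and one in the last.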