In our earlier blogs "Data Cleaning 101" and "Emerging Data Cleaning Solutions," we discussed the foundational elements of dirty data and the solutions organizations can implement to automate data cleaning. However, as those posts made evident, only a handful of successful service providers on the market offer such automated solutions, and they largely target enterprise data. In this blog, we therefore discuss the adjacent steps in data preparation where organizations can add automation to supplement their data cleaning initiatives.
Although manually correcting data errors is indeed the most time-consuming aspect of a data science workflow, the overall data preparation process involves several steps: data quality assessment (before cleaning), data cleaning itself, and data validation (after cleaning). For each of these steps, the solutions below offer an added level of automation that can reduce the time and labor involved:
Data quality assessment involves understanding the volume and relevance of the six error types (highlighted in our first blog) in a dataset, which in turn informs the data cleaning pipeline. Data quality automation has roots in the finance industry, where anti-money laundering regulations created the need to monitor and test financial transaction data. One of the pioneers of this approach was InCube, founded in 2008 and acquired by Finantix in March 2020. Its underlying technology combines unsupervised machine learning with human intervention to distinguish bad transaction data from good. The approach, however, does not identify the specific error types or their volume within the bad data. In recent years, startups like Pandata Tech and Toro Data, along with Amazon's open-source Deequ library, have brought this approach to industrial and enterprise datasets as well. Pandata, which began as a predictive maintenance startup, offers quality assessment software for sensor readings that is deployed at the edge for oil and gas applications. The software uses SCADA metadata to build a machine learning model of how the readings should appear and checks the sensor feed for incomplete and incorrect data. The solution is still early-stage, however, with Pandata piloting it for an offshore drilling company in 2019.
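To make the quality assessment step concrete, here is a minimal sketch of profiling a dataset for three of the error types before cleaning begins. The function, field names, and plausibility ranges are all hypothetical illustrations, not any vendor's actual API:

```python
def assess_quality(records, schema):
    """Profile a dataset for basic error types before cleaning.

    `records` is a list of dicts; `schema` maps each field to a
    (min, max) plausibility range, or None if no range check applies.
    Returns counts per error type to inform the cleaning pipeline.
    """
    report = {"missing": 0, "out_of_range": 0, "duplicates": 0}
    seen = set()
    for row in records:
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        for field, bounds in schema.items():
            value = row.get(field)
            if value is None:
                report["missing"] += 1
            elif bounds is not None and not (bounds[0] <= value <= bounds[1]):
                report["out_of_range"] += 1
    return report

# Hypothetical sensor feed with one duplicate, one incomplete,
# and one implausible reading.
readings = [
    {"sensor": "p1", "psi": 101.2},
    {"sensor": "p1", "psi": 101.2},   # duplicate reading
    {"sensor": "p2", "psi": None},    # incomplete
    {"sensor": "p3", "psi": -40.0},   # outside plausible range
]
print(assess_quality(readings, {"psi": (0.0, 500.0)}))
```

A report like this does what the commercial tools above automate at scale: it tells the team which error types dominate before any cleaning effort is committed.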
In manual data cleaning, the pipeline acts as a workflow of steps for the data scientist to follow; automating such pipelines can incrementally improve productivity in data science. Two universities leading research here are Columbia University and TU Berlin. Columbia's ActiveClean is an iterative data cleaning tool for predictive statistical models that suggests which sections of data to clean first, based on their importance to the use case and the likelihood that they actually are dirty. The tool acts as a prioritization system for the cleaning pipeline and simultaneously retrains the model on the recently cleaned data. TU Berlin's work, on the other hand, applies unsupervised machine learning to reconstruct data cleaning pipelines from data scientists' historical activity; the recovered pipelines then serve as guides when cleaning similar datasets.
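The prioritization idea behind ActiveClean can be sketched as a simple scoring heuristic: rank each record by (importance to the use case) × (likelihood it is dirty) and clean the highest-scoring rows first. This is an illustrative toy, not ActiveClean's published algorithm, and the scoring functions below are invented for the example:

```python
def prioritize_for_cleaning(records, importance, dirty_likelihood):
    """Rank records so the most impactful, most-likely-dirty rows
    come first. `importance(row)` and `dirty_likelihood(row)` are
    caller-supplied scoring functions returning floats in [0, 1]."""
    scored = [(importance(r) * dirty_likelihood(r), i)
              for i, r in enumerate(records)]
    scored.sort(reverse=True)  # highest combined score first
    return [records[i] for _, i in scored]

# Toy example: weight rows by transaction amount (importance) and
# treat rows with a missing category as likely dirty.
rows = [
    {"id": 1, "amount": 10.0, "category": "a"},
    {"id": 2, "amount": 900.0, "category": None},
    {"id": 3, "amount": 50.0, "category": None},
]
max_amount = max(r["amount"] for r in rows)
ranked = prioritize_for_cleaning(
    rows,
    importance=lambda r: r["amount"] / max_amount,
    dirty_likelihood=lambda r: 1.0 if r["category"] is None else 0.1,
)
print([r["id"] for r in ranked])  # highest-priority rows first
```

In a full ActiveClean-style loop, the model would be retrained after each batch of cleaned rows and the scores recomputed, so the prioritization adapts as cleaning progresses.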
Similar in nature to data quality assessment, data validation can be considered a subset of data quality and is used to assess the results of data cleaning efforts. Validation, however, generally deals with smaller and cleaner datasets than quality assessment does. The field has seen a large number of commercially available automation solutions that are script-based rather than machine learning-based. Several companies, including Paxata (acquired by DataRobot), Cognizant, Robocloud, and Tredence, are active here, with the predominant focus on enterprise data like emails and addresses and on the error types of duplicates, incompleteness, and inaccuracies.
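Script-based validation of the kind these vendors automate can be illustrated with a few lines of rule code. The sketch below checks a cleaned contact list for exactly the three error types named above; the rule set and the deliberately simple email pattern are assumptions for illustration, not any vendor's product logic:

```python
import re

# Deliberately simple email pattern for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(records, rules):
    """Run script-style post-cleaning validation: report rows that
    are duplicates, incomplete, or fail a field rule (inaccurate).
    `rules` maps a field name to a check returning truthy on pass."""
    failures = []
    seen = set()
    for i, row in enumerate(records):
        key = tuple(sorted(row.items()))
        if key in seen:
            failures.append((i, "duplicate"))
        seen.add(key)
        for field, check in rules.items():
            value = row.get(field)
            if value is None or value == "":
                failures.append((i, f"incomplete:{field}"))
            elif not check(value):
                failures.append((i, f"invalid:{field}"))
    return failures

contacts = [
    {"email": "a@example.com"},
    {"email": "a@example.com"},  # duplicate
    {"email": ""},               # incomplete
    {"email": "not-an-email"},   # inaccurate
]
print(validate(contacts, {"email": EMAIL_RE.match}))
```

Because the checks are deterministic scripts rather than learned models, they are cheap to run after every cleaning pass, which is why this segment of the market has matured faster than ML-based quality assessment.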
Overall, similar to the conclusions around data cleaning, it is evident that the predominant momentum for automation solutions in data quality and validation is also in enterprise data like personally identifiable information (PII) and financial data. For industrial applications, however, automation remains in its infancy, with research institutes like CSIRO currently building automated quality assessment tools for marine sensors.
In conclusion, companies should consider the following action items:
- Build an integrated automation solution for data preparation (quality assessment, cleaning, validation) if they are largely dealing with enterprise data and the primary error types are duplicates, inconsistencies, and incompleteness.
- Take inspiration from the research conducted by Columbia University and TU Berlin. The automation of data cleaning pipelines holds significant value where manual data cleaning is still the norm. As evident from our analysis, this is largely true for complex industrial data and error types like business rule violations and inaccuracies.
- Likewise, those interested should continue monitoring academia and research institutes for emerging solutions. As evident from the University of Waterloo's spinoffs Tamr and Inductiv (see our Automated Data Cleaning Tech Page), the major share of data cleaning innovation is currently happening in academia. Innovative ideas like the use of synthetic data to train data cleaning models are already surfacing in research.
- Blog: Emerging Data Cleaning Solutions
- Executive Summary: The Impact of COVID-19 on Tech Innovation