There are seven key areas of data management – creation, security, transmission, cleaning, analytics, storage, and sharing. Among all these, cleaning is widely regarded as the most labor- and time-intensive process for data scientists. For example, according to a survey by CrowdFlower, up to 80% of a data scientist's time is spent on finding, cleaning, and organizing data, leaving only 20% of their time to actually perform analysis. More remarkably, close to 60% of the respondents considered data cleaning the least enjoyable aspect of their job.
Not only does data cleaning have a direct, detrimental impact on the productivity of data science teams, but it also limits the scalability and true potential of digital transformation (DX), given that data analytics forms the cornerstone of any emerging digital solution. According to a survey by Gartner, close to 40% of DX initiatives fail due to poor data quality – highlighting an underserved aspect of data management that is ripe for innovation. Despite this, there are few commercially available solutions that can truly automate data cleaning effectively, a puzzling observation given the urgent need for industrial data cleaning solutions.
To understand the underlying bottlenecks behind data cleaning, we take a step back and first discuss what exactly dirty data is, followed by what the data cleaning process entails. Data, which can be broadly classified as industrial or enterprise, can become dirty for many reasons:
- Incomplete data: Data has missing or empty values in critical rows and columns. For example, a temperature data point from a sensor is missing the time and date at which the measurement was recorded.
- Incorrect data: Data has values that are outside its designated valid range – for example, temperature measurements in Kelvin that are below zero.
- Inaccurate data: Data is technically correct but not accurate for the context. For example, temperature measurements taken in the morning are recorded as measurements taken in the evening instead.
- Inconsistencies in data: Data has values that are recorded in different formats in the same sample. For example, temperature readings fluctuate between Kelvin and Celsius in the same dataset.
- Duplication of values: Data has values that are repeats or duplicates of each other. For example, a temperature measurement taken at noon is recorded twice in the dataset for the same day.
- Business rule violations: Data has values that are not aligned with the application requirements or business policies. For example, temperature readings between 200 K and 300 K are to be marked as "critical" in the data, but many rows are marked otherwise.
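Each of the six error types above can be detected programmatically before any correction begins. The following is a minimal sketch using pandas on a hypothetical sensor dataset (the column names and thresholds are illustrative assumptions, not a prescribed schema):

```python
import pandas as pd

# Hypothetical temperature-sensor dataset exhibiting several error types.
df = pd.DataFrame({
    "timestamp": ["2024-01-01 12:00", None, "2024-01-01 12:00", "2024-01-02 09:00"],
    "temp_kelvin": [293.5, 250.0, 293.5, -5.0],
    "status": ["normal", "critical", "normal", "normal"],
})

# Incomplete data: readings missing the timestamp of the measurement.
incomplete = df[df["timestamp"].isna()]

# Incorrect data: Kelvin readings can never be below zero.
incorrect = df[df["temp_kelvin"] < 0]

# Duplication: identical readings recorded more than once.
duplicates = df[df.duplicated()]

# Business rule: readings between 200 K and 300 K must be marked "critical".
rule_violations = df[df["temp_kelvin"].between(200, 300) & (df["status"] != "critical")]
```

Checks like these only flag problems; deciding how to fix each flagged row still requires the context of the application, which is why the steps below matter.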
In general, dirty data arises from three primary sources – human error while recording, joining, filtering, parsing, or decoding data; faulty raw sensor data in IoT applications; and data corruption caused by hardware or software failures in the storage layer. Regardless of the source, data cleaning involves preparing these dirty datasets for further analysis, and companies should use the following steps while rolling out data cleaning initiatives:
- Define good data: One of the first and most critical steps is to define what is meant by good data and use that definition to set an end goal for data cleaning. Clearly, the lower the volume of errors, the better. However, this does not mean that organizations should "boil the ocean" when it comes to cleaning data, which can be both costly and time consuming. Data should only be as clean as a specific application or use case requires. From this perspective, good data is data that is fit for operational use and decision-making for the business, measured by the prevalence and impact of the six error types discussed above.
- Determine time and resources available: After defining good data, organizations should review large samples of data to identify the total volume of errors in the dataset. This is done via a combination of software scripts and visual inspection by subject matter experts. An accurate estimate of the volume of errors helps plan budgets and timelines for data cleansing. For example, hiring a data cleaning vendor to clean 10,000 records can cost somewhere between $5,000 and $15,000 and take close to 2,500 hours per data scientist. For added context, the overall cleaning costs for a typical predictive maintenance dataset can range anywhere between $30,000 and $50,000.
- Search and correct errors: Regardless of the error type, data cleanup is mostly done via a combination of manual editing by subject matter experts and software scripts. Cleanup of some error types is easier to automate than others. For example, duplicates can be corrected via deduplication algorithms, an approach that is fairly mature. For most other error types, however, cleanup is still far from automated.
- Identify root causes of errors: Finally, organizations should take the extra step of investigating the underlying cause of the dirty data, be it human error or faulty sensors, so as to future-proof the data against recurring errors.
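As noted in the search-and-correct step, deduplication is among the more automatable corrections; normalizing inconsistent units is another. A minimal sketch of both, again using pandas with hypothetical column names:

```python
import pandas as pd

# Hypothetical dataset where the noon reading on 2024-01-01 was recorded twice.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "time": ["12:00", "12:00", "12:00"],
    "temp_kelvin": [293.5, 293.5, 291.0],
})

# Deduplication: keep the first of each repeated (date, time) reading.
deduped = df.drop_duplicates(subset=["date", "time"], keep="first")

# Inconsistency fix: normalize readings flagged as Celsius into Kelvin.
mixed = pd.DataFrame({"temp": [20.5, 293.5], "unit": ["C", "K"]})
mixed["temp_kelvin"] = mixed.apply(
    lambda r: r["temp"] + 273.15 if r["unit"] == "C" else r["temp"], axis=1)
```

Exact-match deduplication like this is the easy case; fuzzy duplicates (the same reading with slightly different timestamps) still tend to need subject-matter review, which is consistent with the observation above that most cleanup remains far from automated.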
Those embarking on digital transformation projects should first build a data cleaning strategy as outlined above and then prioritize their efforts accordingly. In an upcoming insight, we will discuss in detail some emerging solutions for automating certain aspects of the data cleaning process, thereby making DX projects easier to scale.