In our previous blog, "Data Cleaning 101," we discussed the various types of errors that underlie dirty data and presented a step-by-step process for designing data cleaning initiatives. In this follow-up post, we analyze the emerging startups and technologies that are bringing automation to the data cleaning process.
- Tamr – Founded in 2012, Tamr has raised close to $70 million in funding. It has developed an enterprise data unification platform that automates the process of organizing and unifying large datasets. The company uses a bottom-up, hybrid human-in-the-loop and machine learning approach to clean the data. Tamr first works with customers to define a target data schema for how the final data should look, then unifies the data using unsupervised machine learning, consulting subject matter experts at regular intervals to ensure accuracy. Using the same approach, Tamr also removes duplicate records from the data. Primary target industries for Tamr include manufacturing, energy, and pharmaceuticals. For further details, see the case study "Tamr creates a unified, consistent data lake for oil wells."
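Tamr's actual models are proprietary, but the deduplication step it automates can be illustrated with a minimal sketch: treat two records as duplicates when their string similarity exceeds a threshold, and keep only the first of each near-identical pair. The `dedupe` helper, the well names, and the 0.9 threshold below are all hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1] of how alike two strings are, case-insensitively."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedupe(records, threshold=0.9):
    """Keep a record only if it is not near-identical to one already kept."""
    kept = []
    for rec in records:
        if all(similarity(rec, k) < threshold for k in kept):
            kept.append(rec)
    return kept

wells = ["Well #42 North Ridge", "WELL 42 NORTH RIDGE", "Well #7 South Basin"]
print(dedupe(wells))  # ['Well #42 North Ridge', 'Well #7 South Basin']
```

In practice the pairwise comparison would be blocked or learned rather than brute-force, which is where the machine learning and the expert feedback loop come in.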
- Trifacta – Also founded in 2012, Trifacta has raised more than $220 million in funding. Trifacta's software, called "Wrangler," is an enterprise data wrangling platform that automates structuring and formatting of raw data (JSON, XML, relational database formats) from machines, web pages, and servers. Wrangler uses machine learning to suggest transformations to the end user for data like time, day, machine parameters, and GPS locations. Primary target industries for Trifacta include finance, health care, and telecom.
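Trifacta's suggestion engine is likewise proprietary, but the idea of proposing a transformation for raw values can be sketched as a simple format-inference step: try a set of candidate timestamp formats and suggest the one that parses the most samples. The `suggest_format` helper and its candidate list below are illustrative assumptions, not Trifacta's implementation.

```python
from datetime import datetime

# Hypothetical shortlist of formats the tool might propose to the user.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y %H:%M"]

def suggest_format(samples):
    """Return the candidate format that successfully parses the most samples."""
    def hits(fmt):
        count = 0
        for s in samples:
            try:
                datetime.strptime(s, fmt)
                count += 1
            except ValueError:
                pass
        return count
    return max(CANDIDATE_FORMATS, key=hits)

logs = ["2020-05-01", "2020-05-02", "2020-06-15"]
print(suggest_format(logs))  # %Y-%m-%d
```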
- Datalogue – Founded in 2016, Datalogue has raised close to $2.6 million in funding. It has developed data cleaning software that uses machine learning to identify missing or incorrect features in a dataset (from sources such as JDBC, ODBC, cloud storage, GCP, S3, and file systems) and correct them appropriately. The platform predominantly focuses on personally identifiable information (PII), such as names, phone numbers, and addresses. That said, Datalogue can also address a variety of other data categories, such as sales data from CRM platforms, industry-specific data like pharma research literature, and organization-specific data like an e-commerce company's internal product specifications. Primary target industries for Datalogue include manufacturing, finance, and telecom.
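Datalogue's deep learning models are not public, but a rule-based stand-in conveys the kind of PII correction described: normalize phone numbers to a canonical format and flag entries that remain incomplete for review. The `normalize_phone` helper and the 10-digit US-number assumption are illustrative only.

```python
import re

def normalize_phone(raw):
    """Strip punctuation, validate a 10-digit US number; return None if unrecoverable."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        return None  # flag as incomplete/incorrect for human review
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

print(normalize_phone("1-415-555-0134"))  # (415) 555-0134
print(normalize_phone("555-0134"))        # None
```

A learned model replaces these hand-written rules with patterns inferred from labeled examples, which is what lets it generalize to categories beyond PII.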
Alongside the startups above, a handful of early-stage companies are also targeting this space. Inductiv, founded in 2019 by a co-founder of Tamr, operated largely in stealth mode until it was recently acquired by Apple to support technology development for Siri. Although details remain undisclosed, Inductiv likely offered a product similar to Tamr's that handled data consistency and deduplication. Likewise, Delman, an Indonesian startup founded in 2016, raised a $1.6 million Series A in May 2020. Like Tamr, Delman offers automation for data consistency in the unification and structuring of datasets.
Although the broader data cleaning landscape is still largely open and uncrowded, some interesting trends are visible among the handful of active players:
- The majority of these startups target enterprise datasets like CRM records, addresses, or financial information and tailor their offerings to two or three industry sectors. Even though Tamr and Datalogue offer services for industrial data like IIoT feeds and oil and gas well data, the bulk of their focus remains on enterprise data.
- All of the above companies incorporate machine learning in some shape or form. Datalogue is the only one using deep learning models and offering full automation.
- Although most of these startups claim to target multiple data error types, the majority offer services to handle data inconsistency (structuring and wrangling). The only company that targets more complex error types is Datalogue; the company can identify and correct inaccurate and incomplete data entries.
In conclusion, those interested in data cleaning solutions should consider the following best practices:
- For duplicates and inconsistent formats, partner and work with existing service providers, as solutions for fixing these types of errors are more mature and commercially available.
- For completeness and correctness, build cleaning solutions either with an internal data science team or with external vendors. Furthermore, take inspiration from Datalogue's approach of using deep neural networks for data cleaning.
- For accuracy and business rule violations, focus on manual error correction given the lack of activity of emerging players that focus on fixing such errors.
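To make the last recommendation concrete, manual-first handling of business rule violations typically means automating only the detection: run each record through a set of hand-written rules and route violations to a human reviewer. The rule names and order fields in this sketch are invented for illustration.

```python
def check_rules(order):
    """Return the names of business rules the record violates."""
    rules = {
        "positive_quantity": lambda o: o["quantity"] > 0,
        "price_matches_total": lambda o: abs(o["quantity"] * o["unit_price"] - o["total"]) < 0.01,
    }
    return [name for name, ok in rules.items() if not ok(order)]

order = {"quantity": 3, "unit_price": 9.99, "total": 25.00}
print(check_rules(order))  # ['price_matches_total']
```

Correction stays manual because the right fix (was the quantity wrong, or the total?) usually requires business context no rule encodes.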
Be cognizant that, regardless of the error type, most solutions will bring only partial automation to data cleaning. Hence, while such solutions will improve a data scientist's productivity, they will not eliminate data cleaning from the data scientist's workload. In the upcoming final blog in this series, we will discuss alternative data cleaning techniques and ideas for expediting the broader process of data preparation for analytics.