Last month saw the introduction of Automatic Data Classification, a new machine learning (ML) powered feature in Collibra Catalog. This new feature increases the productivity of data stewards by automatically classifying data that is onboarded into our catalog. At Collibra, we believe that machine learning algorithms offer significant potential to enhance our products and improve our customers’ productivity. In this blog post, we share how we are doing this with Data Classification, as well as how we are thinking about building out our ML capabilities in the future.
Spend less time manually organizing data
A key part of the role of a data steward is to ensure data quality. Stewards need to accurately classify a wide variety of datasets across a range of disparate physical data sources. From an enterprise perspective, quality entails consistency. Being able to apply consistent business terminology (logical definitions) is crucial when analyzing data sourced from disparate systems and physical data sources.
Although data classification is a crucial responsibility for a data steward, it can also be a painstaking one. Most large enterprises are amassing ever-growing volumes of data, stored across a complex mix of systems. For anyone charged with classifying such large and diverse datasets, this translates into a lot of repetitive, manual work.
Automatic Data Classification was designed to make the classification process more scalable. We have not sought to automate the role of the data steward, but rather to make stewards more productive and free up their time for higher-value tasks. While stewards will continue to play a key role in data classification, our goal is to enable them to classify many of the more common data elements automatically, allowing them to focus on those that require more expert analysis.
How it works
After a data table is ingested, a data steward can navigate to the table or column asset page and run classification from the More dropdown.
Once the process is kicked off, sample data is sent to the classification engine to be analyzed. If a class match is found for an individual column, the engine returns that class along with a confidence level for the suggested match.
Users can then approve or reject the suggested data class, and that feedback improves the ML model’s future recommendations.
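To make the review step concrete, here is a minimal sketch of how a list of suggestions might be triaged before a steward approves or rejects them. It is illustrative only and does not reflect the Collibra API: the `Suggestion` type, the `triage` function and the 0.9 threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Suggestion:
    """Hypothetical shape of one per-column classification result."""
    column: str
    data_class: Optional[str]  # None when no class match was found
    confidence: float          # assumed to be in the range 0.0 to 1.0


def triage(
    suggestions: List[Suggestion], review_threshold: float = 0.9
) -> Tuple[List[Suggestion], List[Suggestion]]:
    """Split matched suggestions into high-confidence and needs-closer-review.

    The threshold is an assumption for illustration; a steward still
    approves or rejects every suggestion in both groups.
    """
    likely, needs_review = [], []
    for s in suggestions:
        if s.data_class is None:
            continue  # nothing was suggested for this column
        if s.confidence >= review_threshold:
            likely.append(s)
        else:
            needs_review.append(s)
    return likely, needs_review


results = [
    Suggestion("email", "Email Address", 0.97),
    Suggestion("addr_1", "Street Address", 0.62),
    Suggestion("notes", None, 0.0),
]
likely, needs_review = triage(results)
```

Sorting suggestions this way would let a steward confirm the obvious matches quickly and spend more time on the uncertain ones.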
Accurate, scalable and extensible
In designing our automated data classification engine we chose to focus on three core characteristics: accuracy, scalability and extensibility.
As with any ML initiative, our algorithms improve progressively as they learn to recognize more classes of data with greater precision. The current service is trained to recognize around 40 classes of data out of the box. These include a wide variety of personal information, physical address details, financial information, electronic communication details and product identifiers, as well as time and date records.
For each of these data types the classification engine considers several factors, including metadata, profiling data and the sample data itself, as part of the classification process. Every new dataset that is onboarded provides an opportunity for training to help the algorithm further improve its accuracy. We measure this accuracy using three core metrics: precision, recall and the F1 score (the harmonic mean of the first two). We are proud to say that our classification engine has achieved an overall F1 score of 98% on training data.
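For readers less familiar with these metrics, the sketch below computes precision, recall and F1 from raw counts using the standard definitions. The function name and the example counts are made up for illustration and are not Collibra's evaluation data.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from counts of
    true positives (tp), false positives (fp) and false negatives (fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 columns labeled correctly, 10 labeled with the
# wrong class, and 30 columns that should have been labeled but were missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Because the harmonic mean is pulled toward the lower of the two values, a high F1 score requires both precision and recall to be high.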
To ensure that the new service is scalable, we have optimized our classification engine for performance. It can automatically classify up to 100 columns of data per second and has been designed to be horizontally scalable, making use of an elastic pool of compute and storage resources.
Finally, we have made sure that our classification engine is fully extensible, which means Collibra Catalog users can configure the platform to recognize their own data classes. That might include proprietary data types, such as internal alphanumeric codes used to identify employees, customers or accounts; or it could include data targeted at specific use cases, such as unique product or supply chain information. The engine itself is very easy to train and we are happy to share our experiences and provide hands-on guidance to support that process.
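As a rough illustration of what a proprietary data class might look like, the sketch below checks how many sampled values fit an internal employee ID format. This is a simple pattern-matching stand-in and not how the trained engine works; the `EMP-` format, the `EMPLOYEE_ID` pattern and the `match_ratio` helper are all hypothetical.

```python
import re

# Hypothetical internal format: "EMP-" followed by exactly six digits.
EMPLOYEE_ID = re.compile(r"EMP-\d{6}")


def match_ratio(samples, pattern):
    """Return the fraction of non-empty sample values that fully match the pattern."""
    values = [v for v in samples if v]
    if not values:
        return 0.0
    hits = sum(1 for v in values if pattern.fullmatch(v))
    return hits / len(values)


sample_column = ["EMP-000123", "EMP-004567", "n/a", ""]
ratio = match_ratio(sample_column, EMPLOYEE_ID)
```

A ratio like this could serve as one input among several, alongside metadata and profiling data, when deciding whether a column belongs to a custom class.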
Future use cases
The first iteration of our ML algorithms was targeted at a very specific use case — automatically classifying newly ingested data. By keeping a narrow initial focus we have trained our algorithms to do one thing very well, and next, we will broaden our footprint by adding more use cases. We already have a number of those improvements included in our product roadmap.
One key step will be integrating our classification engine into Collibra Privacy & Risk. This will help significantly, not only by automatically recognizing personally identifiable information (PII) but also by automating the way that relevant policies are applied to that data. For example, knowing that a dataset includes PII relating to individuals in the EU means GDPR policies could be applied automatically, and CCPA policies could likewise be applied to PII relating to California residents.
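The policy idea can be sketched as a simple lookup from the region of the data subjects to the policies that would apply. This is a simplified illustration of the concept, not the planned integration: the `POLICIES_BY_REGION` table and the `applicable_policies` function are assumptions for this example.

```python
# Hypothetical mapping from data-subject region to applicable policies.
POLICIES_BY_REGION = {
    "EU": ["GDPR"],
    "California": ["CCPA"],
}


def applicable_policies(contains_pii, subject_regions):
    """Return the policies to apply, in order, without duplicates.

    Policies are only suggested when the dataset was classified as
    containing PII; regions without an entry contribute nothing.
    """
    if not contains_pii:
        return []
    policies = []
    for region in subject_regions:
        for policy in POLICIES_BY_REGION.get(region, []):
            if policy not in policies:
                policies.append(policy)
    return policies


policies = applicable_policies(True, ["EU", "California", "EU"])
```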
In addition to broadening use cases, we are also looking to deepen our algorithms’ intelligence by giving them more insight into the data itself. We plan to include an embedded data ontology — a hierarchical view of how different data classes relate to each other. As a simple example, an algorithm could know that it can use the “full name” column to populate “first name” and “last name” should those fields be missing.
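A toy version of the name example might look like the sketch below. The `PARENT_OF` table, the function names and the whitespace-based split are assumptions made for illustration; real names are far messier, and a production ontology would need much more careful rules.

```python
# Hypothetical ontology fragment: each child class points to its parent class.
PARENT_OF = {
    "first name": "full name",
    "last name": "full name",
}


def derivable(target, available_columns):
    """True if the target class has a parent class among the available columns."""
    parent = PARENT_OF.get(target)
    return parent is not None and parent in available_columns


def fill_missing_names(row):
    """Fill empty first/last name fields from "full name" when possible.

    Uses a naive split: the first token is the first name and the last
    token is the last name. Rows with a single-token full name are left alone.
    """
    filled = dict(row)
    for target, index in (("first name", 0), ("last name", -1)):
        populated = [key for key, value in filled.items() if value]
        if filled.get(target) or not derivable(target, populated):
            continue
        parts = filled["full name"].split()
        if len(parts) >= 2:
            filled[target] = parts[index]
    return filled


row = {"full name": "Ada Lovelace", "first name": "", "last name": None}
completed = fill_missing_names(row)
```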
We want your input
We developed our ML capabilities in response to demand from our customers and are always looking for further feedback to improve our product. If you would like someone to walk you through our new capabilities or have already started using them and want to provide your feedback, please do not hesitate to get in touch.