Last month saw the introduction of Automatic Data Classification, a new machine learning (ML) powered feature in Collibra Catalog. This new feature increases the productivity of data stewards by automatically classifying data that is onboarded into our catalog. At Collibra, we believe that machine learning algorithms offer significant potential to enhance our products and improve our customers’ productivity. In this blog post, we share how we are doing this with Data Classification, as well as how we are thinking about building out our ML capabilities in the future.
Spend less time manually organizing data
A key part of the role of a data steward is to ensure data quality. Stewards need to accurately classify a wide variety of datasets across a range of disparate physical data sources. From an enterprise perspective, quality entails consistency. Being able to apply consistent business terminology (logical definitions) is crucial when analyzing data sourced from disparate systems and physical data sources.
Although data classification is a crucial responsibility for a data steward, it can also be a painstaking one. Most large enterprises are amassing ever-growing volumes of data, stored across a complex mix of systems. For anyone charged with classifying such large and diverse data sets, this scenario translates into a lot of repetitive, manual work.
Automatic Data Classification was designed to make the classification process more scalable. We have not sought to automate the role of the data steward, but rather to make their function more productive and free up their time to focus on more value-added tasks. Helping to automate data classification is a key part of that value proposition. While stewards will continue to serve a key role in data classification, our goal is to enable them to classify a lot of the more common data elements automatically, allowing them to focus on those that require more expert analysis.
How it works
After a data table is ingested, a data steward can navigate to the table or column asset page and run classification from the More dropdown.
Once the process is kicked off, sample data is sent to the Classification engine to be analyzed. If a class match is found for an individual column, that class will be returned along with a confidence level associated with that suggested match.
Users can then go in and approve or reject the suggested data class, which will improve the ML model’s future recommendations.
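To make the flow above concrete, here is a minimal sketch of how a suggestion and a steward's decision could be modelled. Everything in it is an assumption for illustration: the `Suggestion` structure, the `review` and `feedback_examples` helpers, and the sample column names and confidence values are hypothetical and are not part of the Collibra API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    """A suggested data class for one column (hypothetical structure)."""
    column: str
    data_class: Optional[str]  # None when no class match was found
    confidence: float          # 0.0 to 1.0, as returned with the match
    status: str = "pending"    # "pending", "approved" or "rejected"


def review(suggestion: Suggestion, approve: bool) -> Suggestion:
    """Record a steward's decision on a suggested class."""
    if suggestion.data_class is None:
        raise ValueError("nothing to review: no class was suggested")
    suggestion.status = "approved" if approve else "rejected"
    return suggestion


def feedback_examples(suggestions):
    """Turn reviewed suggestions into (column, class, is_correct) labels
    that a model could later be retrained on."""
    return [
        (s.column, s.data_class, s.status == "approved")
        for s in suggestions
        if s.status != "pending"
    ]


# Illustrative values only.
suggestions = [
    Suggestion("email_addr", "Email Address", 0.97),
    Suggestion("col_17", "Phone Number", 0.41),
    Suggestion("notes", None, 0.0),
]
review(suggestions[0], approve=True)
review(suggestions[1], approve=False)
print(feedback_examples(suggestions))
```

Columns with no suggested class stay pending and produce no feedback, so only explicit approvals and rejections would feed back into training.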
Accurate, scalable and extensible
In designing our automated data classification engine we chose to focus on three core characteristics: accuracy, scalability and extensibility.
As with any ML initiative, our algorithms improve progressively as they learn to recognize more classes of data with greater precision. The current service is trained to recognize around 40 classes of data out of the box. These include a wide variety of personal information, physical address details, financial information, electronic communication details and product identifiers, as well as time and date records.
For each of these data types, the classification engine considers several factors, including metadata, profiling data and the sample data itself, as part of the classification process. Every new dataset that is onboarded provides an opportunity for training to help the algorithm further improve its accuracy. We measure this accuracy using three core metrics: precision, recall and the F1 score (the harmonic mean of the first two). We are proud to say that our classification engine has achieved an overall F1 score of 98% on training data.
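For readers less familiar with these metrics, the sketch below computes precision, recall and F1 for a single data class from true and predicted labels. The function name and the sample labels are made up for illustration and say nothing about how our engine is evaluated internally.

```python
def precision_recall_f1(true_labels, predicted_labels, target_class):
    """Compute precision, recall and F1 for one class.

    precision = TP / (TP + FP): how many predicted matches were right
    recall    = TP / (TP + FN): how many real matches were found
    F1        = harmonic mean of precision and recall
    """
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == target_class and truth == target_class:
            tp += 1
        elif predicted == target_class:
            fp += 1
        elif truth == target_class:
            fn += 1

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative labels for four columns.
truth = ["email", "email", "email", "phone"]
predicted = ["email", "email", "email", "email"]
precision, recall, f1 = precision_recall_f1(truth, predicted, "email")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a classifier cannot score well by maximizing recall at the expense of precision or the reverse.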
To ensure that the new service is scalable, we have optimized our classification engine for performance. It can automatically classify up to 100 columns of data per second and has been designed to be horizontally scalable, making use of an elastic pool of compute and storage resources.
Finally, we have made sure that our classification engine is fully extensible, which means Collibra Catalog users can configure the platform to recognize their own data classes. That might include proprietary data types, such as internal alphanumeric codes used to identify employees, customers or accounts; or it could include data targeted at specific use cases, such as unique product or supply chain information. The engine itself is very easy to train and we are happy to share our experiences and provide hands-on guidance to support that process.
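To give a feel for what a proprietary data class might look like, here is a deliberately simple, rule-based stand-in. It is not how the ML engine works and not how custom classes are configured in Collibra Catalog: the `EMP-` code format, the function names and the threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical internal employee code: "EMP-" followed by six digits.
EMPLOYEE_ID = re.compile(r"EMP-\d{6}")


def match_ratio(samples, pattern):
    """Fraction of non-empty sample values that fully match the pattern."""
    values = [v for v in samples if v is not None and v != ""]
    if not values:
        return 0.0
    hits = sum(1 for v in values if pattern.fullmatch(v))
    return hits / len(values)


def suggest_class(samples, pattern, class_name, threshold=0.8):
    """Return (class_name, ratio) when enough samples match, else (None, ratio)."""
    ratio = match_ratio(samples, pattern)
    return (class_name, ratio) if ratio >= threshold else (None, ratio)


# Illustrative sample values from one column.
column_samples = ["EMP-000123", "EMP-004567", "EMP-999999", "n/a", None]
print(suggest_class(column_samples, EMPLOYEE_ID, "Employee ID"))
print(suggest_class(column_samples, EMPLOYEE_ID, "Employee ID", threshold=0.7))
```

The match ratio plays the role of a crude confidence level here. A trained model would also weigh metadata and profiling data, as described above.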
Future use cases
The first iteration of our ML algorithms was targeted at a very specific use case — automatically classifying newly ingested data. By keeping a narrow initial focus we have trained our algorithms to do one thing very well, and next, we will broaden our footprint by adding more use cases. We already have a number of those improvements included in our product roadmap.
One key step will be integrating our classification engine into Collibra Privacy & Risk. This will help significantly, not only by automatically recognizing personally identifiable information (PII) but also by automating the way that relevant policies are applied to that data. For example, knowing that a dataset includes PII relating to an EU resident means GDPR policies could be applied automatically, and CCPA policies could likewise be applied to PII relating to California residents.
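As a rough sketch of the idea, policy selection could be expressed as a lookup from jurisdiction to regulation that only fires when PII has been detected. The mapping and the function name below are hypothetical and do not describe the Collibra Privacy & Risk integration itself.

```python
# Hypothetical mapping from a data subject's jurisdiction to a regulation.
POLICY_BY_JURISDICTION = {
    "EU": "GDPR",
    "California": "CCPA",
}


def applicable_policies(contains_pii, jurisdictions):
    """Return the sorted list of policies to apply to a dataset.

    No policies are returned when the dataset holds no PII, and
    jurisdictions without a known mapping are ignored.
    """
    if not contains_pii:
        return []
    policies = {
        POLICY_BY_JURISDICTION[j]
        for j in jurisdictions
        if j in POLICY_BY_JURISDICTION
    }
    return sorted(policies)


print(applicable_policies(True, ["EU", "California"]))
print(applicable_policies(False, ["EU"]))
```

In this sketch the classification result supplies `contains_pii`, which is why automatic classification is a natural first step toward automatic policy application.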
In addition to broadening use cases, we are also looking to deepen our algorithms’ intelligence by giving them more insight into the data itself. We plan to include an embedded data ontology — a hierarchical view of how different data classes relate to each other. As a simple example, an algorithm could know that it can use the “full name” column to populate “first name” and “last name” should those fields be missing.
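The "full name" example can be sketched in a few lines. The ontology dictionary, the function names and the naive whitespace split are assumptions made for illustration. Real names are far messier than this, and the planned feature may work quite differently.

```python
# Hypothetical ontology: child classes that can be derived from a parent class.
ONTOLOGY = {"full name": ("first name", "last name")}


def split_full_name(full_name):
    """Naively split a full name into first and last name.

    Returns an empty dict when there are fewer than two tokens.
    """
    parts = full_name.split()
    if len(parts) < 2:
        return {}
    return {"first name": parts[0], "last name": " ".join(parts[1:])}


def fill_missing(record):
    """Fill empty child fields from "full name" without overwriting values."""
    filled = dict(record)
    full_name = record.get("full name")
    if not full_name:
        return filled
    derived = split_full_name(full_name)
    for field in ONTOLOGY["full name"]:
        if not filled.get(field) and field in derived:
            filled[field] = derived[field]
    return filled


print(fill_missing({"full name": "Ada Lovelace", "first name": None}))
```

Existing values are never overwritten, so the derived fields only fill gaps.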
We want your input
We developed our ML capabilities in response to demand from our customers and are always looking for further feedback to improve our product. If you would like someone to walk you through our new capabilities or have already started using them and want to provide your feedback, please do not hesitate to get in touch.
Ben is working on tools to help data analysts and engineers make data meaningful.
© 2020 Collibra. All Rights Reserved.