The Data Catalog Landscape
Nowadays the market is flooded with data catalog solutions. They all aim at solving the problem that data consumers, and more specifically the expensive business analysts and data scientists, have: they spend more time trying to find the right data than they do actually using that data.
Most of these data catalogs focus strongly on ETL (extract, transform, and load of the actual data). Why? Because of the simple philosophy that a data catalog cannot exist unless it can connect to the data and populate itself. These catalogs answer the question “Where is my data stored?” Other catalogs go a bit further by letting the subject matter experts (SMEs) for a data set contribute their knowledge, for example by adding a ‘description’. They answer the question “What is this data set about?”
What Makes Collibra Unique?
At Collibra, our vision for a data catalog is slightly different. We consider the data loading as a given, the “where is it stored” as a secondary question to the more important “where is it coming from,” and the presence of SME knowledge as a basic feature. We believe that what transforms a data dictionary into a data catalog that is useful for the business analyst is the crowdsourcing of its metadata and the governance of its contents.
To support this vision, the Collibra Catalog Portal doesn’t take you to a page that shows an abundance of databases with names that only the people who created or maintain them can understand. It also doesn’t take you to a blank search page where you must come up with initial search terms that are both general enough to get results and specific enough to find useful data.
Although we support these options as well, our Catalog Portal welcomes you by recommending data sets that are interesting to you.
Now, these data sets are not just columns grouped in their physical tables. Collibra data sets are Catalog data sets that anyone can create inside the catalog to group tables and columns from multiple sources in support of a specific task. So although the actual data still resides in its physical location and is not moved or duplicated, the Collibra logical data set layer lets you tie a combination of data sources to a specific task, which can be described in that data set. Grouping data from different sources in such a logical data set is the first step in the metadata crowdsourcing we aim for.
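To make the idea of a logical data set concrete, here is a minimal Python sketch that models it as a named list of references to physical columns. The class names, field names, and source names (ColumnRef, LogicalDataSet, loans_db, risk_warehouse) are illustrative assumptions, not Collibra's data model or API.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class ColumnRef:
    """Pointer to a physical column; the data itself stays where it is."""
    source: str  # hypothetical source system name
    table: str
    column: str


@dataclass
class LogicalDataSet:
    """A named, task-oriented grouping of column references."""
    name: str
    description: str
    created_by: str
    columns: List[ColumnRef] = field(default_factory=list)

    def add(self, ref: ColumnRef) -> None:
        # Only the reference is stored; nothing is copied or moved.
        if ref not in self.columns:
            self.columns.append(ref)

    def sources(self) -> Set[str]:
        """Distinct source systems this data set draws from."""
        return {ref.source for ref in self.columns}


# Hypothetical example: one data set spanning two source systems.
dataset = LogicalDataSet(
    name="Counterparty Risk Rating",
    description="Columns needed for the counterparty risk rating report",
    created_by="analyst@example.com",
)
dataset.add(ColumnRef("loans_db", "outstanding_loans", "customer_id"))
dataset.add(ColumnRef("risk_warehouse", "liquidity_risk", "customer_id"))
print(sorted(dataset.sources()))
```

Because the data set holds only references, deleting or reshaping it never touches the underlying tables; the description and creator fields are where the crowdsourced context would live in this sketch.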
Imagine yourself as a business analyst accessing the data catalog to find data for your report on “Counterparty Risk Rating”.
Some questions you will certainly have:
1. Which is the best data to use for my report? (Fit for purpose)
2. If it comes from different sources, will I be able to join this data?
3. Who is most knowledgeable about this data? Who can give me some context around it?
4. What is the quality of the data? Or in other words, can I trust it to base my report on?
5. Who can give me access to the data?
6. Where does the data in these data sets originate from? How can I trace the data back to its source (data lineage)?
Most data catalogs will answer question 4 and maybe questions 5 and 6, and so does Collibra.
However, at Collibra we want to answer all the things you really want to know. So instead of starting your search by imagining which components you need to assemble your report (e.g. outstanding loans and liquidity risk of the most important customers), you search for the end result you want to achieve, “counterparty risk rating report”, and find that someone else has already grouped different data sources into a logical data set for a similar purpose.
In the blink of an eye, it is clear that this data set has been tagged and rated by users and is related to a number of business terms. This is the second step of Collibra metadata crowdsourcing, and it tells you a great deal about how people value that data set and its sources. Or would you put more trust in a data set that has been ‘officially certified’ than in user feedback alone? The Collibra platform’s workflow capabilities allow you to set up a dedicated certification process, so that the only thing you need to look for as a data consumer is the ‘certified’ ribbon on your data set, as you can see below.
When you visualize the lineage of this data set in Catalog, you will see that it contains data coming from two sources of varying quality. You also see who owns the data in those sources and, more importantly, who combined them into this logical data set called “Counterparty Risk Rating” in the past.
So answering question #1, “Which is the best data to use for my report?”, gets a jump start when you look at the data sources used in this data set.
A Data Catalog with Built-In Governance
Many data catalog companies claim to do data governance “as well”, mentioning it as a nice side dish to their mains. At Collibra, we do not believe in bolt-on data governance. We believe it should be the built-in core of the data catalog.
With built-in data governance covering not only data sets but also usages such as reports, data sharing agreements, business processes, and more, the complete data lineage between data sources and all their usages can be visualized in Catalog. Every usage shows the people involved, which tells you who to contact for more information. Looking at this kind of “Forward Lineage” shows you how the data has been put to use in the past.
This lineage also answers question #2, “Will I be able to join the data when it comes from different sources?” Being able to visualize the data sets or reports generated in the past shows you how the combination has been used before.
That leaves you, the analyst, with question #3: “Who is most knowledgeable about the data?”
As stated before, the Collibra Catalog is seamlessly integrated with the market-leading data governance platform of Collibra. This means that not a single piece of data can get into Collibra Catalog without an owner being assigned to it. By doing this, we impose governance from the very beginning. This becomes important later on, when someone wants to get access to the data or report a problem with it. With an owner for each uploaded data element and a complete history captured inside Collibra Catalog, no question should remain unanswered.
On top of this data governance platform, other applications such as business glossary, data helpdesk, policy manager, and stewardship can easily be leveraged to help people in your organization find, understand and trust the data they are searching for. These applications provide context to the data sets and allow you to see how they fit into a bigger picture.
Linking the data sets and data usages from the catalog to the business terms in the glossary is crucial for the assessment of the usability and usefulness of a data set. Being part of the same platform allows Catalog to even propose links between data sets and business terms.
The business glossary adds business context to the data, both to data sets and to their usages.
How about data sets that have issues? Let’s say the quality of the data set that seems most interesting is too low to be usable. Why is that, and most importantly, who can fix it? Having a data helpdesk at your disposal through your catalog guarantees the issues that exist with the data you need can be solved quickly and accurately.
And are the data sets in line with the policies and standards that are in place around data protection? Did the mass-ingested data breach any of them? How do we control or correct this?
Having a policy manager linked to the Catalog allows policies and standards to be enforced upon ingested data and visualized in the Catalog.
And then there are the people who make it all happen: the data stewards. Some are technical stewards, who make sure the technical metadata of all data sets in the catalog is well maintained. Others are business stewards, who define and refine business terms, their definitions, and lines of business. They all use the common data governance backbone of Collibra, which lets them collaborate with each other and gives them well-defined responsibilities in every process. As a result, you as an analyst can request any data as easily as ordering on Amazon and trust that the access request will find its way to the right people.
Shop for Data
The last crucial step in the Catalog experience for the business analyst is getting access, so there needs to be an easy way to request access to the data. Because Catalog users can combine data from different sources into logical data sets, or reuse logical data sets made in the past, getting approval from all the different data owners could be a nightmare. How do you contact the owners? When are they available? What if they are on holiday? These are questions that an analyst searching for data cannot answer. To solve this problem, the Collibra Catalog introduces a “Data Basket.”
It is a real shopping basket for data sets, which takes care of all the access requests behind the scenes. Compare it to a real webshop. If you order items online, you don’t want to go and find out who at Amazon is responsible for getting the item to the packaging service, or where the items are stored in the warehouse. Or if you order items with different lead times, you don’t want to be the one aligning them all.
The same goes for the Collibra Data Shop. The platform’s workflow engine starts the whole process of verifying where the data is stored, finding out who is responsible, and finding out who is standing in for the responsible person if they are ill or on holiday. All you as an analyst will see is that your “order has been received” and that “approval for your access request is pending”. You will be alerted once your data is available or, if access is not granted, you will get the reason why and the contact details of someone who can give you more details.
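The routing logic described here can be sketched in a few lines of Python: for every item in the basket, look up the owner and, if that person is unavailable, follow the chain of stand-ins until someone can approve. This is only an illustration under stated assumptions; the names Owner, resolve_approver, and route_basket are hypothetical and do not represent Collibra's workflow engine or its API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Owner:
    name: str
    available: bool = True
    delegate: Optional[str] = None  # stand-in during illness or holiday


def resolve_approver(owner_name: str, owners: Dict[str, Owner]) -> Optional[str]:
    """Follow the delegate chain until an available person is found."""
    seen = set()  # guards against circular delegation
    current: Optional[str] = owner_name
    while current is not None and current not in seen:
        seen.add(current)
        owner = owners.get(current)
        if owner is None:
            return None
        if owner.available:
            return owner.name
        current = owner.delegate
    return None


def route_basket(
    basket: List[str],
    data_owner: Dict[str, str],
    owners: Dict[str, Owner],
) -> Dict[str, Optional[str]]:
    """Map each requested data set to the person who should approve it."""
    return {item: resolve_approver(data_owner[item], owners) for item in basket}


# Hypothetical example: alice is on holiday, so bob approves in her place.
owners = {
    "alice": Owner("alice", available=False, delegate="bob"),
    "bob": Owner("bob"),
    "carol": Owner("carol"),
}
data_owner = {"outstanding_loans": "alice", "liquidity_risk": "carol"}
routing = route_basket(["outstanding_loans", "liquidity_risk"], data_owner, owners)
print(routing)
```

The analyst only ever submits the basket; who ends up approving each item is resolved behind the scenes, which is the point of the webshop comparison.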
The Most Important Asset
So when your company is looking for a data catalog, it should keep in mind that it is not buying an IT tool to help manage the data. It is buying a tool that helps data consumers find, understand and trust the data that is available in the company. We at Collibra believe that the only valid offering is a data catalog that is part of a platform which brings together every aspect of the data governance organization, including its most important asset, the people, to collaborate on generating this understanding and trust.
Peter is a Product Manager at Collibra responsible for the Collibra Catalog Application. In the past, he worked as a product manager for a large automotive supplier and was in charge of Quality Assurance software.