Organization, Context and Judgment: Going Further on the Data Intelligence Journey
We are tasked with assessing why a company is experiencing a high rate of customer churn. How do we drive for an outcome that is accurate, actionable, and low effort so the company can prevent future churn? We’re on a Data Intelligence journey in 12 steps. Here’s where we started; now, let’s consider the next steps.
In this series, we’re on a journey. We’re following Cliff, a business analyst who’s been tasked with trying to find out why his company — which is doing well on many fronts — is experiencing a concerning trend of high customer churn. Given the large numbers at stake, the company must move quickly but they must accurately uncover the root cause and prescribe an action plan that works. Inaction is bad. The wrong action is worse. The answer is in the data. But what data?
This is the foundation of Data Intelligence. Data belongs to every knowledge worker and should flow through the organizational ecosystem in such a way as to let business professionals connect, communicate and collaborate in every way they need and choose. We’re with Cliff as he tries to find solutions to this real-world problem.
In the last piece, we covered the first three steps of this journey.
Although these steps are arduous and sometimes tedious, skipping or rushing through them is tantamount to surrendering to the lure of a quick answer without substance. So, let’s move forward.
Most companies have vast quantities of data and data sources fragmented throughout the enterprise. Cataloging is the process of discovering and registering, with context, these vast data holdings as well as artifacts that use data, such as reports, APIs, algorithms, etc., so that knowledge workers like Cliff can easily search for and locate items of interest and need.
This seems like the most fundamental aspect of data management, but again, the obstacles are considerable and time-consuming. Cataloging encompasses:
Cataloging starts with discovery — identifying and distinguishing between databases, reports, algorithms, APIs, topics and more. It organizes (and reorganizes) the data into accessible fields like tables/columns, and tracks data movement, such as from workbook to report. It not only accelerates machine learning but highlights the data used in the process.
The goal is to identify every data-related element and associate it with its logical peer in your canonical domain model (from Step 2). As an example, imagine that in Step 2 you created the Customer Domain Model and one of the logical attributes of the Customer is Date of Birth. While cataloging your Salesforce Automation solution, you discover a physical column with a heading of Attr_Dt. Short of specialized knowledge, it would be challenging to determine what this physical attribute represents. The next logical step a data steward might take is to assess the table name, the adjacent column names, and even sample data in the Attr_Dt column. This could take minutes or more. Now imagine you have 5 million physical columns like Attr_Dt. At 1 minute per physical attribute, this effort would take a data steward nearly 40 person-years to complete (8-hour days, 261 work days per year). Given how unacceptable that would sound to anyone, cataloging requires automation not only to discover but to contextualize, also known as classifying, your data vis-à-vis the logical domain model. By automating those logical next steps (table name, adjacent columns, sample and content inspection) with a classification algorithm (machine learning), companies can radically reduce 40 person-years of manual effort to months, weeks or even days.
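The classification step described above can be sketched as a simple scoring routine. This is an illustrative toy, not Collibra’s algorithm: the rule weights, the `Customer.DateOfBirth` label, and the column names are all assumptions standing in for a trained machine learning classifier.

```python
import re

# Toy sketch of automated column classification: guess a physical column's
# logical attribute from its name, its neighboring columns, and sampled
# values. Real platforms train ML classifiers; fixed rules stand in here.

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def classify_column(name, neighbor_names, sample_values):
    """Return a (logical_attribute, score) guess for one physical column."""
    score = 0
    # Signal 1: the column name itself hints at a date ("dt", "date", "dob").
    if re.search(r"dt|date|dob", name.lower()):
        score += 40
    # Signal 2: adjacent columns suggest a customer record.
    if any("cust" in n.lower() for n in neighbor_names):
        score += 20
    # Signal 3: the sampled content actually parses as dates.
    if sample_values and all(DATE_PATTERN.match(v) for v in sample_values):
        score += 40
    # Accept the match only when enough independent signals agree.
    if score >= 60:
        return ("Customer.DateOfBirth", score)
    return (None, score)

print(classify_column("Attr_Dt", ["Cust_Name", "Cust_Id"],
                      ["1984-03-12", "1990-07-01"]))
# -> ('Customer.DateOfBirth', 100)
```

The point is only that each signal a steward would check by hand can be checked programmatically and aggregated into a confidence score, which is what turns decades of manual effort into machine time.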
It is not appropriate to consider cataloging an event; rather, it is an ongoing process. New data sets not previously cataloged will appear. That said, you do not have to catalog everything in your ecosystem to start delivering value to your users, like Cliff. The bootstrapping of your Data Intelligence graph — associating and linking nodes via cataloging — invites adoption, use, and contributions. It is the equivalent of opening your store for business before you have stocked every shelf; your consumers will be happy to make suggestions and comments on what else you need to focus on.
Continuing with the store-opening metaphor, where the foundational steps (Steps 1–3) are the equivalent of constructing the physical building and cataloging is the equivalent of stocking your shelves in a thoughtful (classified) way, it is logical to think that a shopper like Cliff will want to know a bit more about an item on the shelf before placing it in his shopping cart. Two questions of considerable interest are: where did this data come from, and who else uses it? Incorporating data lineage into your knowledge graph addresses these specific questions. Not unlike cataloging, data lineage requires a discovery and harvest method. There are various ways to discover and harvest lineage information, such as from SQL (e.g., stored procedures), ETL/ELT technologies, Reporting/BI platforms, and code scans. A processor of the harvested information should unearth, at its core, a physical element (node 1), a second physical element (node 2), and any logic (e.g., a transformation) applied to the value from node 1 before it is inserted into node 2. These nodes can then be linked back to the cataloged nodes (called stitching), and a link (or edge) is made between the two nodes in the Data Intelligence Graph. Lineage helps organizations connect different systems and processes to offer a complete picture of how data flows across the enterprise at the conceptual, logical and physical layers. Where cataloging enables discovery of data at rest, lineage shows how it got there and where it goes next.
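A minimal sketch of the harvest-and-stitch idea described above: each harvested record names a source element (node 1), a target element (node 2), and the transformation between them, and the stitched edges can then be walked backward to answer “where did this value come from?” The record format and node names are illustrative assumptions, not a Collibra data model.

```python
from collections import defaultdict

# Each harvested lineage record: (source node, target node, transformation
# logic applied to the source value before it lands in the target).
harvested = [
    ("crm.customer.attr_dt", "dw.dim_customer.birth_date",
     "CAST(attr_dt AS DATE)"),
    ("dw.dim_customer.birth_date", "bi.churn_report.age",
     "DATEDIFF(year, birth_date, NOW())"),
]

# Stitch: index the edges by target so each node links back to its sources.
upstream = defaultdict(list)
for source, target, transform in harvested:
    upstream[target].append((source, transform))

def trace(node, depth=0):
    """Walk the stitched edges from a node back to its origins."""
    for source, transform in upstream.get(node, []):
        print("  " * depth + f"{node} <- {source}  [{transform}]")
        trace(source, depth + 1)

# Ask where a report column ultimately comes from.
trace("bi.churn_report.age")
```

In a real Data Intelligence graph these edges would be stitched to the cataloged nodes from the previous step, so the same traversal answers both “where did it come from?” and “who else uses it?”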
Data lineage reveals how data is transformed through its life cycle across interactions with systems, applications, APIs and reports. It automatically maps relationships between data to show how data sets are built, aggregated, sourced and used, providing complete, end-to-end relationship visualization.
This adds accuracy and understanding to raw data, enhances trust, and fuels sharper inferences and business insights. It even enables impact analysis at a granular level — columnar, table, or business report — of any changes to downstream systems.
This is a strategic advantage, and it reflects recent advances. For much of the digital era, data architects had to build relationships manually between large data volumes to create lineage graphs. Newer technologies allow for most of this work to be done (almost) automatically, and much more efficiently. Today, by extracting lineage automatically from dispersed source systems and keeping it up to date, organizations can devote resources to strategic initiatives rather than endless data mapping.
Besides the clear business advantages — such as helping Cliff identify patterns in customer behavior — lineage can play a key role in ensuring compliance. A technical lineage view allows users to visualize transformations, drill down into table/column/query-level lineage, and navigate through data pipelines. This is important for providing necessary information to regulators.
Again referring back to the shopping example, a common request when assessing two or more options is the ability to compare them based on what is important to you. Let’s suppose that when Cliff is looking for a data set for his churn analysis, he believes Age is an important criterion to segment on. As Cliff evaluates his options for Customer data across a range of systems (e.g., Salesforce Automation, ERP, Order Management, Web) that have been cataloged and classified to the Customer Domain Model, it would be beneficial for him to determine which of these data sets offers the best data quality and veracity. Continuing the shopping metaphor, Cliff wants to look at the ingredients and compare them across his options.
A common method for providing visibility into the “ingredients” is to profile the content of a given physical column of data. Profiling extracts statistical information such as row count, % null, % invalid, frequency distribution, minimum length, and so on. This information might be useful to users steeped in data science and data quality, but for shoppers like Cliff, it is largely unreadable or requires too much effort to use as a guide to rapid decision making. But imagine if you could compute a Score based on all the statistical information gathered from profiling. And further imagine if this Score helped you quickly rank and order your options, making the choice of the best data set to leverage easy, accurate, and trustworthy. Cliff could then look at his options side by side, quickly determine what is best for his analysis, and confidently put the request in his shopping cart.
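As a sketch of how profile statistics might collapse into a single comparable Score: the weights and formula below are illustrative assumptions, not a product formula. The only point is that completeness, uniqueness, or any other profiled statistic can be combined into one number Cliff can rank on.

```python
# Toy profiling-and-scoring sketch. The 70/30 weighting of completeness
# vs. uniqueness is an arbitrary illustrative choice.

def profile(values):
    """Compute simple profile statistics for one column of raw values."""
    total = len(values)
    nulls = sum(1 for v in values if v in (None, ""))
    non_null = [v for v in values if v not in (None, "")]
    return {
        "rows": total,
        "pct_null": nulls / total if total else 0.0,
        "pct_distinct": len(set(non_null)) / total if total else 0.0,
    }

def quality_score(stats):
    """Collapse profile stats into a 0-100 score (higher = more usable)."""
    completeness = 1.0 - stats["pct_null"]
    uniqueness = stats["pct_distinct"]
    return round(100 * (0.7 * completeness + 0.3 * uniqueness), 1)

# Compare two candidate "Age" columns side by side, as Cliff would.
crm_ages = ["34", "51", None, "28", "45", "", "39", "62"]
erp_ages = ["34", "51", "29", "28", "45", "47", "39", "62"]

for name, column in [("CRM", crm_ages), ("ERP", erp_ages)]:
    print(name, quality_score(profile(column)))
# -> CRM 75.0
# -> ERP 100.0
```

Presented this way, the raw statistics a shopper like Cliff would never wade through become a single number that makes the ERP data set the obvious pick for an age-based segmentation.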
In business terms, these capabilities give Cliff greater freedom to develop his own analyses. If he’s looking at subsets such as customer age range, zip code or buying frequency to identify patterns behind the churn, he can work with only the most trustworthy data, rather than sifting through all of it and risking reliance on the data of least value.
To summarize: first, we built the shopping center. Second, we acquired the items for sale, then organized and placed them on shelves for easy discovery. Third, we made it easy to determine where each item came from, what happened along its journey, who else uses it, and where it goes from there. And finally, we provided easy-to-use ingredient lists and comparison scores as visual clues in support of comparison shopping. Cliff is extremely productive and confident about what he wants, but it seems like it will be complicated to gather and assemble it all.
Cliff’s journey is picking up speed. We have a few steps to go. Stay tuned.
Jim’s charge is to deliver world-class solutions that can empower customers to disrupt and lead their respective markets.
© 2020 Collibra. All Rights Reserved.