Data lineage diagrams: A paradigm shift for info architects


Our data landscape today and why it is a problem for an Information Architect
Many companies, especially in Financial Services and Healthcare among others, have a hugely scattered application landscape. From front-office to back-office systems and across several data warehouses, organizations have many local and global single points of truth and a vast diversity of business reporting tools, ranging from plain old MS Excel to the more popular BI tools like Qlik and Tableau.

An Information Architect is known by many job titles (Data, Application, Solution, Process, or Software Architect, to name a few), but for the sake of clarity let’s just call him or her an Information Architect. This person is responsible for bringing clarity to a gigantic spider web of data sources, systems, files, interfaces, processes, data warehouses, regulatory reports, internal management reports, public shareholder disclosures, and dashboards, as well as the more diverse big data lakes and systems.

Information Architect Confusion

Source: LinkedIn

Very often, none of these systems are adequately documented, and even if there is documentation, it is often outdated. Sounds familiar, right?

The trend toward cloud data warehouses, software-as-a-service, big data, and the internet of things is certainly not heading toward consolidating and centralizing multiple data sources into one single data location. Instead, it is quite likely that our current spider web will just become a massive BIG spider web, and therefore the problems IAs face today will only become BIGGER tomorrow.

How does an IA tackle this problem today, and why is it not working
So how does a regular Information Architect tackle this challenge to create a nice, easy-to-navigate, easy-to-understand, easy-to-maintain, easy-to-document, and more importantly, easy-to-consume architectural picture of this application and data chaos? And how do they do so when faced with time pressure to comply with demanding regulations like GDPR, BCBS 239, CMS, and others?

Well, probably one step at a time and one data flow at a time. But what’s the starting point? Do you start at the end with the reports? But which reports first? For financial institutions, it makes sense to start with your compliance report models (e.g., the European Data Point Model): where do the numbers in my report come from? Healthcare institutions might start with the systems that provide an adequate picture of patient history. Other industries will have other starting points. A popular approach is to use a Critical Data Elements methodology.
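The question "where do the numbers in my report come from?" is, at its core, an upstream walk through a graph of data flows. Here is a minimal sketch in Python, assuming a hand-written dictionary of flows. The system names and the `trace_upstream` helper are hypothetical and purely illustrative. They are not part of any Collibra product or API.

```python
from collections import deque

# Hypothetical lineage: each key is fed directly by the systems in its list.
UPSTREAM = {
    "regulatory_report": ["finance_dwh"],
    "finance_dwh": ["loan_system", "general_ledger"],
    "loan_system": ["crm"],
    "general_ledger": [],
    "crm": [],
}

def trace_upstream(element, upstream):
    """Return every system that directly or indirectly feeds `element`, nearest first."""
    seen = []
    queue = deque(upstream.get(element, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another flow
        seen.append(node)
        queue.extend(upstream.get(node, []))
    return seen

sources = trace_upstream("regulatory_report", UPSTREAM)
print(sources)
```

Starting from a report and walking backwards like this gives the IA a first list of systems and SMEs to investigate for that one data flow.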

So first our IA will spend numerous days, weeks, and months investigating and talking to the different SMEs of all those different systems and business processes. He will capture all of this information and write it down (in another file somewhere on the network).

Why isn't this working

Source: Using Microsoft Visio to Reverse Engineer a Database

As a next step, our IA will pick one stream and design an elaborate architectural picture of the different systems and applications interacting with each other, including how the data flows from these systems to the different data warehouses, how the data warehouses feed the different reporting tools, and how those tools produce hundreds of reports. Hopefully he will use one of the many classical data lineage tools on the market to automate some of that work.

Next, our IA will publish these architectural beauties and distribute them in a read-only PDF format to the different business users and analysts within different departments. Ultimately he will find out that nobody uses them. Why? Because everybody has a different background and a different vocabulary (business versus technical language) and a different need for granularity of information (management wants a high-level picture, a mortgage loan specialist is looking for a more detailed picture, and an auditor wants to see it all and be able to go into the nitty-gritty details). Even the DBA needs to understand the context for data.

As a result, a lot of time, effort, and money is spent on designing good-looking architectural pictures which, understandably, do not accommodate everybody’s needs, quickly become outdated, and never hold the level of detailed information and documentation that is required.

And even when the architectural pictures are good enough, the consumers are faced with the traditional governance challenges:

  • Where can I find them?
  • Who owns them?
  • Who maintains them?
  • Who can help me explain them?
  • Are they still up-to-date?

How many of those architectural pictures are just consuming disk space? Or worse, how many are kept up-to-date at a cost nobody wants to have on their budget? Clearly the current approach isn’t working.

It’s time to rethink the paradigm
Collibra has the solution for all IAs, CIOs, and CDOs who need to sponsor and are, in the end, accountable for a properly managed, documented, and controlled IT landscape.

Our goal is to inject a completely different way of working into your IA’s DNA. We aim to provide state-of-the-art, intelligent data ingestion, profiling, and data lineage embedded in our market-leading data governance platform. With our tooling and platform, we want to automate what can be automated, and complement the automation with easy, collaborative, and flexible ways of crowdsourcing the data lineage and governance. As an IA you are not alone: you can rely on the data citizens to work effectively with you and bring clarity to the BIG spider web of systems, applications, and reports.

Rethink with Collibra

Edit a definition of a Business Term from within the diagram

Carefully aligned metadata and data lineage-focused capabilities

Our next generation visualization, combined with our new Collibra Catalog product, seamlessly integrates with our Data Governance Center platform, allowing IAs to:

  • Design a “to-be” architectural application landscape based upon existing, automatically-generated technical and business lineage between applications, data warehouses, and reports
  • Identify all critical data elements and specify the data quality controls and rules from within the “to-be” architecture
  • Launch standard workflows from within your interactive diagram to trigger a review of the items visualized on it, and to create, validate, or request approval of linked business terms
  • Govern those architectural diagrams as semi-static / semi-dynamic pictures, define different types of ownership and subject matter experts, manage changes and versions of those diagrams, and ask stakeholders questions from within the diagrams by tagging users
  • Compare the governed, certified, and approved to-be diagram with the current auto-generated as-is situation based upon metadata ingested with Catalog
  • Automatically highlight real-life changes in your application landscape (pulled into Collibra with Catalog) onto your approved and distributed IT application landscape
  • Visualize different overlays which show application ownership, process ownership, data quality scores, automated or manual processes, data sharing agreements, data usage and access requests, number of outstanding data issues, and more
  • Log new and process existing data issues from within the diagrams based upon those data quality and ownership overlays
  • Via Google Maps-like traceability, find the most data-quality-proof flow between back office system A and regulatory report system B
  • View your data lineage through different lenses: a business view, a technical view, a security view, a QA view, etc., using our out-of-the-box views and easy filtering capabilities
  • Crowdsource the information and documentation that is an essential part of the architectural design, which is actually the only sustainable way to keep it up-to-date, i.e., allow your data citizens to contribute Wikipedia-style in a controlled yet flexible and open way
  • Allow all data citizens to contribute to the accuracy of a diagram and the documentation of the items visualized on the diagram either pro-actively or via interactively guided governance workflows
  • Crowdsource via labelling, rating, and liking the data lineage diagrams, and allow machine learning algorithms to propose the data lineage diagrams you are looking for based upon search keywords, context, and your user profile
  • Drill down from a high-level data lineage diagram (e.g., application level only) into a more detailed part of the diagram showing the different ETL flows from one application to another, zoom further into a specific ETL from one set of tables to one destination table, and zoom further still into the specific mapping at column level and the mapping logic between one column and the other
  • Request access, from within the diagram, to a data set linked to a system, data warehouse, or file on a network
  • Automatically tag private or sensitive data elements via machine learning algorithms and visualize these on the data lineage diagrams using our overlay capacity
  • Utilize intelligent search capabilities inside and outside the data lineage diagrams
  • Create a 3D visualization with enhanced insights, thanks to the possibility of visualizing multiple detailed layers in one single diagram
  • And many more robust features to come
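
To make the "most data-quality-proof flow" idea from the list above more concrete, here is a minimal sketch in Python. It assumes the lineage is available as a weighted graph in which every flow carries a data quality score between 0 and 1, and that the quality of a route is the product of the scores along it. The graph, the scores, and the `best_quality_path` function are hypothetical illustrations, not Collibra's implementation or API.

```python
import heapq

# Hypothetical lineage graph: source -> {target: quality score of that flow (0..1)}.
FLOWS = {
    "back_office_A": {"staging": 0.90, "legacy_dwh": 0.99},
    "staging": {"report_B": 0.95},
    "legacy_dwh": {"finance_dwh": 0.80},
    "finance_dwh": {"report_B": 0.99},
    "report_B": {},
}

def best_quality_path(flows, start, goal):
    """Return (score, path) for the route whose multiplied quality scores are highest.

    Dijkstra-style search: every score is at most 1, so extending a path can
    never raise its score, and the first time the goal is popped it is optimal.
    """
    heap = [(-1.0, [start])]  # negated scores: the min-heap pops the best path first
    done = set()
    while heap:
        neg_score, path = heapq.heappop(heap)
        node = path[-1]
        if node == goal:
            return -neg_score, path
        if node in done:
            continue
        done.add(node)
        for nxt, quality in flows.get(node, {}).items():
            if nxt not in done:
                heapq.heappush(heap, (neg_score * quality, path + [nxt]))
    return 0.0, []  # no route between start and goal

score, path = best_quality_path(FLOWS, "back_office_A", "report_B")
print(path, round(score, 3))
```

In this toy landscape the route through `staging` wins even though its first hop scores lower, because the weak `legacy_dwh` to `finance_dwh` flow drags the other route down. That is the kind of trade-off a quality overlay on a lineage diagram makes visible.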

Data lineage diagrams: an IA’s best friend
While it will be extremely hard or impossible to enforce a company-wide single point of truth for the data flowing through your IT landscape, Collibra will help you to govern, design, document, quality stamp, keep up-to-date, certify, and distribute your data in no time and in a way that is easy to maintain.


If you would like to participate in usability tests, beta tests, or feature brainstorms, then please subscribe to our User Participation Program.

