Defining Data Lineage: A Beginner’s Guide
Back by Popular Demand!
As the topic of data governance becomes increasingly important, it still amazes me how many people approach it from the wrong angle. So many times, the conversation revolves around technical, system-oriented challenges. And it’s not really surprising that this is the case. For years, the ETL and metadata management vendors have been putting a “sticker” on their products and touting that they provide “data governance.” And while they do provide some metadata and technical data lineage capabilities, in reality, these tools are just bigger mops for data janitors. Let me explain what I mean with an example.
The business always has a need to access information, and often this includes moving data between systems. When IT needs to integrate systems, they determine the data that needs to move based on requirements and analysis of the sources and targets. They document their findings and designs, usually in a flurry of Word documents, Excel spreadsheets, Visio flowcharts, or all of the above. This includes details on how the data will be moved, including how frequently the data needs to move (daily vs. hourly vs. real time), quality thresholds that need to be respected, which rules need to be checked, and more. After analysis and design, the solution needs to be implemented, and someone in IT builds the code (an ETL, a script, …). Before the solution goes into production, it is tested. At each of these points, the organization knows exactly where the data came from, how it is being used, and how it moves between systems.
Now fast forward six months. The people who worked on the original project have moved on. The documentation of the design is misplaced, or worse, completely missing. Any revision to the integration, or any attempt to understand how changes may impact the system – and more importantly, the business – requires reverse engineering and analysis rework, including repeating all the mistakes that were made the first time.
Multiply this by all the data movements already up and running, and the ones being planned and built, and it is clear that IT has a mess on its hands. And since they need to solve their immediate problem, they look to the tools they know – the data management tooling that is quietly humming away processing data bits. There is a belief that these tools will somehow reverse engineer the solution IT originally put in place. They try to scan all sorts of data processing code to tell the business where the data came from. The problem, however, is that the outputs are too low level, incomplete, and basically meaningless to the business.
It’s a very reactive approach that I compare to cleaning a house flooded with water. IT’s reaction is to get a stronger, bigger mop. This seems like it may help them clean up their mess, but it doesn’t provide a true solution to the problem at hand. Simply put: IT has the wrong tool for the job.
The better approach, in my opinion, is to proactively stop the water from flooding the house in the first place. In the case of data governance, this means putting a control process in place right from the start, or in our example, formalizing the process already in place, including business and IT interaction. Make sure that you’re not building things that people can’t find. Use true enabling artifacts such as mapping specifications and data sharing agreements to proactively drive the process. Create system sensors: control points that scan the source and target systems when something has changed, and fire off an issue notifying stewards. Make sure that the business is only responding to the exceptions, instead of making exceptions your business.
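To make the idea of a system sensor a little more concrete, here is a minimal sketch in Python. It assumes a mapping specification can be reduced to the columns a data movement expects on each side, and that a scan of the source and target systems returns the columns actually found there. Every name in it (`MappingSpec`, `Issue`, `run_sensor`, the example systems and steward) is hypothetical and chosen for illustration; it is not the API of Collibra or of any other product.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MappingSpec:
    """Hypothetical mapping specification for one data movement."""
    name: str
    source_columns: dict[str, str]  # column name -> type expected in the source
    target_columns: dict[str, str]  # column name -> type expected in the target
    steward: str                    # who is notified when something changes


@dataclass(frozen=True)
class Issue:
    """An exception raised by the sensor, addressed to a steward."""
    spec_name: str
    steward: str
    system: str  # "source" or "target"
    detail: str


def diff_columns(expected: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Describe every difference between the agreed and the observed columns."""
    details = []
    for col in sorted(expected.keys() - observed.keys()):
        details.append(f"column '{col}' is missing")
    for col in sorted(observed.keys() - expected.keys()):
        details.append(f"unexpected column '{col}'")
    for col in sorted(expected.keys() & observed.keys()):
        if expected[col] != observed[col]:
            details.append(
                f"column '{col}' changed type from {expected[col]} to {observed[col]}"
            )
    return details


def run_sensor(
    spec: MappingSpec,
    observed_source: dict[str, str],
    observed_target: dict[str, str],
) -> list[Issue]:
    """Compare both systems against the spec and return one issue per change."""
    issues = []
    for system, expected, observed in (
        ("source", spec.source_columns, observed_source),
        ("target", spec.target_columns, observed_target),
    ):
        for detail in diff_columns(expected, observed):
            issues.append(Issue(spec.name, spec.steward, system, detail))
    return issues


# Example: the source system changed the type of customer_id after go-live.
spec = MappingSpec(
    name="crm_to_warehouse_customers",
    source_columns={"customer_id": "int", "email": "varchar"},
    target_columns={"customer_id": "int", "email": "varchar"},
    steward="customer-data-steward",
)
issues = run_sensor(
    spec,
    observed_source={"customer_id": "varchar", "email": "varchar"},
    observed_target={"customer_id": "int", "email": "varchar"},
)
for issue in issues:
    print(f"[{issue.spec_name}] notify {issue.steward}: {issue.system} {issue.detail}")
```

When nothing has changed, `run_sensor` returns an empty list and nobody is bothered. That is the point of the design: the specification is the agreed baseline, and stewards only hear about the exceptions.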
Collibra is the right tool for the job. It provides all of these capabilities – and more. It empowers all data users to become data citizens through true data governance. And it helps you to drive real value from your data by enabling you to first understand where your data is coming from, where it’s been, how it’s being used, and who’s using it. See, Collibra isn’t just buzzword compliant. Collibra IS data governance.
Stan is the co-founder and CTO at Collibra and leads the global product organization. He’s responsible for product management and UX, Collibra’s Center of Excellence, and Collibra University, Collibra’s online learning platform. Prior to founding the company he was a senior researcher at the Vrije Universiteit of Brussels, a leading semantic research center in Europe, performing application-oriented research in semantics.