“Where did this data actually come from?” If that question causes a collective sigh in your meetings, you’re likely staring into a data lineage black hole. Custom code and modern data tools, while powerful, often hide data’s true journey, eroding trust and wasting valuable time. In fact, Forrester finds data teams spend up to 70% of their time just trying to verify and prepare their data. To build trust and accelerate insights, organizations need a complete, end-to-end view of their data’s journey, without any blind spots. That is why we are thrilled to announce our latest lineage feature.
What’s new: OpenLineage Integration
Collibra now supports OpenLineage, the emerging open-source standard for data lineage collection and analysis. This integration provides an automated mechanism to consume OpenLineage events within the Collibra Platform, allowing you to see a more complete, end-to-end picture of how your data is created and transformed across your entire data landscape.
This feature directly addresses the “lineage gap” created by modern data stacks that rely on a mix of tools, including custom Python scripts and ETL or orchestration platforms not yet supported by standard harvesters. By adopting this open standard, we are not only enabling you to capture lineage from a wider array of sources today, but also streamlining the development of future lineage integrations. The result is deeper data traceability, enhanced trust and a more comprehensive understanding of your most critical data assets.
How OpenLineage Integration helps
The modern data stack is diverse and dynamic, often featuring a complex web of technologies and custom-built code. This complexity creates lineage blind spots, making it incredibly difficult to perform accurate impact analysis, ensure regulatory compliance, or even trust the data used in reporting. Our OpenLineage integration is designed to illuminate these dark corners of your data estate.
This new capability solves several critical challenges:
- Unsupported source gaps: It captures lineage from custom Python scripts and from ETL or orchestration tools that standard harvesters do not yet support
- Lack of end-to-end provenance: It provides a complete data journey, making it easier to meet audit and compliance requirements
- Inefficient impact analysis: It accelerates your ability to understand the upstream and downstream effects of a change to any data asset
- Difficult user onboarding: It helps new team members get up to speed faster by providing clear, visual maps of complex data flows
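To make the impact-analysis point concrete, here is a minimal sketch of how downstream dependencies could be derived from OpenLineage-style run events. It is not how Collibra computes lineage internally; the events, the job names, and the dataset names are all hypothetical, and only the `inputs`/`outputs` fields with their `namespace` and `name` keys come from the OpenLineage event format.

```python
from collections import defaultdict, deque


def build_edges(events):
    """Map each input dataset to the datasets produced from it."""
    edges = defaultdict(set)
    for event in events:
        for src in event.get("inputs", []):
            for dst in event.get("outputs", []):
                edges[(src["namespace"], src["name"])].add(
                    (dst["namespace"], dst["name"])
                )
    return edges


def downstream(edges, start):
    """Breadth-first walk returning every dataset reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical datasets and events, for illustration only.
SOURCE = {"namespace": "postgres://db.example.com:5432", "name": "crm.public.customers"}
STAGING = {"namespace": "snowflake://acme-account", "name": "ANALYTICS.STAGING.CUSTOMERS"}
REPORT = {"namespace": "snowflake://acme-account", "name": "ANALYTICS.REPORTING.CHURN"}

events = [
    {"job": {"namespace": "demo", "name": "load_customers"},
     "inputs": [SOURCE], "outputs": [STAGING]},
    {"job": {"namespace": "demo", "name": "build_churn_report"},
     "inputs": [STAGING], "outputs": [REPORT]},
]

edges = build_edges(events)
impacted = downstream(edges, (SOURCE["namespace"], SOURCE["name"]))
```

Here `impacted` contains both the staging table and the report, which is the set of assets someone would want to review before changing the source table.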
How OpenLineage Integration works
Our integration allows the Collibra Platform to ingest and display lineage information that adheres to the OpenLineage format. This enables Collibra to tap into a growing ecosystem of data processing tools that can emit lineage data through this open standard. From a technical standpoint, the process is straightforward and leverages your existing Collibra infrastructure.

Leverage OpenLineage to bring automated, end-to-end data lineage from Apache Airflow and AWS Glue into the Collibra Platform.
For data sources that generate OpenLineage metadata (such as Spark jobs running on AWS Glue), you start by collecting the lineage files. This is typically done by installing a third-party, open-source collection agent in your processing environment. Once collected, the OpenLineage files are placed in a cloud storage bucket of your choice that your Collibra Edge site can access. From there, Collibra’s lineage harvesting system ingests the files, processes the information, and stitches the technical lineage into your data catalog.
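As a rough illustration of the collection step, the sketch below builds two OpenLineage run events by hand and writes them to a file, one JSON object per line. Several things here are assumptions rather than documented Collibra behavior: the file name, the one-event-per-line layout, the `PRODUCER` URI, and the spec version in `SCHEMA_URL`. A temporary local directory stands in for the cloud storage bucket; check the Collibra documentation for the layout the harvester actually expects.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

PRODUCER = "https://example.com/my-pipeline"  # hypothetical producer URI
# Assumed spec version; confirm against the OpenLineage spec you target.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def make_run_event(event_type, run_id, job_name, inputs, outputs, namespace="demo"):
    """Build one OpenLineage run event as a plain dictionary."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": inputs,
        "outputs": outputs,
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


def write_events(events, directory):
    """Write events to a new file in `directory`, one JSON object per line."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"openlineage-{uuid.uuid4()}.json"  # hypothetical naming
    with path.open("w", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")
    return path


# Hypothetical datasets; a temp directory stands in for the storage bucket.
source = {"namespace": "s3://raw-bucket", "name": "orders/2024"}
target = {"namespace": "s3://curated-bucket", "name": "orders_clean"}
run_id = str(uuid.uuid4())

with tempfile.TemporaryDirectory() as tmp:
    path = write_events(
        [
            make_run_event("START", run_id, "clean_orders", [source], [target]),
            make_run_event("COMPLETE", run_id, "clean_orders", [source], [target]),
        ],
        tmp,
    )
    lines = path.read_text(encoding="utf-8").splitlines()
    first = json.loads(lines[0])
```

In practice a collection agent usually produces these files for you, and uploading them to the bucket would be handled by your cloud provider's tooling rather than a local write.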
Why you should be excited
OpenLineage can help all members of an organization, from the data engineer to the business analyst:
- Data Engineer
  - Standardize custom lineage: Finally, a standardized way to emit lineage from your custom Python scripts and complex Spark jobs, ensuring your hard work is visible and understood within the enterprise data catalog
  - Validate data flows: Visually confirm that your data pipelines are operating as designed and that data is flowing correctly between systems
- Data Governance Manager / Compliance Officer
  - Achieve complete auditability: Close critical gaps in your lineage to provide regulators with a complete, end-to-end record of data provenance for compliance mandates like GDPR and BCBS 239
  - Mitigate risk: Confidently assess the impact of data quality issues or schema changes across systems that were previously a black box
- Data Analyst / Business User
  - Build deeper trust: Gain confidence in your reports and dashboards by seeing the full, verified journey of the data, including the custom transformations it underwent
  - Self-serve with context: Independently understand the origin and context of a data asset without having to track down the engineer who built the pipeline
Key use cases for OpenLineage
OpenLineage helps you:
- Trace data through a custom Python ETL pipeline: A data science team uses a custom Python script to pull customer data from a PostgreSQL database, clean and transform it, and load it into Snowflake. By instrumenting the script to emit OpenLineage events, the data engineer can now see this custom process visualized in Collibra, creating a clear line of lineage from the PostgreSQL source to the Snowflake tables used by the analytics team
- Audit an AI model’s training data: A financial services company needs to prove to regulators which datasets were used to train its new credit risk model. The MLOps pipeline, which uses an orchestration tool like Airflow that supports OpenLineage, emits lineage metadata for each training run. Collibra ingests this data, providing a complete and immutable audit trail that links the specific version of the model to the exact datasets involved in its creation
- Verify a complex cloud migration with AWS Glue: An organization is migrating its enterprise data warehouse to the cloud using AWS Glue for transformations. By enabling the OpenLineage agent for their Glue Spark jobs, they capture detailed, column-level lineage of the entire migration process. This allows the migration team to use Collibra to visually verify that all data has been mapped correctly and that complex business logic was preserved, de-risking the go-live event
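For the custom Python ETL use case, "instrumenting the script" could look something like the sketch below: a small context manager that emits a START event before the work begins and a COMPLETE or FAIL event afterwards. Everything named here is hypothetical (`lineage_run`, the `emit` callback, the producer URI, the assumed spec version, the dataset names, and the toy `clean` step), and the events are built as plain dictionaries. The open-source OpenLineage Python client could be used instead of hand-built dictionaries.

```python
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone

PRODUCER = "https://example.com/customer-etl"  # hypothetical producer URI
# Assumed spec version; confirm against the OpenLineage spec you target.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


@contextmanager
def lineage_run(emit, job_name, inputs, outputs, namespace="demo"):
    """Emit START on entry, then COMPLETE or FAIL depending on the outcome."""
    run_id = str(uuid.uuid4())

    def event(event_type):
        return {
            "eventType": event_type,
            "eventTime": datetime.now(timezone.utc).isoformat(),
            "run": {"runId": run_id},
            "job": {"namespace": namespace, "name": job_name},
            "inputs": inputs,
            "outputs": outputs,
            "producer": PRODUCER,
            "schemaURL": SCHEMA_URL,
        }

    emit(event("START"))
    try:
        yield run_id
    except Exception:
        emit(event("FAIL"))
        raise
    else:
        emit(event("COMPLETE"))


def clean(rows):
    """Toy transformation: drop rows without an email and normalize the rest."""
    return [{"email": r["email"].strip().lower()} for r in rows if r.get("email")]


# Hypothetical dataset identifiers for the PostgreSQL source and Snowflake target.
SOURCE = {"namespace": "postgres://db.example.com:5432", "name": "crm.public.customers"}
TARGET = {"namespace": "snowflake://acme-account", "name": "ANALYTICS.PUBLIC.CUSTOMERS_CLEAN"}

collected = []  # stands in for a real transport (file, HTTP, message queue)

with lineage_run(collected.append, "clean_customers", [SOURCE], [TARGET]):
    rows = [{"email": " Ada@Example.com "}, {"email": None}]  # stand-in for the extract
    cleaned = clean(rows)  # the load into Snowflake is omitted here
```

Wrapping the job in a context manager keeps the lineage code out of the transformation logic and guarantees that a failed run still produces a terminal event.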

Visualize the source code facet from OpenLineage under the lineage transformation.
Key takeaways about OpenLineage Integration
Our integration with OpenLineage is a testament to our commitment to delivering a truly unified data intelligence platform. By embracing a leading open standard, we are breaking down the silos between custom-built systems and the enterprise governance landscape. This initiative reinforces the power of one platform, creating a single, comprehensive view of your data’s journey, no matter which tool or code was used to move it. It ensures that lineage is no longer a fragmented puzzle but a complete, actionable map that drives trust and accelerates innovation.
Catch up on all our recent announcements by watching the full June launch webinar recording.
How to get started
Ready to integrate OpenLineage into your Collibra Platform instance? See our documentation on how to bring metadata from OpenLineage standalone, Airflow and AWS Glue into Collibra.