A data schema is like a blueprint for a building. Similar to the way a blueprint outlines the design and specifications of a building, a schema defines the structure and organization of data. It gives a complete picture of your data warehouse or data lake, with details such as name, type, and format of data.
What happens when these details change? Or when you lose one of the puzzle pieces from the complete picture? It naturally affects the associated operations and applications. Given a choice, you are not likely to change the schema because it takes considerable time to get everything back on track.
Schema changes happen more often than not. Just like blueprints getting changed when buildings need renovation. Tables get added, columns get renamed, and schemas get deleted. Keeping track of these changes is a nightmare, not to mention the surprises your warehouse or data lake can throw. What happens when you suddenly discover a change? What if your enterprise applications get in a spin when schemas don’t match?
As a data engineer, you want to stay on top of such changes and avoid business disruption. But there are reasons you miss them.
- The detection tool cannot cope with the overwhelming size of data storage.
- Workflows do not detect and assess the potential impact of schema changes.
- Workflows are not designed to notify you in time.
Most of these reasons are related to how you detect and notify the changes, and you can manage them with the right tool.
Why do schemas change?
There are several scenarios where schemas can change. The change can be planned or unplanned. Let’s look at some common use cases.
- Data migration: If data gets corrupted during migration, you may discover unexpected changes in the schema. Data reconciliation after the migration is essential to avoid such surprises.
- Updates in the database: Database updates often prompt schema changes when you
- Add features, fix issues, or improve the performance of your database system.
- Normalize the structure of the data to reduce redundancy and improve data consistency.
- Establish new relationships between tables.
- Remove existing relationships between tables when they are no longer necessary or relevant.
- Change in data platform: If your organization switches to a different database platform, it may need to change the data schema to conform to the new platform’s requirements.
- Shifts in business requirements: When your organization needs to accommodate changes in business requirements, you may want to change the schema.
- Add new fields to the data schema for new types of data to be collected and stored.
- Remove fields from the schema if certain types of data are no longer needed or relevant.
- Modify the data type of a field to reflect the nature of the data being stored correctly. For example, a field that previously stored only integers needs to be modified to accept decimal values.
- Introduction of additional data sources: The data landscape keeps changing constantly, and you change the schema when adding data sources.
- Updates in standards: When industry or technology standards change or new regulations get introduced, schema changes become inevitable.
What happens when schema changes are not detected?
If the schema changes go undetected, data quality gets highly impacted. You can lose data, or some data can become unusable. For example, if a column is added to a table, the existing data may not fit into the new schema. This situation can lead to data loss or corruption.
Some of the impact issues are discussed below.
- Data inconsistency: If the changes are not propagated correctly to all the systems using that data, it can introduce data inconsistencies. When its impact goes undetected for a long time, the cause of data discrepancies may be difficult to diagnose and fix.
- Performance issues: A schema change can also affect the data processing systems. For example, if the schema changes result in several additional joins or queries, they can cause significant performance issues.
- ETL pipeline failures: If the pipeline is not designed to handle schema changes, it can cause ETL pipeline failures. For example, if the pipeline assumes the schema will remain constant, it will break and not dynamically adapt to changes.
When you finally discover and understand the schema change, fixing the issues may require data downtime. During this period, the data systems will be unavailable, affecting business operations.
Collibra for automatic alerts on schema changes
Collibra Data Quality & Observability automatically alerts you on schema changes without any additional settings. It detects the schema columns and reads or infers their data types (varchar, string, double, decimal, int, date, timestamp, and others). In the case of JSON, XML, and some other formats, the schema columns are not in the header but nested throughout the document. Collibra detects schema changes for all these formats, so you don’t have to worry.
Collibra constantly monitors your data and examines schemas, tables, and columns for changes. It observes each dataset, and if a column is altered, removed, or added, automatically raises the event via its standard composite scoring system.
Getting started with Collibra
Schema changes happen, and they can hurt your data quality. But you want to know the changes before they affect your applications or operations. The right tool, like Collibra Data Quality & Observability, can automatically detect schema changes in your data warehouse or data lake. It notifies the changes in time, helping you avoid extended data downtime and potential business disruption.