Data Quality · Nov 18, 2022 · 4 mins read

Announcing Data Quality Pushdown for Snowflake (in Beta)

We are announcing a new feature: Data Quality Pushdown for Snowflake. This beta feature is intended to shorten time to value for data quality users who also use cloud databases. Cloud-native vendors now support workloads that scale to hundreds of concurrent jobs, with auto-scaling and other functionality. This is possible in part because cloud-native databases offer more user-defined functions (UDFs) and machine learning (ML) capabilities than ever before. Collibra has built on this growth to deliver a pushdown solution for data quality and observability.

Running a DQ job without a pushdown option

When you run a DQ job without the pushdown option, you define certain parameters, such as the columns or the range you want. Then you also define some ML layers, such as Outliers or Patterns.

All of this work requires processing, which an Apache Spark compute engine performs. The entire dataset defined by your parameters is read into Spark, which has its own memory, CPUs, and other compute resources. Spark reads the data, then partitions and sorts it to execute the query. After that, it writes data out and processes it further to detect Outliers and Patterns.

The source data sits inside a database and is read out of it. All processing of the user requirements takes place in Spark, which then writes the results to the DQ Metastore.

What is Data Quality Pushdown for Snowflake?

Before explaining Pushdown for Snowflake, let’s look at what Snowflake is. It is a best-of-breed cloud-native data platform. The Snowflake data platform is not built on any existing database technology or ‘big data’ software platforms, such as Hadoop. Instead, Snowflake combines a completely new SQL query engine with an innovative architecture natively designed for the cloud. To the user, Snowflake provides all of the functionality of an enterprise analytic database, along with many additional special features and unique capabilities.

In the Pushdown model, the Collibra DQ Agent that creates the Apache Spark DQ Job is not needed. Instead, pushdown uses the database engine to do this work: the processing is sent to the compute on Snowflake, which reduces physical data movement.
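The difference in data movement can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the source database, not Snowflake or Collibra DQ, and the orders table, its columns, and the function names are hypothetical. Both functions count null values in a column; the first reads every row out before computing, while the second lets the database compute the aggregate and return a single row.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory SQLite table standing in for the source database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    rows = [(1, 10.0), (2, None), (3, 25.5), (4, None), (5, 7.25)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


def null_count_read_out(conn):
    """Read-out style: move every row to the client, then compute there."""
    rows = conn.execute("SELECT id, amount FROM orders").fetchall()
    nulls = sum(1 for _id, amount in rows if amount is None)
    return nulls, len(rows)  # (null count, rows moved out of the database)


def null_count_pushdown(conn):
    """Pushdown style: the database computes the aggregate; one row returns."""
    rows = conn.execute("SELECT COUNT(*) - COUNT(amount) FROM orders").fetchall()
    return rows[0][0], len(rows)  # (null count, rows moved out of the database)


conn = make_demo_db()
print(null_count_read_out(conn))  # (2, 5)
print(null_count_pushdown(conn))  # (2, 1)
```

Both approaches return the same null count, but the pushdown version transfers one result row instead of the whole table.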

Why do we need Pushdown for Snowflake?

Pushdown is an alternative compute option for running a DQ Job, in which all data quality processing is submitted to the target data warehouse. To use pushdown, you must run a Snowflake Pushdown setup script, which Collibra provides to customers. The script creates a dedicated Snowflake Virtual Warehouse and a service account user for DQ Job runs. This service account user needs read access on all schemas that contain the target data.
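To give a sense of what such a setup involves, here is a rough sketch that assembles Snowflake SQL statements for a dedicated warehouse, a role, and a read-only service user. This is not the script Collibra provides. The function name, the object names, the warehouse size, and the suspend timeout are all assumptions chosen for illustration, and authentication settings for the user are deliberately left out.

```python
import re


def pushdown_setup_statements(warehouse, role, user, database):
    """Return Snowflake SQL for a dedicated warehouse and a read-only user.

    Illustrative only: every object name is a caller-supplied placeholder,
    and the sizing and timeout values are assumptions.
    """
    for name in (warehouse, role, user, database):
        # Accept only plain identifiers so names cannot inject extra SQL.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"invalid identifier: {name!r}")
    return [
        f"CREATE WAREHOUSE IF NOT EXISTS {warehouse} "
        "WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE",
        f"CREATE ROLE IF NOT EXISTS {role}",
        f"CREATE USER IF NOT EXISTS {user} "
        f"DEFAULT_ROLE = {role} DEFAULT_WAREHOUSE = {warehouse}",
        f"GRANT ROLE {role} TO USER {user}",
        f"GRANT USAGE ON WAREHOUSE {warehouse} TO ROLE {role}",
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role}",
        f"GRANT USAGE ON ALL SCHEMAS IN DATABASE {database} TO ROLE {role}",
        f"GRANT SELECT ON ALL TABLES IN DATABASE {database} TO ROLE {role}",
    ]


# Hypothetical names, printed as a script an administrator could review.
for statement in pushdown_setup_statements("DQ_WH", "DQ_ROLE", "DQ_SVC", "SALES"):
    print(statement + ";")
```

Granting only USAGE and SELECT keeps the service account read-only, which matches the read access requirement described above.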

A few more points explain why Snowflake Pushdown is a better alternative.

Compute resources: When a DQ Job runs in Snowflake pushdown mode, you can take advantage of the Snowflake architecture, so scale is not fixed. When demand grows, the server nodes can auto-scale, and then scale back down again as required.
Ephemeral bursting: A lot of processing on Snowflake can “burst” to 64 or 128 nodes. A large DQ Job working on millions of rows and hundreds of columns would cause Snowflake to burst. After the DQ Job, the system would scale back down. This is an advantage of the SaaS (Software as a Service) model over static hardware.
Data Privacy: With Snowflake Pushdown, your customer data is never read out of the Snowflake environment. This feature is valuable for privacy regulation compliance and information security assurance.

So what exactly will Data Quality Pushdown for Snowflake do for our customers? It will auto-generate SQL queries to offload the DQ compute to the data source. It will reduce the amount of data transfer and remove the Apache Spark computation of the DQ Job.
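As a rough idea of what auto-generated SQL could look like, the sketch below builds a single profiling query from a job definition (a table, a list of columns, and an optional range filter). It is an assumption about the general technique, not the SQL that Collibra DQ actually generates. The function, table, and column names are hypothetical, and sqlite3 stands in for the data source so the example runs anywhere.

```python
import sqlite3


def build_profile_query(table, columns, where=None):
    """Build one SQL statement that profiles each column inside the database."""
    parts = ["COUNT(*) AS row_count"]
    for col in columns:
        # COUNT(col) skips NULLs, so the difference is the null count.
        parts.append(f"COUNT(*) - COUNT({col}) AS {col}_nulls")
        parts.append(f"COUNT(DISTINCT {col}) AS {col}_distinct")
    sql = f"SELECT {', '.join(parts)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql


def run_demo():
    """Run the generated query against a small in-memory stand-in table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (day TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            ("2022-11-17", "ok"),
            ("2022-11-18", "ok"),
            ("2022-11-18", None),
            ("2022-11-18", "fail"),
        ],
    )
    sql = build_profile_query("events", ["status"], where="day = '2022-11-18'")
    return conn.execute(sql).fetchone()


print(run_demo())  # (3, 1, 2): row count, null count, distinct values
```

Only the one-row summary leaves the database, which is the reduction in data transfer described above.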

In summary

Collibra Data Quality Pushdown for Snowflake (in Beta) can deliver savings for customers through lower TCO, lower management costs, higher efficiency, and improved on-demand scaling. You can eliminate the need for a separate Apache Spark compute platform to run Collibra Data Quality & Observability.

Laurent Weichberger

Sr. Manager of Customer Success for Data Quality, Collibra

Laurent manages some of the Collibra DQ Customer Success clients, specializing in Apache Spark and the DQ APIs. He has worked at a number of Big Data firms over the past ten years, including Hortonworks, DataStax, Cloudera, Databricks and more. As our Big Data Bear, he brings his vast experience to bear on discovering DQ use cases and on ensuring that DQ customers are successful with their use case implementations. He lives in North Carolina with his wife, children and their cat Khadija.

Nita Dembla

Director, Software Engineering, Collibra

Nita Dembla is passionate about SQL and query performance. She has extensive experience in building and working on Data Warehouses, which she gained in her previous roles at IBM, Hortonworks, and Cloudera. She joined Collibra in 2022 to lead the Pushdown initiative.

Evan Nowlin

Technical Writer, Collibra

Evan Nowlin is devoted to ensuring an excellent user experience. Since he joined Collibra in 2021, he has led the documentation efforts on the Data Quality team and helped refine new workflows as part of the Pushdown initiative.
