In today’s data governance landscape, it is common to see cloud providers pitch a single platform for data quality: an integrated offering spanning the data plane (storage), the compute plane (virtual machines), and the control plane (ETL tools). The promise is that organizations can avoid relying on a patchwork of individual products (sometimes referred to as a “museum of technologies”) that typically fail to integrate well and often require extensive customization.
But before you jump on board and fully embrace this “one size fits all” approach, it is important to note some limitations. Most importantly, it confines organizations to a single cloud. You will be trading one museum of disparate technologies for another, often at significant cost. And while this may be attractive to some companies, you will find it incredibly limiting if your organization has embraced a multi-cloud strategy.
So, before you move to standardize your data quality approach with a single cloud provider, there are a few key questions you should ask yourself.
Some basic analysis of market data and pricing suggests that a cloud-provider DQ solution could cost you roughly 66% more at initial kick-off, and add complexity (and therefore cost) to routine maintenance of your DQ program. This cost, summarized below, stems from the complex cloud-integration services and “name-brand” support that only the single-cloud approach makes necessary.
| Cloud-provider DQ Solution | Data Management DQ Solution |
| --- | --- |
| Required for alternate cloud integrations | Minimal support due to cloud-agnostic product |
| Required given technical nature of DQ solution | Involved in DQ program, not required |
| Ambiguous workflows driving maintenance | Reduced focus with integrated Data Governance |
| Costly endeavor as branded Cloud Provider | Cheaper with focus on installing software |
In addition to the clear financial implication, consider these other strategic qualifiers.
How will this single-cloud data quality solution work with source systems that do not reside with this cloud provider?
Typically, organizations see two types of data quality issues: those originating in source systems and those introduced between systems. To mitigate them, organizations can apply scanning and governance at different points in the process (between systems or at the source). With cloud-hosted DQ, you may find a push to focus between systems or on data at final rest. Consider whether that is architecturally the best approach or simply a limitation of the technology.
To illustrate this, let’s take the example of a CRM tool supporting salespeople. Here, is it better to start scanning for data quality at the source or scan at the final place of rest and trace the issue back to the originating source system? To scan data in motion is a reasonable approach. But you could also argue that scanning operational data might slow your team down. In this case, should you scale up your compute? And by how much?
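As a minimal sketch of what “scanning in motion” versus “scanning at rest” means in practice (the record shape, required fields, and 5% tolerance here are hypothetical, not any vendor’s API):

```python
def check_record(record: dict, required: set[str]) -> list[str]:
    """In-motion check: validate a single record as it flows between systems.

    Returns the list of required fields that are missing or empty.
    """
    return [f for f in required if record.get(f) in (None, "")]

def check_batch(records: list[dict], required: set[str], max_bad: float = 0.05) -> bool:
    """At-rest check: scan the landed dataset and compare against a tolerance."""
    bad = sum(1 for r in records if check_record(r, required))
    return bad / len(records) <= max_bad

# Hypothetical CRM records: the second is caught by either approach,
# but only the in-motion check catches it before it lands.
records = [
    {"account_id": "A1", "email": "a@example.com"},
    {"account_id": "A2", "email": ""},
]
print(check_batch(records, {"account_id", "email"}))  # prints False: 1 of 2 bad exceeds 5%
```

The trade-off in the paragraph above shows up directly here: the per-record check adds latency to every operational write, while the batch check only tells you about bad data after it has already landed.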
Further, when you do identify DQ issues, do you solve them by fixing your data locally or by addressing the source system that creates them? Fixing locally might feel like an easy ETL job, but it leaves stagnant source systems to fester and keep producing bad data. These questions, and their answers, are likely where your organization will focus its optimization as it builds its DQ ecosystem.
Because your source systems may not be cloud-ready, your cloud provider may not support every type of source. You may be forced to move your data into their cloud and rely on lineage to trace issues back to the bad source. Given the nature of proprietary data and the laws around privacy, would you even be able to convince your vendors or local regulators to allow your sole cloud provider’s technology?
If we go back to the time-value-of-money principle in program management, the value of a cloud-agnostic DQ technology becomes clear. Which would you rather have: spending more money while trying to meet more stringent requirements, or achieving faster DQ time-to-market?
Is their technology truly best-in-class or an afterthought to another focus?
When products offer DQ rules or checks, one should always ask how that DQ check originates. In most cases, companies will pitch proprietary software that locks you into their technology. Further, the pitched “rule generation” is almost always partial automation. Collibra Data Quality & Observability recognizes that black box DQ may not be feasible given your business, but we strive to deliver as close to a fully automated solution as possible.
- On profiling, cloud providers may sell discovery but neglect to mention that you are still writing your own rules and doing extensive configuration, even just to check a column’s type. What are they doing to generate rules for the most common DQ breaks? Today’s growing data environments cannot scale if the most basic checks must be written manually, even if only once. Collibra Data Quality & Observability offers a robust solution to that dilemma.
- On rules, cloud providers may push for leveraging their ETL or scripting software to execute rules but not discuss the UX. Who is really writing your rules? In a robust data ecosystem, rule writing is more often than not a complex interaction between the business user who knows the data system and the data steward who helps facilitate the process. To presume that every rule writer will know how to write a code snippet, let alone SQL, diminishes your organization’s ability to scale DQ. Collibra Data Quality & Observability champions the UX with low-code rule writing, in addition to rule discovery for analysts who want to leverage the code already written.
- On monitoring, cloud providers may offer systems to access the results and create a workflow, but fall short of delivering one that is seamless to both the user and the auditor. Have you considered how you would explain your end-to-end solution to an external auditor who may be familiar with your technology but is definitely not a Data Governance persona? Your DQ program will impact your organization’s decisions, and a deep dive is just a matter of time. At Collibra, we believe your process should produce metrics and transparency that speak to both Data Governance and executive personas.
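To make the profiling and rule-writing trade-off above concrete, here is a minimal sketch (with hypothetical column statistics and thresholds, not any vendor’s implementation) of how an automated profiler can turn observed data into generated SQL checks, the kind of work a manual-first platform leaves entirely to the rule writer:

```python
from dataclasses import dataclass

@dataclass
class ColumnProfile:
    """Hypothetical summary statistics gathered during profiling."""
    table: str
    column: str
    null_fraction: float      # observed share of NULL values
    distinct_fraction: float  # distinct values / row count

def generate_rules(profile: ColumnProfile) -> list[str]:
    """Turn a profile into candidate SQL checks for common DQ breaks."""
    rules = []
    # If the column was (almost) never NULL historically, flag new NULLs.
    if profile.null_fraction < 0.01:
        rules.append(
            f"SELECT COUNT(*) FROM {profile.table} WHERE {profile.column} IS NULL"
        )
    # If values were effectively unique, flag duplicates.
    if profile.distinct_fraction > 0.99:
        rules.append(
            f"SELECT {profile.column}, COUNT(*) FROM {profile.table} "
            f"GROUP BY {profile.column} HAVING COUNT(*) > 1"
        )
    return rules

# Hypothetical profile of a CRM key column: never null, always unique.
profile = ColumnProfile("crm.accounts", "account_id",
                        null_fraction=0.0, distinct_fraction=1.0)
for rule in generate_rules(profile):
    print(rule)
```

Even this toy version illustrates the scaling argument: two basic checks per column, across thousands of columns, is far too much SQL to expect business users and stewards to write by hand.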
Do you want to be stuck in the legacy DQ process?
We appreciate that DQ now has a lifecycle to help architect your organization’s process. Cloud providers would likely agree that this cycle breaks down into some form of connecting, profiling, specific checks, monitoring, and action. In fact, this lifecycle aligns with Collibra’s model for achieving minimal viable data trust. Beyond that lifecycle, beware a cloud provider that prescribes a process with suggestions on ownership and stewardship, effectively mandating their workflow. A suggested persona model helps any technology, but prescribing one can imply that the technology is not flexible enough to support your organization’s needs. What if you don’t have a robust Data Steward team, or an “army” of rule writers? How could you manage a technology that forces its persona model on you while your organization is still building its process?
Ultimately, do you want to be DQ cloud-dependent or cloud-agnostic?
Your current cloud provider may be a great bet, but every major organization seeks competitive advantage and wants to avoid getting locked into any single provider. What happens if one provider becomes a monopoly and is forced to break up its organization (reminiscent of Microsoft)? Could you imagine if your organization were still stuck on Lotus Notes for email? Do you really want your metadata held hostage? How would you stay competitive in the face of such possible disruption? Is the single-cloud solution going to be your friend or foe? Consider these points and make a decision: do you want your DQ cloud-dependent or cloud-agnostic?