Cloud data platforms have become a key component of enterprise data architectures, playing a central role in many organizations’ digital transformation strategies. That is not simply because they offer multipurpose, scalable and cost-effective storage. It is because they foster more agile data operations, cut through siloed architectures and unlock the potential of artificial intelligence and machine learning to drive new trusted business insights.
Those core benefits help address many of the challenges that companies face in executing their digital transformation programs. Across all industry verticals, companies are collecting much greater volumes of data, which points to the need for scalable, cost-effective solutions. They are also collecting a greater variety of data (including structured, semi-structured and unstructured data) that is difficult to describe via a single schema. This points to the need for multipurpose storage. Most importantly, they need to quickly derive insights from these diverse datasets, which points to the need for agile data operations and sophisticated analytics (particularly AI/ML capabilities).
What do we mean by a cloud data platform?
As with many terms in the world of enterprise technology, ‘cloud data platform’ can be somewhat ambiguous, so we will start by defining exactly what we mean by it. From a functional perspective, a ‘cloud data platform’ encompasses all the tools an enterprise needs to collect, process, store, analyze and visualize data. While Google Cloud Platform’s (GCP) capabilities are always evolving as new services are launched, here are some insights into the tools currently on offer in each of those categories:
Data collection: cloud data platforms need to aggregate data from a number of different sources, including real-time updates and batch transfers. GCP offers the ability to on-board streaming data via Pub/Sub and IoT Core services, and provides a range of batch upload options via Data Transfer.
Data processing: data on-boarded from source systems typically needs to be pre-processed before it is stored to support further analysis. GCP offers a range of tools to support these processes, including Dataflow for streaming data, Dataproc for Hadoop/Spark stacks, Data Fusion (for integrating data from multiple sources) and Dataprep (for data wrangling).
Data storage: most enterprises need a combination of data lake and data warehouse technologies to support their business intelligence and data science teams. Data lakes like Google Cloud Storage are required to accommodate a full variety of data types – particularly unstructured data sources, but also structured data in its raw form (before it has been pre-processed). This versatility lends itself to several use cases. Data lakes serve as a repository for raw source data, a staging area for data as it is prepared for further analysis, a central hub for self-service business intelligence – or a combination of those functions. Data warehouses, on the other hand, serve as a central hub for structured data that has been processed into a common schema and is therefore ready for further analysis. GCP’s data warehouse solution BigQuery is particularly relevant for scalable enterprise analytics. Its distributed architecture not only offers highly available and durable storage, but also helps to support query performance and scalability.
It is important to note that processing unstructured data typically yields outputs that are structured by nature. Take the example of a firm looking to analyze customer conversations collected via multiple channels (call centre audio files, email and instant messaging). The textual content of those conversations (once audio files are transcribed using speech-to-text) is unstructured by nature and would not necessarily be ingested directly into a data warehouse. However, natural language processing engines can be used to score those conversations and determine levels of customer satisfaction. In doing so, raw unstructured data is turned into structured data that can be further analyzed in combination with other sources as part of broader investigations into customer churn.
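The conversation-scoring example above can be sketched in a few lines. This is a deliberately toy keyword-based scorer standing in for a managed NLP engine (such as a speech-to-text plus natural language pipeline); the word lists, function names and record fields are all illustrative assumptions, not any vendor's API.

```python
# Toy stand-in for an NLP sentiment engine: turns unstructured transcript
# text into a structured, analyzable record. Word lists are illustrative.
POSITIVE = {"great", "helpful", "resolved", "thanks"}
NEGATIVE = {"frustrated", "cancel", "unresolved", "waiting"}

def score_conversation(channel: str, transcript: str) -> dict:
    """Return a structured satisfaction record for one conversation."""
    words = transcript.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return {"channel": channel, "satisfaction": score}

rows = [
    score_conversation("call_centre", "Thanks the agent was great and resolved it"),
    score_conversation("email", "Still waiting and frustrated I may cancel"),
]
```

The resulting rows are ordinary structured records, ready to be loaded into a warehouse and joined with other customer data, which is the point of the unstructured-to-structured transformation described above.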
Data analysis: once data has been collected, processed and stored in its required structure, it is ready for further analysis. This can take a number of shapes – from a simple query to more complex calculated metrics and analytics, all the way through to machine learning models designed to detect new patterns or drive predictions. BigQuery offers a full range of analytical functions, including specialized capabilities for streaming and geospatial analysis, as well as machine learning.
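As a minimal illustration of the "calculated metrics" end of that spectrum, the sketch below computes a per-channel average satisfaction score in plain Python; in practice this is exactly the kind of aggregation a warehouse query (for example, BigQuery SQL with a GROUP BY) would produce. The record shape is an assumption for illustration.

```python
from collections import defaultdict

def average_satisfaction(records):
    """Per-channel average: the kind of calculated metric a GROUP BY
    query in a data warehouse such as BigQuery would return."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["channel"]].append(r["satisfaction"])
    return {ch: sum(v) / len(v) for ch, v in buckets.items()}

records = [
    {"channel": "email", "satisfaction": 4},
    {"channel": "email", "satisfaction": 2},
    {"channel": "chat", "satisfaction": 5},
]
metrics = average_satisfaction(records)
```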
Data visualization: once data has been analyzed, it needs to be presented in a visually intuitive manner. From simple line and bar charts through to more complex geospatial visualizations, the key for most business intelligence tools is to clearly display patterns and make data easier to understand. Google recently acquired Looker to serve as GCP’s in-house business intelligence solution, and also partners with a range of third parties, including Tableau and Qlik.
Why does data need governance on a cloud platform?
Cloud data platforms have proved very effective at fostering agile data operations. That is not only because they negate the need to manage hardware, but also because of the variety of tools and depth of automation available to support the processes just covered.
However, some of the core benefits offered by cloud data platforms can also result in unintended consequences. The ease with which organizations can store greater volumes and a broader variety of data at a lower cost has the potential to contribute to poor housekeeping. Without proper governance, organizations are likely to run into challenges relating to:
- Data Quality: a lack of ownership and accountability can lead to poor control over source data, resulting in high levels of overlap or redundancy. A lack of contextual information also makes it hard to ascertain which source is the most complete, accurate or current.
- Data Discovery: cloud data platforms hold great promise for business analysts and data scientists. Being able to access enterprise data assets from a single location may sound like a panacea. But without contextual information, users cannot know which sources to choose, how to interpret the data, or whether to trust its accuracy.
- Compliance: poor data governance can also pose risks. Most organizations face a myriad of rules that impact the way they manage data. Some regulations, like GDPR and CCPA, provide data subjects with greater rights over data (such as the right to have their personal information deleted) and place an onus on organizations to uphold those rights. Other rules set by industry regulators and tax authorities require organizations to retain data for audit purposes. Faced with an ever-more complex set of regulations, organizations need to take a data-centric approach to compliance – knowing where all sensitive information is stored, which policies apply to each dataset, what type of processing is permitted and how access should be controlled.
Addressing these challenges is precisely what has brought the concept of a ‘governed’ cloud data platform to the fore and has been the driving force behind the partnership between Collibra and Google Cloud Platform (GCP).
How do you ensure data on a cloud platform is well governed?
To unlock the benefits that cloud data platforms offer and mitigate any unintended consequences, organizations need to ensure that data and analytics are properly governed. This is where Collibra excels. Collibra offers a collaborative platform that enables organizations to better govern their information assets, fostering trust in data and making it easier to discover and understand.
To explain what we mean by governance, we have highlighted four strategies below that are key to the successful implementation of a governed cloud data platform:
Controlled ingestion: To ensure a cloud data platform is fed with trustworthy data, it is important that source data is properly governed and registered in a data catalog before being ingested. Capturing relevant metadata will help end users know exactly what data is available, what it means (providing business context), where it has been sourced, as well as how accurate, complete and consistent it is.
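Controlled ingestion can be sketched as a simple gate: a source is only registered in the catalog (and therefore eligible for ingestion) if its required metadata is present. The required-field list, class names and example dataset below are illustrative assumptions, not a Collibra or GCP API.

```python
from dataclasses import dataclass

# Illustrative governance policy: every source must record these fields.
REQUIRED_FIELDS = ("owner", "description", "origin")

@dataclass
class CatalogEntry:
    name: str
    metadata: dict

class DataCatalog:
    """Toy catalog: sources must carry required metadata before ingestion."""
    def __init__(self):
        self.entries = {}

    def register(self, entry: CatalogEntry) -> bool:
        missing = [f for f in REQUIRED_FIELDS if f not in entry.metadata]
        if missing:
            raise ValueError(f"cannot ingest {entry.name}: missing {missing}")
        self.entries[entry.name] = entry
        return True

catalog = DataCatalog()
catalog.register(CatalogEntry("crm_customers", {
    "owner": "sales-ops",
    "description": "Customer master records from the CRM",
    "origin": "salesforce_export",
}))
```

A real catalog would capture far richer metadata (lineage, quality scores, business glossary links), but the principle is the same: no metadata, no ingestion.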
Building metadata-driven data pipelines: Just as cloud architectures have enabled more agile development methodologies, they have also fostered more agile data operations. Capturing relevant metadata through the controlled ingestion process provides data scientists with the information they need to build their own data pipelines. Rather than wait on central teams to prepare data on their behalf, they are empowered to select the source data that meets their requirements and have it transformed into the right structure for their analysis.
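The self-service pattern above can be sketched as follows: a data scientist filters catalog entries by their recorded metadata (here, freshness and completeness) and then declares the transformation steps to apply. All field names, thresholds and step names are invented for illustration.

```python
# Illustrative catalog metadata a data scientist might query.
catalog = [
    {"name": "crm_customers", "freshness_days": 1, "completeness": 0.98},
    {"name": "legacy_contacts", "freshness_days": 90, "completeness": 0.62},
    {"name": "web_signups", "freshness_days": 0, "completeness": 0.91},
]

def select_sources(entries, max_age_days=7, min_completeness=0.9):
    """Pick only sources whose metadata meets the analyst's requirements."""
    return [e["name"] for e in entries
            if e["freshness_days"] <= max_age_days
            and e["completeness"] >= min_completeness]

# A metadata-driven pipeline spec: sources chosen by metadata, then
# transformed into the structure the analysis needs.
pipeline = {
    "sources": select_sources(catalog),
    "steps": ["deduplicate", "normalize_schema", "load_to_warehouse"],
}
```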
Certifying information assets: Data governance is a discipline that can be applied to more than just source data. It can be just as important to govern the components that facilitate data analysis – everything from specific queries, API calls, analytics and machine learning models, through to reports, worksheets, notebooks, dashboards and cubes. Registering these information assets in a catalog means business analysts can not only share their insights, but also the tools that generated those insights. This helps the entire organization to be more data intelligent – speeding the time taken to turn raw data into meaningful conclusions that can guide business decisions.
Access governance: Just as controls need to be placed over data ingested into cloud data platforms, they are also needed to govern how that data is provisioned to data consumers. Access governance needs to be supported by a granular understanding not only of datasets but also permitted use cases. Policies can then be configured to apply to specific data elements and/or categories, ensuring data is only provisioned under permitted circumstances. For example, each request might consider the purpose of the query, as well as the location and business unit of the data consumer — ensuring rules over data privacy and sovereignty are always complied with.
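The example in the paragraph above (checking purpose and location before provisioning data) can be sketched as a policy lookup. The category names, purposes and regions below are invented examples of how such rules might be encoded, not an actual policy engine's API.

```python
# Illustrative access policies keyed by data category. None means
# "no restriction" on that dimension.
POLICIES = {
    "pii": {"allowed_purposes": {"fraud_review"}, "allowed_regions": {"EU"}},
    "public": {"allowed_purposes": None, "allowed_regions": None},
}

def is_access_allowed(dataset_category, purpose, region):
    """Deny by default; allow only when every policy dimension passes."""
    policy = POLICIES.get(dataset_category)
    if policy is None:
        return False  # unclassified data is denied by default
    if (policy["allowed_purposes"] is not None
            and purpose not in policy["allowed_purposes"]):
        return False
    if (policy["allowed_regions"] is not None
            and region not in policy["allowed_regions"]):
        return False
    return True
```

Note the deny-by-default stance: unclassified data is never provisioned, which is what makes the granular classification described above a prerequisite for access governance.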
For more information
Across every sector of the economy, organizations are looking to make data-driven decisions to improve business outcomes. Cloud data platforms offer a powerful set of tools to support those data-driven insights. However, without proper governance of data and information assets, organizations will run into challenges with respect to data quality, data discovery and compliance.
Watch this webinar to learn how ATB Financial leveraged Google Cloud and Collibra to accelerate their digital transformation.
To find out more about how Collibra and Google partner, check out our partner page.