AWS re:Invent Recap: Privacy, Governance, and Data Catalogs
After the action-packed week that was AWS re:Invent, it’s time to reflect on the lessons we learned and explore how we can act on them. Rather than focus on the flurry of news (you can always go to the AWS website to see all of the product announcements), we thought it best to summarize the conversations we had, to help coordinate our focus as we head into 2019.
Below are our summaries of some of those topics.
Data Catalogs are a New Focus for Enterprise Architects
For the last few years, there has been an increasing volume of conversations on driving data availability and usage through a catalog, typically within a single business unit or across a small group of passionate data evangelists and scientists. Based on our experience at re:Invent, however, it’s rapidly becoming an enterprise-wide challenge that the Enterprise Architect is asked to solve. The rationale for this is pretty simple: meeting the broader challenges of AI/ML or business imperatives like Customer 360 requires aggregating data from across the entire organization. But it also creates a series of challenges. This was a message reiterated throughout several speaking sessions at this year’s event.
The first challenge discussed was that any catalog solution needs to be open and supported by strong native integrations with the most relevant enterprise data platforms, and it must be easy to integrate with via a solid API layer that is supported by a wide ecosystem of implementation experts.
Second, automation of ingestion, including profiling and tagging, is paramount (this is the largest chasm to cross when setting up an enterprise catalog). Automation is meant to speed up the handling of the vast volumes of data that need to be added. There were many discussions on how AI/ML is advancing in this area.
The last key challenge was that the catalog needs to include not only data assets, but also all of the follow-on assets, including analytics dashboards, workbooks, and worksheets.
Data Governance Along with Catalogs, Accelerating (Multi-)Cloud Push
While there was a lot of buzz at the show around new ways to use data, including updates to Amazon QuickSight (BI), Amazon Forecast (ML), and Amazon SageMaker (ML), many of the conversations we had were on how to manage data and accelerate its movement so that it can be used in all of these developing areas. The challenge here is how to provide visibility across the data landscape, which is still a mix of on-premises and cloud (and, increasingly, multi-cloud).
While the value of a data catalog is recognized here as well, it is clear that companies have looked a bit deeper at the problem and at what is required to solve it. Discussions expanded to include how to manage workflows across the key players involved, which include Analysts, Data Engineers, and Data Stewards. This facilitation of collaboration between the business and the IT organization was viewed as imperative, since most tools used today are intended only for deep experts and are not for the faint of heart. Ultimately, companies are not getting the engagement and usage out of their lake/warehouse that they (or the business) expected.
Another point raised was that with the rise of Kubernetes and Docker, it’s increasingly easy to spin up siloed and highly temporary mini data lakes for ad-hoc data analytics or as machine learning data workbenches. For these ad-hoc data islands for data scientists, it is of the utmost importance to understand the data lineage and data provenance. The fact that applications and data are becoming increasingly “mobile” places more of a focus on lineage. This need is viewed as not only providing traceability from analytics down to the source, but also increasing visibility into who or what (people, processes, and tools) uses what data. Collibra has seen this visibility produce better-quality data because it works both proactively and reactively: it uncovers errors before they become a problem, and it helps resolve issues when they arise. This is one factor that improves the end user’s trust in the data, which, in turn, increases usage.
We are sure these areas will continue to come to the forefront, especially considering that AWS mentioned that more than 10,000 companies use AWS to build their data lakes.
Privacy Now a Focus of the CIO and CTO
The last big topic that came out of re:Invent was data privacy. With the advent of new regulations like GDPR and the California Consumer Privacy Act (CCPA), and the volume of press coverage of the latest public breaches, organizations are demanding a more systematic approach to data privacy than the ad-hoc way it is handled today. This shift was more pronounced at re:Invent since the CIO and CTO are now the individuals raising the question. The challenge is not that companies have been sitting idle. Rather, they have implemented a vast array of point technologies (and still some manual processes) across privacy, data protection, architecture, ontologies, policies, etc.; yet, there is no system of record across these areas. We see 2019 as being a year where “privacy by design” becomes the model organizations adopt to provide a true record of processing activities and point of integration.
So, what should the focus be as we head into 2019 based on this insight?
- Catalogs will be a critical capability in the new year, but they will be enterprise in scope, creating a broader set of requirements.
- Analytics and AI/ML in the cloud will drive the need for broader access to data. Data catalog and governance, including stewardship, workflow, and lineage, will serve as a key foundation to accelerate adoption.
- To allow this data transformation to flourish, privacy by design will need to be an adopted model going forward.
- Collaboration and communication across all levels of the business around the data will continue to take hold. Targeted user experience templates and crowdsourcing will increasingly appear in solution requirements.