A recent Gartner survey of CDOs showed that their primary responsibility is analytics. Yet users report that most analytics initiatives have not lived up to their promise. This “analytics gap” has several causes, but it is exacerbated by the inability of data scientists and other data professionals to find the right data. Helping them find the right data for analytics should be one of the primary functions of any data governance initiative.
Finding the right data is difficult. Most organizations are drowning in data, and even if you know you have the data, pulling the bits you need out of the huge collections that live across your organization is a daunting task. Unfortunately, there is no magic bullet, or magic machine learning algorithm, that can do this. While algorithmic approaches hold a lot of promise, they are still in their infancy and should be thought of as a means of assisting subject matter experts. The good news is that with the proper infrastructure, you can harness the expertise of data citizens around the organization to organize your data, and use that infrastructure to enable them to share their work.
The first step is to create a data catalog that organizes useful collections of data across existing boundaries. Whether those boundaries are systems, organizations, or geographies, it is the cross-boundary visibility that drives many of the more significant insights from that data.
This data catalog should help experts organize data in three ways:
1. The data catalog should provide an assisted mechanism to link data to meaning. The business terms, rules, processes, KPIs, etc. that are related to some data are the real information used to determine whether that data is a good fit for the analysis in question. Different views of the data also provide different aspects that may feed that analysis. For example, understanding the subtle differences between the backend financial transaction view and the website interaction view can give vital clues to buying behavior. Each of these views reflects the same activity, but they have different meanings, and it is those differences that help create accurate predictions.
2. The data catalog should suggest the work of peers who may be working on similar analyses. It is often the case that several people are working on different aspects of the same problem and may be able to leverage each other’s data sets. The catalog should recognize when the work you are doing is similar to a colleague’s, and point you to data sets that they may have already created. This will simplify the process of finding relevant and comparable data.
3. The data catalog should suggest other data that might be related to data you have organized into a data set. These suggestions help speed the process of determining what data to include. This is challenging, as there are many ways of determining this kind of similarity. But at the very least, the catalog should detect when data sets share either structural relationships or documented semantic relationships.
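The third capability can be illustrated with a minimal sketch. It is not how any particular catalog product works; it simply assumes that each data set is described by its column names (a stand-in for structure) and by the business terms it has been linked to (a stand-in for documented semantics). The `DataSet` class, the `suggest_related` function, the 0.3 threshold, and all sample names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class DataSet:
    """Hypothetical catalog record: a name, its columns, and linked business terms."""
    name: str
    columns: FrozenSet[str]
    business_terms: FrozenSet[str] = field(default_factory=frozenset)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Share of items two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_related(
    target: DataSet, catalog: List[DataSet], threshold: float = 0.3
) -> List[Tuple[str, float, List[str]]]:
    """Suggest data sets that overlap structurally or share a documented term."""
    suggestions = []
    for candidate in catalog:
        if candidate.name == target.name:
            continue
        structural = jaccard(target.columns, candidate.columns)
        shared_terms = target.business_terms & candidate.business_terms
        # Assumption: either kind of relationship is enough to warrant a suggestion.
        if structural >= threshold or shared_terms:
            suggestions.append(
                (candidate.name, round(structural, 2), sorted(shared_terms))
            )
    # Strongest structural overlap first, then alphabetical for stable output.
    return sorted(suggestions, key=lambda s: (-s[1], s[0]))


# Illustrative sample data only.
transactions = DataSet(
    "finance_transactions",
    frozenset({"customer_id", "order_id", "amount", "currency"}),
    frozenset({"Customer", "Revenue"}),
)
catalog = [
    transactions,
    DataSet(
        "web_interactions",
        frozenset({"customer_id", "session_id", "page_url", "timestamp"}),
        frozenset({"Customer", "Buying Behavior"}),
    ),
    DataSet(
        "finance_refunds",
        frozenset({"customer_id", "order_id", "amount", "reason"}),
        frozenset({"Revenue"}),
    ),
    DataSet(
        "hr_headcount",
        frozenset({"employee_id", "department", "start_date"}),
        frozenset({"Headcount"}),
    ),
]

related = suggest_related(transactions, catalog)
for name, overlap, terms in related:
    print(name, overlap, terms)
```

In this sketch the website interaction data is suggested even though its structure barely overlaps with the transaction data, because both are linked to the same business term. That mirrors the point that meaning, not just structure, is what makes data relevant. A subject matter expert would still decide whether each suggestion belongs in the data set.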
Discovering and organizing the data cannot occur in a vacuum. The rest of the governance capability must stand behind it, or users will not trust the data. They need to know its lineage, its quality, and the organizational responsibilities for that data in order to use it with confidence. Unfortunately, many of the data catalogs that exist today do not have these capabilities. They present lists of information, but do not have ties to the governance of that information. Without that tie, users cannot distinguish between the various copies and states of the data. They also have no idea which collections of data adhere to which policies, which have what quality, and so on. Nor can they determine how the data is used today and how it is formally intended to be used (which are often different things).
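As a rough illustration of what tying a catalog to governance could mean in practice, the sketch below attaches an owner, lineage, a quality score, and policies to each catalog entry and reports which of them are missing. The `CatalogEntry` fields and the `governance_gaps` function are assumptions made for this example, not the schema of any real catalog.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    """Hypothetical catalog entry carrying governance metadata."""
    name: str
    owner: Optional[str] = None                        # organizational responsibility
    lineage: List[str] = field(default_factory=list)   # upstream sources
    quality_score: Optional[float] = None              # e.g. share of rows passing checks
    policies: List[str] = field(default_factory=list)  # policies the data adheres to


def governance_gaps(entry: CatalogEntry) -> List[str]:
    """List the governance facts a user would need but cannot find."""
    gaps = []
    if not entry.owner:
        gaps.append("owner")
    if not entry.lineage:
        gaps.append("lineage")
    if entry.quality_score is None:
        gaps.append("quality")
    if not entry.policies:
        gaps.append("policies")
    return gaps


# Illustrative entries: a governed source and an undocumented copy of it.
governed = CatalogEntry(
    name="customer_master",
    owner="Customer Data Steward",
    lineage=["crm_export"],
    quality_score=0.97,
    policies=["Customer Privacy Policy"],
)
untracked_copy = CatalogEntry(name="customer_master_copy")

governed_gaps = governance_gaps(governed)
copy_gaps = governance_gaps(untracked_copy)
print(governed.name, governed_gaps)
print(untracked_copy.name, copy_gaps)
```

With metadata like this, a user can tell the governed source apart from an undocumented copy; a catalog that only lists names cannot make that distinction.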
Furthermore, most people do not trust the quality of their data. Having quantitative quality results in an understandable format helps them evaluate that quality, and therefore the data’s suitability for use in the data set. People also naturally have a very hard time trusting something that is entirely controlled elsewhere. A direct connection to a data helpdesk is crucial, so that data users know any problems they find can easily be fixed. These links to governance are not optional features of a catalog, but essential to its proper functioning. Without them, the catalog might help you group, organize, or share, but it will not assist the development of new and useful analytics, because it will not have gained the trust of the consumers of the data.
As your organization builds out its analytical capabilities, creating a data catalog is a powerful step. To deliver value, this data catalog needs to support the three capabilities that truly help users and developers of analytics find the data. It also must be fully integrated into the data governance process and exhibit all of the governance capabilities (policies, quality, lineage, usage traceability, and repair) that create trust. Only in this way will you be able to close the analytics gap and deliver true insight to your organization.