Automated Assembly: Distributed Query, Access Management and 360-degree Views for the Data Intelligence Journey
We are tasked with assessing why a company experiences a high rate of customer churn. How do we drive for an outcome that is accurate, actionable and low effort so the company can prevent future churn? We’re on a Data Intelligence journey in 12 steps. In Part 4 of this five-part series, we will cover the next three steps in the Data Intelligence Journey:
- Step 9 — Service Broker: A distributed and federated query and extraction engine that pulls the requested and authorized data from selected databases and systems and transfers encrypted data to a specified location
- Step 10 — Access Management: The enforcement of identity and access management policies (Step #3) to the extracted data prior to delivery and consumption
- Step 11 – Compositing: The process of merging two or more extracted (customer) records that are determined (in Step #7) to be the same (person) into a single, golden record prior to delivery and consumption
Our journey continues
Cliff, a business analyst, has been tasked with trying to find out why his company is experiencing a concerning trend of high customer churn. Given the large numbers at stake, the company must move quickly, but they must accurately uncover the root cause and prescribe an action plan that works. Inaction is bad. But the wrong action is worse. The answer is in the data. But what data?
This is the foundation of Data Intelligence. Trusted data belongs to every knowledge worker and should flow through the organizational ecosystem in such a way as to let business professionals connect, communicate and collaborate in every way they need and choose. We’re with Cliff as he tries to find solutions to this real-world problem.
First we built the foundation for a strategic Data Intelligence program. That led us to stocking the shelves in a highly organized way and with easy-to-use context. And then we prepared to pivot from a supply focus to a demand focus before we open the doors to our digital store. This is where professionals can use trusted data to:
- Conduct research
- Analyze patterns
- Identify problems and opportunities
- Collaborate with colleagues and partners
- Ensure security and compliance
But let’s remember that our target audience is Cliff, so the shopping experience metaphor that underpins this journey must respect the skills and capabilities of its users. Let’s continue our journey.
Step 9: Service Broker
Reflecting back on Steps 4–6 (Part 2), we shepherded Cliff through the discovery and selection of the data sets best suited to his churn analysis. Cliff was guided through the Data Intelligence graph and presented with options that were highly organized (classification). Each option came with a clear indication of where the data came from, what happened to it along the way and where else it goes or is used (lineage and transformation), along with the detailed differences between the available options (profiling and scoring). Cliff was able to quickly and confidently fill his shopping cart with the most appropriate data sets for his analysis. Now Cliff is ready to continue his self-service data shopping experience through an automated check-out process.
Each of the data sets in Cliff’s shopping cart is linked to a Data Owner, the person or department within an organization with overall responsibility for establishing and enforcing policies around access and use of the data set. Cliff’s request, as part of the check-out process, for one or more data sets must be accompanied by the intended purpose and use, dates of access and use and how he would like to take delivery. There are three primary ways for Cliff to take delivery of requested data sets:
- Borrow – A temporary method of accessing and using the data with a virtualized (physical copy not stored) technique. This delivery method is most consistent with Analytics, Reporting and Algorithm training
- Lease – A time-boxed method of extracting and loading data into an analytic data repository for subsequent access and use. This delivery method is most consistent with use cases requiring the combination with data not available in the Catalog and/or an Analytics platform that does not support virtualization
- Buy – A move of an unbounded copy of a data set by extracting and loading data into an analytic data repository for subsequent access and use. This delivery method is most consistent with migrating data from a legacy environment to a next-generation environment
Once Cliff has supplied all required information for check out, the request is routed to each of the Data Owners linked with the data sets requested by Cliff.
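To make the check-out request concrete, here is a minimal sketch of what it might carry and how it could be routed. The class, field and data set names are illustrative assumptions, not the platform's actual model.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Delivery(Enum):
    BORROW = "borrow"  # virtualized access, no physical copy stored
    LEASE = "lease"    # time-boxed copy in an analytic repository
    BUY = "buy"        # unbounded copy, e.g. for a migration


@dataclass(frozen=True)
class CheckoutRequest:
    """Hypothetical shape of the information Cliff supplies at check-out."""
    requester: str
    data_sets: tuple
    purpose: str
    access_start: date
    access_end: date
    delivery: Delivery


def route_to_owners(request, owner_of):
    """Group the requested data sets by Data Owner: one approval task per owner."""
    tasks = {}
    for name in request.data_sets:
        tasks.setdefault(owner_of[name], []).append(name)
    return tasks


# Hypothetical catalog links between data sets and their Data Owners.
OWNER_OF = {
    "crm_customers": "Sales Ops",
    "billing_accounts": "Finance",
    "support_tickets": "Sales Ops",
}

request = CheckoutRequest(
    requester="cliff",
    data_sets=("crm_customers", "billing_accounts", "support_tickets"),
    purpose="customer churn analysis",
    access_start=date(2024, 1, 1),
    access_end=date(2024, 3, 31),
    delivery=Delivery.LEASE,
)
tasks = route_to_owners(request, OWNER_OF)
for owner, names in tasks.items():
    print(owner, names)
```

In this sketch each Data Owner receives a single task covering all of the data sets they are responsible for, which is one plausible way to keep approval queues short.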
Data Owners, leveraging embedded workflow capabilities, may elect to deliver their Data Use Agreement (DUA) decisions manually or automatically. And if the Data Owner’s policy requires that Cliff acknowledge the DUA upon first use or every use, regardless of manual or automatic decision, Cliff will be presented with a Task in his work queue for such acknowledgment, when appropriate. After all approvals and acknowledgments have been made, Cliff is now authorized to check out.
Making great use of all the metadata and mappings from the act of cataloging and classifying physical data sets (Step 4), the Data Intelligence platform can generate all the precise instructions (query language) necessary to efficiently extract the data that Cliff is authorized to use at the Edge where the physical data resides. If multiple data sets are requested by Cliff and they reside in physically different data centers, these instructions will be pulled by the appropriate Edge components for execution. Each Edge component will then connect, authenticate and extract data from the underlying database or system using the precise instructions generated by the Data Intelligence platform. This extracted information can then be encrypted and transmitted to the requested destination (e.g., Borrow to a cloud-enabled, Elastic container or Lease to an S3 folder to be loaded into BigQuery).
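As a rough illustration of how catalog metadata can drive query generation, the sketch below builds one SELECT statement per data set and groups the statements by the Edge component that would run them. The catalog layout, table names, column names and Edge names are all hypothetical, and a real platform would generate far richer instructions.

```python
# Hypothetical catalog entries: where each data set lives and how its
# physical columns map to canonical attribute names.
CATALOG = {
    "crm_customers": {
        "edge": "dc-east",
        "table": "crm.cust_master",
        "columns": {
            "customer_id": "cust_no",
            "last_name": "lname",
            "signup_date": "created_dt",
        },
    },
    "billing_accounts": {
        "edge": "dc-west",
        "table": "fin.accounts",
        "columns": {"customer_id": "acct_cust", "balance": "bal_amt"},
    },
}


def build_query(data_set, authorized):
    """Generate a SELECT that exposes only the authorized canonical attributes."""
    entry = CATALOG[data_set]
    select = [
        f"{physical} AS {canonical}"
        for canonical, physical in entry["columns"].items()
        if canonical in authorized
    ]
    if not select:
        raise ValueError(f"no authorized attributes for {data_set}")
    return f"SELECT {', '.join(select)} FROM {entry['table']}"


def plan_by_edge(data_sets, authorized):
    """Group generated queries by the Edge component that will pull and run them."""
    plan = {}
    for data_set in data_sets:
        edge = CATALOG[data_set]["edge"]
        plan.setdefault(edge, []).append(build_query(data_set, authorized))
    return plan


plan = plan_by_edge(
    ["crm_customers", "billing_accounts"],
    authorized={"customer_id", "signup_date", "balance"},
)
for edge, queries in plan.items():
    print(edge, queries)
```

Note that `last_name` never appears in the generated SQL because it is not in the authorized set, so the attribute is excluded before any data leaves the source.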
Each set of query instructions, one per data set, will be run to completion, delivering the requested data to the requested destination. When two or more data sets are requested, each data set will be delivered to the same destination (e.g., Elastic container, S3 folder, etc.) and fused into its canonical format (Step 2) yet retain its mark of origin/provenance. In this way, all data that is brought together shares the format and shape of the logical or canonical model regardless of how different or unique each underlying data set may be. Presto chango, Cliff has what he asked for at the click of a button and all because of the investment you made in the foundation building, stocking the shelves and providing easy-to-use clues for anyone in your organization.
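The fusing step can be pictured with a small sketch: rows from differently shaped sources are renamed into one canonical shape, and each row keeps a marker of where it came from. The mappings, column names and the `_source` field are assumptions made for illustration only.

```python
# Hypothetical mappings from physical column names to canonical attributes.
MAPPINGS = {
    "crm_customers": {"cust_no": "customer_id", "lname": "last_name"},
    "billing_accounts": {"acct_cust": "customer_id", "surname": "last_name"},
}


def to_canonical(row, data_set):
    """Rename a row's columns to the canonical model and stamp its provenance."""
    canonical = {
        target: row.get(physical)
        for physical, target in MAPPINGS[data_set].items()
    }
    canonical["_source"] = data_set  # mark of origin travels with the record
    return canonical


def fuse(extracts):
    """Bring rows from every extracted data set into one canonical collection."""
    return [
        to_canonical(row, data_set)
        for data_set, rows in extracts.items()
        for row in rows
    ]


extracts = {
    "crm_customers": [{"cust_no": 17, "lname": "Lee"}],
    "billing_accounts": [{"acct_cust": "A-17", "surname": "Lee"}],
}
fused = fuse(extracts)
for record in fused:
    print(record)
```

Every fused record has the same keys regardless of its source, which is what lets later steps treat the combined data uniformly.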
Step 10: Access Management
Before we can deliver the extracted information to Cliff, it is vital to make sure that identity and access management rules are adhered to and that the principles of any applicable regulations are enforced. When a data access policy is straightforward and can be addressed at the time of extraction (e.g., Social Security Number must be masked or removed) without awareness of any other requested data, it can and should be enforced as part of the extraction instructions, where it is most efficient. However, most data will be extracted through a direct (e.g., JDBC) connection and thus bypass any possible use of single sign-on (SSO) and role-based access control (RBAC) at the application level. Further, some enforcement cannot be achieved prior to or at the time of data extraction. For example, some attributes by themselves (e.g., Last Name or Date of Birth) are not considered Personally Identifiable Information (PII); however, in combination with one or more other attributes (e.g., Last Name + Date of Birth), they can become PII. Thus, which identity attributes were requested and are actually available can only be determined after data extraction, and only then can the corresponding access policies be enforced. In sum, Access Management policies should be applied at the most efficient and appropriate time during the process: (1) the generation of the instructions may exclude specific attributes or tables, (2) data can be fully or partially masked as part of the extraction, or (3) assessment of all extracted attributes for a record may result in removal and/or additional masking of attributes.
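The post-extraction case, where attributes only become PII in combination, can be sketched as follows. The policy sets, attribute names and the choice to mask every member of a matching combination are assumptions for illustration; an actual policy might instead remove attributes or mask only some of them.

```python
MASK = "***"

# Hypothetical policies, assumed for illustration.
ALWAYS_MASK = {"ssn"}                                  # enforceable per attribute
PII_COMBINATIONS = [{"last_name", "date_of_birth"}]    # PII only when together


def enforce(record):
    """Apply per-attribute masking, then check combinations across the record."""
    out = dict(record)

    # Per-attribute rules need no knowledge of other attributes, so they
    # could equally be pushed into the extraction instructions.
    for attr in ALWAYS_MASK:
        if attr in out:
            out[attr] = MASK

    # Combination rules can only be evaluated once the full record is known.
    present = {attr for attr, value in out.items() if value is not None}
    for combo in PII_COMBINATIONS:
        if combo <= present:
            for attr in combo:
                out[attr] = MASK
    return out


both = enforce({"last_name": "Lee", "date_of_birth": "1980-02-14", "ssn": "123-45-6789"})
alone = enforce({"last_name": "Lee", "date_of_birth": None, "city": "Austin"})
print(both)
print(alone)
```

In the second record the last name survives unmasked because no date of birth accompanies it, which mirrors the point that the combination, not the individual attribute, triggers the policy.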
The purpose of Access Management in the context of this 12-step Journey is to filter, remove, mask or in some other way enforce the access policies described with Governance, Data Use Agreement and Privacy & Risk policies established in prior steps and made visible to Cliff during his shopping and check-out process. After enforcing Access Management, the remaining set of data includes everything that Cliff has requested and is authorized to work with, all without writing any complex integration code, asking someone within IT to add the task to their ever-growing list of things to do, creating a new Data Lake, etc. Cliff is on the precipice of success, but there’s another step…
Step 11: Compositing
If the data set Cliff requested includes duplicates (e.g., same customer repeated two or more times from the same data set) and/or Cliff requested multiple data sets that contain overlaps (e.g., two or more data sets have the same customer), Cliff’s analysis could be skewed and misleading. Cliff needs a technique for consolidating multiple references to the same thing into a single, golden record to support his analysis. The process of reducing multiple references to the same real-world thing is called Compositing.
In Step 7 (Data Matching), we automatically matched and linked records within and across data sets that represent the same thing, like Customer. It is during this step that we demonstrate how these link sets can be used to deliver truly trustworthy, high-quality data. Knowing that three different records across two different data sets are the same customer, Cliff can define Compositing Rules that tell the solution how to establish a single, golden record for analysis.
Some examples might be:
- All unique values
- Trusted Source
- Most commonly asserted value
- Most recently asserted value
For each attribute in the returned set of data, when there are multiple records containing a non-null value for an attribute, Cliff can establish a rule or rules for picking the value(s) that will represent the golden record. And if the attribute maps to a Reference Management code, its surviving value(s) can be transformed into the shared taxonomy for the logical attribute.
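The four example rules can be sketched as small functions applied attribute by attribute to a set of linked records. This is a minimal illustration under stated assumptions: the record layout, the `_source` and `_asserted_on` fields and the sample values are hypothetical, and ties or missing trusted sources would need explicit handling in practice.

```python
from collections import Counter
from datetime import date

# Each assertion is a (value, source, asserted_on) tuple for one attribute.


def all_unique(assertions):
    """Keep every distinct value, in first-seen order."""
    seen = []
    for value, _, _ in assertions:
        if value not in seen:
            seen.append(value)
    return seen


def trusted_source(assertions, trusted):
    """Take the first value asserted by the trusted source, if any."""
    for value, source, _ in assertions:
        if source == trusted:
            return value
    return None


def most_common(assertions):
    """Take the value asserted most often."""
    return Counter(value for value, _, _ in assertions).most_common(1)[0][0]


def most_recent(assertions):
    """Take the value with the latest assertion date."""
    return max(assertions, key=lambda a: a[2])[0]


def composite(linked_records, rules):
    """Build one golden record from records already linked as the same entity."""
    golden = {}
    for attr, rule in rules.items():
        assertions = [
            (r[attr], r["_source"], r["_asserted_on"])
            for r in linked_records
            if r.get(attr) is not None  # only non-null values compete
        ]
        golden[attr] = rule(assertions) if assertions else None
    return golden


# Three records across two data sets, linked as the same customer.
linked = [
    {"name": "Ann Lee", "email": "a@x.com", "city": "Austin", "phone": "555-0100",
     "_source": "crm", "_asserted_on": date(2023, 1, 10)},
    {"name": "Ann Lee", "email": "a@x.com", "city": "Dallas", "phone": None,
     "_source": "crm", "_asserted_on": date(2024, 3, 5)},
    {"name": "A. Lee", "email": "ann@y.com", "city": "Austin", "phone": "555-0199",
     "_source": "billing", "_asserted_on": date(2023, 6, 1)},
]
rules = {
    "name": lambda a: trusted_source(a, "billing"),
    "email": most_common,
    "city": most_recent,
    "phone": all_unique,
}
golden = composite(linked, rules)
print(golden)
```

Different attributes can use different rules within the same golden record, so Cliff can trust one source for names while preferring the freshest value for addresses.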
In the end, Cliff now has an absolutely pristine data set for his analysis. No coding. No begging for help. No artificial limits or obstacles. No risk of working around “the rules.” Just the best data the company has to offer that Cliff has authorization to use. That is truly a democratized data set.