Organization, Context and Judgment: Going Further on the Data Intelligence Journey
We are tasked with assessing why a company is experiencing a high rate of customer churn. How do we drive for an outcome that is accurate, actionable, and low effort so the company can prevent future churn? We’re on a Data Intelligence journey in 12 steps. Here’s where we started; now, let’s consider:
- Step 4 – Cataloging: Registering and contextualizing all things data or things that use data
- Step 5 – Lineage and Use: Understanding where data originated, how it traveled, and what happened to it along the way
- Step 6 – Profiling & Scoring: Showing key characteristics, distributions and outliers to reveal the most trustworthy data
In this series, we’re on a journey. We’re following Cliff, a business analyst tasked with finding out why his company, which is doing well on many fronts, is experiencing a concerning trend of high customer churn. Given the large numbers at stake, the company must move quickly, but it must also accurately uncover the root cause and prescribe an action plan that works. Inaction is bad. The wrong action is worse. The answer is in the data. But what data?
This is the foundation of Data Intelligence. Data belongs to every knowledge worker and should flow through the organizational ecosystem in such a way as to let business professionals connect, communicate and collaborate in every way they need and choose. We’re with Cliff as he tries to find solutions to this real-world problem.
In the last piece, we covered the first three steps of this journey:
- Business Glossary: Developing a common language to ensure that every term commonly used across the enterprise means the same thing to every constituency
- Data Domains: Identifying the nouns that really drive the business — employees, products, customers, locations and more — to offer a canonical view of the enterprise
- Policy and Reference Management: Roles and responsibilities, data ownership, data usage agreements, retention and destruction policies, and much more form the framework for enforcing and adhering to company-defined and regulatory-defined rules and guidelines.
Although these steps are arduous and sometimes tedious, skipping or rushing through them is tantamount to surrendering to the lure of a quick answer without substance. So, let’s move forward.
Step 4: Cataloging
Most companies have vast quantities of data and data sources fragmented throughout the enterprise. Cataloging is the process of discovering and registering, with context, these vast data holdings as well as the artifacts that use data, such as reports, APIs, and algorithms, so that knowledge workers like Cliff can easily search for and locate items of interest and need.
This seems like the most fundamental aspect of data management, but again, the obstacles are considerable and time-consuming. Cataloging encompasses:
- Massive volumes: The physical elements (e.g., Database columns) can count in the hundreds of millions
- Frequent redundancy: The same data can be copied and re-stored under a different name many times over
- Countless variations: Different constituencies have diverse naming conventions; sizes and shapes of particular data sets vary throughout the enterprise
Cataloging starts with discovery: identifying and distinguishing between databases, reports, algorithms, APIs, topics and more. It organizes (and reorganizes) the data into accessible fields like tables and columns, and tracks data movement, such as from workbook to report. It not only accelerates machine learning but also surfaces the data used in the process.
The goal is to identify every data-related element and associate it with its logical peer in your canonical domain model (from Step 2). As an example, imagine that in Step 2 you created the Customer Domain Model, and one of the logical attributes of the Customer is Date of Birth. While cataloging your Salesforce Automation solution, you discover a physical column with a heading of Attr_Dt. Short of specialized knowledge, it would be challenging to determine what this physical attribute represents. The next logical step a data steward might take is to assess the table name, the adjacent column names, and even sample data in the Attr_Dt column. This could take minutes or more.

Now imagine you have 5 million physical columns like Attr_Dt. At 1 minute per physical attribute, this effort would take a data steward nearly 40 person-years to complete (8-hour days, 261 work days per year). Since no one would accept that, cataloging requires automation not only to discover but also to contextualize, also known as classifying, your data vis-à-vis the logical domain model. By automating those logical next steps (table name, adjacent columns, sampling and content inspection) with a classification algorithm (machine learning), companies can radically reduce 40 person-years of manual effort to months, weeks or even days.
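To make the classification step concrete, here is a minimal sketch of how such an algorithm might weigh the three signals named above: the column name, adjacent column names, and sampled content. The token lists, weights, and column names are hypothetical; a production classifier would be a trained model, not hand-set heuristics.

```python
from datetime import datetime

def looks_like_date(value: str) -> bool:
    """Heuristic content check: does a sample value parse as a date?"""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False

def classify_column(col_name, adjacent_cols, sample_values, candidates):
    """Score each candidate logical attribute for one physical column.

    candidates maps a logical attribute name to the name tokens
    associated with it (e.g. "Date of Birth" -> ["birth", "dob", "dt"]).
    Returns the candidates ranked by combined evidence score.
    """
    scores = {}
    name = col_name.lower()
    for logical, tokens in candidates.items():
        score = 0.0
        # Evidence 1: token overlap with the physical column name
        score += sum(1.0 for t in tokens if t in name)
        # Evidence 2: token overlap with adjacent columns (weaker signal)
        for adj in adjacent_cols:
            score += 0.25 * sum(1 for t in tokens if t in adj.lower())
        # Evidence 3: content inspection of sampled values
        if "date" in logical.lower():
            date_like = sum(looks_like_date(v) for v in sample_values)
            score += 2.0 * date_like / max(len(sample_values), 1)
        scores[logical] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = classify_column(
    "Attr_Dt",
    adjacent_cols=["Cust_Nm", "Cust_Addr"],
    sample_values=["1984-07-19", "1990-02-03", "2001-11-30"],
    candidates={"Date of Birth": ["birth", "dob", "dt"],
                "Customer Name": ["name", "nm"]},
)
# ranked[0][0] == "Date of Birth": the date-like samples tip the score
```

Even this toy version shows why automation pays off: the same three checks a steward performs in minutes run in microseconds, and the ranking can be reviewed rather than computed by hand.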
Cataloging is not a one-time event; it is an ongoing process. New data sets not previously cataloged will appear. That said, you do not have to catalog everything in your ecosystem before delivering value to users like Cliff. Bootstrapping your Data Intelligence graph, associating and linking nodes via cataloging, invites adoption, use, and contributions. It is the equivalent of opening your store for business before you have stocked every shelf; your consumers will be happy to suggest what else you should focus on.
Step 5: Lineage and Use
Continuing with the store metaphor: if the foundational steps (Steps 1–3) built the physical building, and cataloging stocked the shelves in a thoughtful (classified) way, then a shopper like Cliff will want to know a bit more about an item on the shelf before placing it in his cart. Two questions are of particular interest: where did this data come from, and who else uses it? Incorporating data lineage into your knowledge graph addresses exactly these questions.

Not unlike cataloging, data lineage requires a discovery and harvest method. Lineage information can be harvested from various sources, such as SQL (e.g., stored procedures), ETL/ELT technologies, reporting/BI platforms, and code scans. A processor of the harvested information should unearth, at its core, a physical element (node 1), a second physical element (node 2), and any logic (e.g., a transformation) applied to the value from node 1 before it is inserted into node 2. These nodes are then linked back to the cataloged nodes (called stitching), and an edge is created between the two nodes in the Data Intelligence Graph. Lineage helps organizations connect different systems and processes to offer a complete picture of how data flows across the enterprise at the conceptual, logical and physical layers. Where cataloging enables discovery of data at rest, lineage shows how it got there and where it goes next.
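The node-and-edge model described above can be sketched minimally in Python. All system and column names here are hypothetical, and a real lineage processor would persist the graph and stitch nodes against the catalog rather than hold them in memory.

```python
from collections import defaultdict

class LineageGraph:
    """Minimal lineage store: nodes are physical elements; each edge
    records the transformation applied between source and target."""

    def __init__(self):
        self.upstream = defaultdict(list)    # target -> [(source, transform)]
        self.downstream = defaultdict(list)  # source -> [(target, transform)]

    def add_edge(self, source, target, transform="copy"):
        self.upstream[target].append((source, transform))
        self.downstream[source].append((target, transform))

    def trace_upstream(self, node):
        """Walk the graph backwards to answer: where did this data come from?"""
        seen, stack, origins = {node}, [node], []
        while stack:
            current = stack.pop()
            parents = self.upstream.get(current, [])
            if not parents and current != node:
                origins.append(current)  # no parents: this is a source system
            for source, _ in parents:
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return origins

g = LineageGraph()
# Edges as they might be harvested from an ETL job (names hypothetical)
g.add_edge("crm.customer.attr_dt", "staging.cust.birth_date",
           transform="TO_DATE(attr_dt, 'YYYY-MM-DD')")
g.add_edge("staging.cust.birth_date", "warehouse.dim_customer.date_of_birth")

origins = g.trace_upstream("warehouse.dim_customer.date_of_birth")
# origins == ["crm.customer.attr_dt"]
```

Keeping a `downstream` index alongside `upstream` is what makes impact analysis cheap: the same walk in the other direction answers "what breaks if this column changes?"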
Data lineage reveals how data is transformed through its life cycle across interactions with systems, applications, APIs and reports. It automatically maps relationships between data to show how data sets are built, aggregated, sourced and used, providing complete, end-to-end relationship visualization.
This adds accuracy and understanding to raw data, enhances trust, and fuels sharper inferences and business insights. It even enables impact analysis at a granular level — columnar, table, or business report — of any changes to downstream systems.
This is a strategic advantage, and it reflects recent advances. For much of the digital era, data architects had to build relationships manually between large data volumes to create lineage graphs. Newer technologies allow for most of this work to be done (almost) automatically, and much more efficiently. Today, by extracting lineage automatically from dispersed source systems and keeping it up to date, organizations can devote resources to strategic initiatives rather than endless data mapping.
Besides the clear business advantages — such as helping Cliff identify patterns in customer behavior — lineage can play a key role in ensuring compliance. A technical lineage view allows users to visualize transformations, drill down into table/column/query-level lineage, and navigate through data pipelines. This is important for providing necessary information to regulators.
Step 6: Profiling & Scoring
Again, referring back to the shopping example: a common request when weighing two or more options is the ability to compare them based on what matters to you. Suppose that when Cliff is looking for a data set for his churn analysis, he believes Age is an important criterion to segment on. As he evaluates his options for Customer data across a range of systems (e.g., Salesforce Automation, ERP, Order Management, Web) that have been cataloged and classified to the Customer Domain Model, it would be beneficial for him to determine which of these data sets offers the best quality and veracity. In shopping terms, Cliff wants to read the ingredients label and compare it across his options.
A common method for making the “ingredients” visible is to profile the content of a given physical column. Profiling extracts statistical information such as row count, % null, % invalid, frequency distribution, minimum length, and so on. This information might be useful to users steeped in data science and data quality, but for shoppers like Cliff it is largely unreadable, or requires too much effort to guide rapid decision making. But imagine if you could compute a Score from all the statistical information gathered during profiling. And further imagine that this Score helped you quickly rank and order your options, making your choice of the best data set easy, accurate, and trustworthy. Cliff could then compare his options side by side, quickly determine which is best for his analysis, and confidently place the request in his shopping cart.
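As a rough sketch of how profiling statistics collapse into a single ranking score, consider the example below. The weights, the validity rule (Age between 0 and 120), and the two data sets are all illustrative assumptions; a real scorer would tune its weights per domain and draw on many more profile dimensions.

```python
def profile(values):
    """Basic column profile: the statistical 'ingredients' of one column."""
    non_null = [v for v in values if v is not None]
    valid = [v for v in non_null if isinstance(v, int) and 0 <= v <= 120]
    n = len(values) or 1
    return {
        "rows": len(values),
        "pct_null": 1 - len(non_null) / n,
        "pct_valid": len(valid) / n,
        "distinct": len(set(non_null)),
    }

def score(p):
    """Collapse a profile into one comparable 0-100 score.

    The weights below are illustrative only: reward valid values,
    penalize nulls, then normalize against the best possible raw score."""
    raw = 0.6 * p["pct_valid"] - 0.4 * p["pct_null"]
    return round(max(raw, 0.0) / 0.6 * 100)

# Hypothetical Age samples from two cataloged Customer data sets
datasets = {
    "Salesforce":       [34, 29, None, 41, 53, 47],
    "Order Management": [34, None, None, 999, 53, None],  # 999 is out of range
}
ranked = sorted(datasets, key=lambda name: score(profile(datasets[name])),
                reverse=True)
# Salesforce ranks first: fewer nulls and no out-of-range ages
```

The point of the single number is exactly the shopping-cart moment: Cliff never reads the raw distributions, he just sees one data set scored above the other.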
In business terms, these capabilities give Cliff greater freedom to develop his own analyses. If he is examining subsets such as customer age range, zip code or buying frequency to identify patterns behind the churn, he can work with only the most trustworthy data, rather than sifting through all of it or running the risk of working with the data of least value.
To summarize: first we built the store. Second, we acquired the items for sale, then organized and placed them on shelves for easy discovery. Third, we made it easy to determine where each item came from, what happened along its journey, who else uses it, and where it goes next. And finally, we provided easy-to-read ingredients and comparison scores as visual clues in support of comparison shopping. Cliff is extremely productive and confident about what he wants, but it seems it will be complicated to gather and assemble it all.