It is often asked if a data governance program automates not only the ingestion of technical metadata, but also the connection of the technical metadata to its business context.
For example, we can import into a semantic catalog a list of tables with their columns from a schema or a database from a system. And as a result, we can see the hierarchy of this data set nicely catalogued into a data governance environment like Collibra. Here is an example:
As shown above, we see the ingestion of two tables and their columns. However, what is the next step? Some questions that we need to answer are:
- What is the meaning of each of the columns? What do they represent?
- What role does a column play in the formation of other columns? For example, is that column part of the calculation of a metric in a report?
- Which columns represent critical business elements like a metric, a KPI, or a master data attribute?
The questions above are just the beginning of this course, as there are a lot more to consider:
- Lineage of transformation from source to target
- Authoritative source versus just another database containing this value
- Trust level which implies data quality rules applied and measured
- Data quality rules relate to the asset, with its hierarchy of business rules and policies
- Processes and business units associated
- Stewards and responsible parties
But let’s get started with the three important questions above that address the first line of lineage.
It is easier to start from importing technical metadata. Just scan the databases and bring them all in. No requirement for business models, or ontologies, or asking business people to design anything. But then what? How do you know that this column stores a customer ID? In addition, how do you know if this customer ID is a master attribute or not? Furthermore, how does it translate into a business word that any user can understand? Is it a customer ID for the Customer Service center?
The proposed solution
There are two ways to understand the meaning of data. Let’s look at each one in detail:
User input is the first step. The user can indicate some of the following right upon data ingestion:
- Context of the schema or table that is being ingested. For example, it is about customer data for service center or from the master data database.
We can try to automate the correlation of technical metadata to business data. There are a few ways we can do that:
- The tool can read the table columns and based on the name of the table column, you can identify a close connection to a business word that can potentially represent the meaning of the data in the column.
- DBAs can name columns by using a full English name or a combination of names. For example: CUSTOMER_ID. In this case the parser will remove the underscore, and will return two words: CUSTOMER ID. This by itself is an English set of words that can be used for a business term. Sometimes these English words are truncated by removing the vowels or by cutting the word at some point. If the word is cut somewhere in the middle, for example CUST_ID then we can use a prefix matching algorithm like a “trie”. The “trie” will attempt to find the possible English words that stem from its trimmed down version. A simplification would be the use of deterministic acyclic finite state automaton (DAFSA), which takes less space than a trie.
- If the name of the column is not the prefix of an English word, but a different variation, for example it is made by the removal of vowels and the insertion of special characters, then we can index that word and use the “trie” algorithm on the subset of the word in order to come up with a suggested English word. For example, CUSTMR_ID can be indexed at the first non-vowel character immediately following a vowel. So, we can start with the prefix CUS and we use the trie method as above to suggest the rest.
- If, however, the case is not as simple as the removal of vowels, and we want any subsequence of the word to match the word, then this is a more complex problem and we need to match a word using the longest common sequence measure (LCS). With LCS, we can define whether the string we have as a column name has a common sequence of characters with an English word from the dictionary. These algorithms are very complex and time consuming to implement.
- We can improve the suggestion of a correlation of the table column name to a business term by also looking at the content of the table column by parsing the data. This way, we can assign probabilities to the suggested business name/term which can be enforced by the parsing and analysis of the column content.
- If, for example, the content of the table column is an integer, and the name is suggested to be CUSTOMER ID, then the probability of our business term suggestion being a success is higher.
- Also by looking at the possible patterns of the content of the table column, we can further enhance our suggestions. For example, using pattern searching algorithms within the content of a table column, we can identify if a column is for example a phone number (by identifying a pattern of a 10-digit number in the sampling of content data), a social security number (a pattern of a 9-digit number), or a financial number used for a KPI.
- There are two kinds of artificial intelligence algorithms. The self-discovering patterns within text and the machine learning ones. For machine learning, you need a training set first to start training the program on what should be expected from the incoming data. With the pattern discovering approach, there is no training set required to start the work.
- In the case of parsing the content of the table and finding patterns, the first step would be to use data mining algorithms. These do not require a training set, but they are more expensive to implement as they require more recursive passes over the content and more resources.
Artificial Intelligence offers an amazing array of algorithms and methods to help us automate matching technical names of assets to business terms and concepts. However, a methodical step by step approach is necessary for this implementation. Also, some of these methods can be very complex. And most of the time the output should be confirmed by a business user.
There will always be a need for a business steward to validate the automated work of business model design based on a technical data model. And the work that comes out of a technical model in an automatic way is certainly very helpful, but will never be final before the human factor refines it.
Vasiliki has 25 years of experience in Enterprise Information Management and Architecture. She worked for 23 years at IBM and Oracle, and is currently part of the information governance sales team at Collibra. She also serves as an adjunct professor at Fordham University, Gabelli School of Business where she teaches an evening graduate class on Information Management. Vasiliki is fluent in French, English and native Greek and Ancient Greek.