I have been a proud, card-carrying member of the Hunter-Gatherer Society for over three decades. By the Hunter-Gatherer Society, I mean all the processes that data analysts and data scientists spend the majority of their time on: gathering and preparing the data for the analytic models and reports they are responsible for delivering to the business.
Over the course of my career, I faced different challenges than data scientists face today. Although the amount of data was limited, all of the applications were internally developed. Documentation was sparse and out of date, and the semantics and meaning of the data were known only to a few people with tribal knowledge, who had little incentive to share it. I could not use the data where it lived in the application, so I had to extract it manually and integrate it into a different data structure. Gathering, integrating, standardizing, and cleansing the data was hand-coded work that generally took 15 to 18 days per month, leaving me only two to five days to actually conduct the analytics and make recommendations. In other words, I often spent 90 to 95% of the month gathering data and only 5 to 10% analyzing it and making recommendations.
Data warehousing architectures, techniques, and technologies have improved the gathering portion of this work. However, we still face challenges in finding the right data given the increasing volume and variety. I recently read articles suggesting that today’s data scientists spend an average of 75% of their time finding the right data, validating the quality and controls applied to it, integrating it with other data sets, and then preparing it for analytics. For all our efforts, data warehousing may not have significantly eased the burdens of the Hunter-Gatherer Society.
The Challenges Facing Data Scientists
Data scientists today face greater challenges than I did decades ago. What are the challenges they must address? They need to ask questions such as:
- What data can I access to conduct my business processes?
- The organization states requirements in business terms, such as “What is the customer churn, by region and product line, for retail women’s clothing?” and “What is the percentage of Facebook dislikes associated with those customers?” But where do I find the physical data in our data lake, and what are the authoritative sources?
- How does the organization define the data? And does that definition match what I think it means?
- What data quality measures and controls have been applied? Do they meet my expectations for my analysis processes? Can I trust the data?
- Where did the data come from? Can I depend upon this source?
- How timely is the data in this source?
- What controls has the organization applied to this data? And do they meet our industry’s expectations for controls compliance?
- How do I get access to this data?
- What are the usage constraints (security, privacy)?
- Who else uses this data? For what purposes? Can I ask them about using this data?
- What is the update or refresh frequency for this data?
- Where in the data lake is the Facebook data? Where are the customer number and the product number in this data stream?
- Who do I talk to for further understanding and HELP?
We are addressing these challenges with the Collibra Catalog. The data catalog is a new approach that many data scientists and analysts are looking forward to leveraging. I know I am. And while the data catalog certainly benefits all data citizens across the organization, let’s take a closer look at how it helps data scientists specifically.
Addressing the Challenges with a Data Catalog
A data catalog gives analytics professionals a better way to easily find trusted data and determine its value. It is a single source that gives analysts a view of the data the organization maintains. It contains metadata about data objects: definitions of tables and columns, synonyms, value ranges, indexes, consumer groups, and accountable parties. The data catalog also provides a view of the physical data environment that is linked to the business glossary. It can therefore connect technical metadata with business metadata to answer all of the questions above.
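To make that linkage concrete, here is a minimal sketch in Python of what a catalog entry might look like, with technical metadata (where the data physically lives) tied to business metadata (what it means and who stands behind it). The structure, field names, and sample values are illustrative assumptions for this post, not the Collibra data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TechnicalMetadata:
    """Where and how the data physically lives."""
    source_system: str           # e.g. a data lake zone or database
    table: str
    columns: List[str]
    refresh_frequency: str       # e.g. "hourly", "daily"

@dataclass
class BusinessMetadata:
    """What the data means and who stands behind it."""
    glossary_term: str           # link to the business glossary entry
    definition: str
    steward: str                 # the accountable party to ask for help
    quality_score: float         # 0.0-1.0, from profiling and controls
    usage_constraints: List[str] = field(default_factory=list)

@dataclass
class CatalogEntry:
    """One cataloged asset: technical and business views, linked."""
    name: str
    technical: TechnicalMetadata
    business: BusinessMetadata
    authoritative: bool = False
    lineage: List[str] = field(default_factory=list)  # upstream sources

# A hypothetical entry a data scientist might find when hunting churn data.
churn = CatalogEntry(
    name="customer_churn_monthly",
    technical=TechnicalMetadata(
        source_system="lake/curated",
        table="crm.customer_churn",
        columns=["customer_number", "product_number", "region", "churn_flag"],
        refresh_frequency="daily",
    ),
    business=BusinessMetadata(
        glossary_term="Customer Churn",
        definition="Customers who closed all accounts within the period.",
        steward="jane.doe@example.com",
        quality_score=0.96,
        usage_constraints=["contains PII", "internal use only"],
    ),
    authoritative=True,
    lineage=["crm.accounts", "crm.account_events"],
)

# Several of the questions above become simple lookups:
print(churn.business.definition)          # How does the organization define it?
print(churn.business.steward)             # Who do I talk to for help?
print(churn.technical.refresh_frequency)  # How timely is the data?
print(churn.lineage)                      # Where did the data come from?
```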
The data catalog is an effective technology for reducing the time data scientists spend on “hunting and gathering” activities. Knowing the quality, controls, lineage, and authoritative source of the data also cuts the time needed to standardize, cleanse, and scrub it before integration. Understanding the business definitions and usage constraints reduces that time further.
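As a sketch of how those savings show up in practice, a data scientist could screen catalog search results for assets that already meet quality and trust thresholds before pulling any data, setting aside candidates that would demand heavy cleansing. The threshold, field names, and sample entries below are made up for illustration.

```python
from typing import Dict, List

def trusted_candidates(entries: List[Dict], min_quality: float = 0.9) -> List[Dict]:
    """Keep only authoritative assets whose quality score meets the bar,
    so preparation starts from data that needs little cleansing."""
    return [
        e for e in entries
        if e["authoritative"] and e["quality_score"] >= min_quality
    ]

# Hypothetical extract from a catalog search; the scores would come from
# the organization's profiling and control checks.
results = [
    {"name": "crm.customer_churn",        "authoritative": True,  "quality_score": 0.96},
    {"name": "staging.churn_raw",         "authoritative": False, "quality_score": 0.61},
    {"name": "social.facebook_reactions", "authoritative": True,  "quality_score": 0.84},
]

print([e["name"] for e in trusted_candidates(results)])
# -> ['crm.customer_churn']
```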
We may never make the self-service analytics environment completely automatic, as the use cases are broad and complex. Yet the data catalog is a significant tool for the data scientist and should sharply reduce the share of time spent on Hunter-Gatherer activities. That gives data scientists more time to actually conduct the analytics the organization hired them to perform, and that should lead to greater business value and opportunities from analytics. Let’s all leverage the Collibra Catalog and stick a spear in the Hunter-Gatherer Society: we need to improve the productivity of the data scientist while improving the trust in our analytics.