What is reference data and why is it important?

Update from December 10, 2020

For business and technical users alike, reference data impacts daily operations. To optimize data use and availability, organizations need to know what reference data is, how it differs from master data, why it is important, and how to manage it efficiently with technology.

In his book Managing Reference Data in Enterprise Databases, Malcolm Chisholm, a world-renowned data management thought leader, defines reference data as “any data used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.”

Reference data carries meaning. It establishes permissible values, facilitates consistency, and maps internal data against external data and standards. Although it makes up a small share of total data volume, reference data accounts for 25% to 50% of the tables in databases and affects reporting accuracy and data governance.

Examples of reference data

Many reference data assets are maintained by standards bodies like ISO or by industry consortia. Some examples are:

Country codes
Measurement units
Currencies
Financial hierarchies
Products and pricing
Exchange codes
USPS postal codes

Reference data characterizes data and relates data to information in both internal and external databases. It can be as simple as specifying that all customer phone numbers must be ten digits in a customer relationship management (CRM) tool. These defined sets rarely change, and data users consistently use them in lookup tables, drop-down lists or pre-filled forms.
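As a minimal sketch of how such a defined set acts as a list of permissible values, the Python below checks a CRM record against a code set and the ten-digit phone rule. The status codes, field names and the validate_customer function are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List

# Hypothetical code set of permitted customer status values (illustrative only).
CUSTOMER_STATUS_CODES = {"ACTIVE", "INACTIVE", "PROSPECT"}

# Rule from the text: a customer phone number must be exactly ten digits.
TEN_DIGITS = re.compile(r"[0-9]{10}")


def validate_customer(record: Dict[str, str]) -> List[str]:
    """Return the reference-data violations found in one CRM record."""
    errors = []
    status = record.get("status")
    if status not in CUSTOMER_STATUS_CODES:
        errors.append(f"status {status!r} is not a permitted value")
    phone = str(record.get("phone") or "")
    if not TEN_DIGITS.fullmatch(phone):
        errors.append("phone must be exactly ten digits")
    return errors
```

An empty list means the record only uses permitted values; anything else can be routed to a data steward for correction.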

However, not all code sets are so cut and dried. A country code seems simple, yet even the International Organization for Standardization (ISO) defines country codes in several ways under ISO 3166: as two-letter (alpha-2), three-letter (alpha-3) and three-digit numeric codes.
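A minimal sketch of why this matters in practice: the same country can arrive as a two-letter, three-letter or numeric code, and downstream systems need to resolve all of them to one entry. The three rows are a small sample, and the CountryCode type and resolve_country function are hypothetical names used for illustration.

```python
from typing import Dict, NamedTuple, Optional


class CountryCode(NamedTuple):
    alpha2: str
    alpha3: str
    numeric: str  # stored as text so leading zeros are preserved
    name: str


# Small illustrative sample, not a complete code list.
COUNTRIES = [
    CountryCode("BE", "BEL", "056", "Belgium"),
    CountryCode("DE", "DEU", "276", "Germany"),
    CountryCode("US", "USA", "840", "United States of America"),
]

# Index every representation so any of the three forms finds the same entry.
_BY_ANY_CODE: Dict[str, CountryCode] = {}
for country in COUNTRIES:
    for code in (country.alpha2, country.alpha3, country.numeric):
        _BY_ANY_CODE[code] = country


def resolve_country(code: str) -> Optional[CountryCode]:
    """Look up a country by alpha-2, alpha-3 or numeric code; None if unknown."""
    return _BY_ANY_CODE.get(code.strip().upper())
```

For example, resolve_country("be") and resolve_country("056") both return the Belgium entry, while an unknown code returns None instead of silently passing through.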

Reference data can also change over time, so organizations need to continuously refresh and manage it to maintain quality. For instance, country codes change an average of 3 to 5 times per year, and currency codes change an average of 5 to 10 times per year.
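One common way to handle such changes, sketched here under assumptions, is to give every code a validity period so that historical records can still be checked against the codes in force on their date. The codes "OLD" and "NEW" and their dates are invented for illustration and do not correspond to real currency codes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class CodeVersion:
    code: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the code is still current

    def is_valid_on(self, day: date) -> bool:
        """True if the code was in force on the given day (bounds inclusive)."""
        return self.valid_from <= day and (self.valid_to is None or day <= self.valid_to)


# Hypothetical code set: "OLD" is retired and replaced by "NEW".
CURRENCY_CODES = [
    CodeVersion("OLD", date(2000, 1, 1), date(2022, 12, 31)),
    CodeVersion("NEW", date(2023, 1, 1)),
]


def valid_codes(code_set: Iterable[CodeVersion], day: date) -> Set[str]:
    """Return the codes that were valid on a given day."""
    return {version.code for version in code_set if version.is_valid_on(day)}
```

Keeping retired codes with an end date, rather than deleting them, lets older transactions remain interpretable after the code set is refreshed.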

Organizations use, customize and extend numerous existing industry ontologies to meet changing needs over time; as a result, they need to maintain consistency with the original standards to prevent drift from the external semantics. Any inconsistencies can impair decision making and diagnoses, and incur liability. To avoid these inconsistencies and minimize the consequences of poor reference data management, organizations need to make use of robust governance practices and policies.

What is reference data vs. master data?

A common misconception is that reference data and master data are identical, but they are two different types of data.

Reference data is the data used to define and classify other data. Master data is the data about business entities, such as customers and products. Master data provides the context needed for business transactions.

While both reference data and master data provide context for business activities, their usage and implementation help define their differences. Domain and subject matter experts curate, centrally administer and publish reference data to downstream systems. Reference data often drives control logic. It categorizes data into groups before data consumers analyze them, sometimes to unify external and internal data, and other times to classify it into buckets for analysis.

In short, reference data consists of sets of values or classification schemas that are referred to by systems, applications, data stores, processes, and reports, as well as by transactional and master records.

On the other hand, master data describes the people, places and things involved in an organization’s business. Organizations use master data to apply quality rules and to manage their transaction structure data and enterprise structure data, with the aim of creating a single golden record.
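The distinction can be sketched in a few lines of Python: the code sets are reference data, while the customer record is master data that points into them. All names and values here (COUNTRY_CODES, CUSTOMER_SEGMENTS, Customer, describe) are hypothetical and serve only to illustrate the relationship.

```python
from dataclasses import dataclass

# Reference data: small sets of permitted values used to classify other data.
COUNTRY_CODES = {"BE": "Belgium", "US": "United States of America"}
CUSTOMER_SEGMENTS = {"ENT": "Enterprise", "SMB": "Small and medium business"}


@dataclass
class Customer:
    """Master data: one business entity, classified by reference codes."""
    customer_id: str
    name: str
    country_code: str  # must be a key of COUNTRY_CODES
    segment_code: str  # must be a key of CUSTOMER_SEGMENTS


def describe(customer: Customer) -> str:
    """Expand a master record's codes into their reference-data labels."""
    country = COUNTRY_CODES[customer.country_code]
    segment = CUSTOMER_SEGMENTS[customer.segment_code]
    return f"{customer.name} ({segment}, {country})"
```

Here describe(Customer("C-001", "Example Corp", "BE", "ENT")) yields "Example Corp (Enterprise, Belgium)": the entity comes from master data and its classification comes from reference data.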

Why is reference data important?

Reference data affects every part of the organization because it helps provide context to data. It affects data quality and, in turn, data usability. Efficient reference data management is necessary for organizations aiming to achieve Data Intelligence.

Reference data use cases

Organizations use reference data to address a number of use cases. For example:

Agreed-upon metrics and hierarchies – Shared understanding across the organization helps build common metrics and hierarchies that can be easily leveraged for efficient operations
Clear data controls – Managing reference data access control helps establish ownership and accountability. It goes a long way in improving the data governance that is essential for trust in data
Trust in data quality – Consistent reference data usage can help build a single trusted view of data across the organization
Faster delivery of insights from data – Streamlining and automating reference data management seamlessly provisions it to all stakeholders. With access to quality reference data, business users can quickly derive insights from data, powering their business decisions

Consequences of poor reference data management

Misalignment of data and manual management of reference data pose many challenges and have real consequences, such as:

Insufficient governance – Organizations often have dozens or even hundreds of applications that hold data used by different people and different teams. Fragmented data and applications cause misalignment across the organization and make it difficult to formalize information, standards and processes. Because most organizations handle data governance activities manually, the result is slow and error-prone change management, along with fragmented and inconsistent reference data across the enterprise.
Inaccurate reporting and analytics – Inconsistent code values result in inaccurate and untrustworthy reporting and analytics. For example, business analysts examine data and make recommendations for critical decisions using reports based on regions, business units or territories, all of which represent reference data. If each source uses different code value sets, manual intervention becomes necessary to ensure the accuracy of data aggregation and business analytics.
Inefficient operations – To get the most out of data, data stewards need to monitor and refresh reference data consistently. However, manual reference data management is slow, prone to errors and not scalable. As an organization grows, this management becomes heavier and more complex, magnifying the operational and financial repercussions.
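The reporting problem described above can be sketched as follows: two sources label the same regions differently, so revenue only aggregates correctly once each local code is mapped to a shared code. The source names, region codes, crosswalk table and aggregate_revenue function are all hypothetical and used only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical crosswalk: (source system, local code) -> shared region code.
REGION_CROSSWALK = {
    ("crm", "N. America"): "NA",
    ("erp", "NORTHAM"): "NA",
    ("crm", "Europe"): "EU",
    ("erp", "EMEA-EU"): "EU",
}


def aggregate_revenue(
    rows: Iterable[Tuple[str, str, float]]
) -> Tuple[Dict[str, float], List[Tuple[str, str]]]:
    """Sum revenue per shared region; collect codes that have no mapping."""
    totals: Dict[str, float] = defaultdict(float)
    unmapped: List[Tuple[str, str]] = []
    for source, local_code, revenue in rows:
        shared = REGION_CROSSWALK.get((source, local_code))
        if shared is None:
            # No agreed mapping: flag for a data steward instead of guessing.
            unmapped.append((source, local_code))
            continue
        totals[shared] += revenue
    return dict(totals), unmapped
```

The unmapped list corresponds to the manual intervention the text mentions: every code that lands there needs a steward's decision before the report can be trusted.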

How do organizations manage reference data?

A reference data management tool is a mechanism that defines business processes around reference data and helps data stewards populate and manage it over time. Such a tool:

Automates workflows to create new codes and code sets
Delivers codes and code sets to data users
Maps data
Compares data from various parts of the organization
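As a rough sketch of the comparison step, and not a description of how any particular product implements it, the function below compares a local code set against a standard one and reports the differences. The unit codes and the compare_code_sets name are hypothetical.

```python
from typing import Dict, List


def compare_code_sets(standard: Dict[str, str], local: Dict[str, str]) -> Dict[str, List[str]]:
    """Report codes that differ between a standard code set and a local copy."""
    standard_keys, local_keys = set(standard), set(local)
    return {
        # Codes the standard defines but the local copy lacks.
        "missing_locally": sorted(standard_keys - local_keys),
        # Codes added locally that the standard does not define.
        "not_in_standard": sorted(local_keys - standard_keys),
        # Codes present in both whose labels have drifted apart.
        "label_mismatch": sorted(
            code for code in standard_keys & local_keys if standard[code] != local[code]
        ),
    }


# Hypothetical measurement-unit code sets for illustration.
STANDARD_UNITS = {"KG": "kilogram", "M": "metre", "L": "litre"}
LOCAL_UNITS = {"KG": "kilogram", "M": "meter", "LB": "pound"}
```

Running compare_code_sets(STANDARD_UNITS, LOCAL_UNITS) flags "L" as missing locally, "LB" as a local extension and "M" as a label mismatch, which is the kind of drift a steward would then review.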

Required capabilities for managing reference data

To manage reference data effectively, organizations need a suite of capabilities. An efficient reference data management solution must manage complex relationships across the enterprise. Organizations must invest in a data governance solution with native reference data management plus lineage, stewardship and workflow capabilities to resolve inconsistencies in the data:

Data governance – Data governance and reference data management go hand in hand. Data governance tools with native reference data management allow a complete audit trail and full visibility into processes, ownership and stewardship roles, and a shared understanding of reference data
Data lineage – A capability for mapping reference data from different sources to shared code sets and linking it to relevant terms for business and technical context
Workflows – Clearly defined and automated processes to facilitate collaboration and resolution of data inconsistencies
Stewardship – A system for managing tasks, roles and responsibilities to facilitate management as the data ecosystem evolves
Policy management – A tool for creating, reviewing and updating data policies to ensure adoption and maintain compliance

Managing reference data with Collibra

Many organizations use Collibra to manage their reference data. By leveraging Collibra’s products and capabilities like Data Governance, Collaborative Workflows, Data Stewardship, and more, our customers manage their reference data in the context of other initiatives and achieve data intelligence, all from one platform.