Big Data and the Growing Need for a Data Catalog
In this day and age, our digital society produces more data than ever. So it should not come as a surprise that data is becoming one of the most important assets to an organization. To mine for the hidden gems in our vast amount of data, we have data stewards, data engineers, data scientists, and many others. While all of these people perform different tasks, they are all data citizens and they face the same set of challenges that come with big data.
Two important data challenges are volume and distribution. While distributed computing and storage systems like Hadoop rule the big data scene, they also introduce a certain amount of chaos. It is easy to end up in a situation where some data is stored in Hive, other data sits in files on HDFS, and some applications keep their data in HBase, in a collection of relational databases, or in other NoSQL stores. On top of that, you probably still have some more traditional data warehouses, in addition to the data that sits on employees’ workstations. So how can we manage all this data scattered across different systems? And how can we make sure that the right people get access to the right data assets, preferably while respecting the relevant data protection requirements?
My big data career started when I joined the high energy physics research group at the Vrije Universiteit Brussel (VUB) as a PhD student in 2009. I was very fortunate to be part of the Compact Muon Solenoid (CMS) collaboration, which included roughly 3000 scientists and engineers. The enormous CMS detector is one of four technological marvels installed on the Large Hadron Collider (LHC), the world’s biggest and most powerful particle accelerator, built at the European Organization for Nuclear Research (CERN) near Geneva, Switzerland. The LHC accelerates protons to nearly the speed of light and collides them head-on inside the detectors. These big detectors can be thought of as huge digital cameras that record pictures of the collisions that take place.
If you consider that CMS collects about 1 terabyte of data every hour at peak performance, you quickly end up with multiple petabytes of data each year. Think about the size of the data mop you would need to clean this data…
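The step from 1 terabyte per hour to petabytes per year is simple arithmetic, sketched below. Only the 1 TB/hour peak rate comes from the text; the function name and the 30% duty cycle are hypothetical values chosen purely for illustration, since the detector does not record at peak rate all year round.

```python
# Back-of-the-envelope estimate of the yearly data volume.
TB_PER_HOUR_AT_PEAK = 1.0   # peak rate quoted in the text
HOURS_PER_YEAR = 24 * 365   # 8760 hours
TB_PER_PB = 1000.0          # decimal units: 1 PB = 1000 TB


def yearly_volume_pb(duty_cycle: float) -> float:
    """Petabytes recorded per year, given the fraction of the year spent at peak rate."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    terabytes = TB_PER_HOUR_AT_PEAK * HOURS_PER_YEAR * duty_cycle
    return terabytes / TB_PER_PB


# Upper bound: recording at peak rate around the clock, all year.
upper_bound_pb = yearly_volume_pb(1.0)

# Hypothetical duty cycle of 30% (an assumption, not a CMS figure).
assumed_pb = yearly_volume_pb(0.3)

print(f"upper bound: {upper_bound_pb:.2f} PB/year")
print(f"at 30% duty cycle: {assumed_pb:.2f} PB/year")
```

Even with a modest duty cycle, the estimate stays in the multi-petabyte range, with roughly 8.8 petabytes per year as the ceiling for a constant 1 TB/hour.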
The only way to deal with these kinds of volumes is through a worldwide distributed computing and storage system organized in tiers: the Worldwide LHC Computing Grid (WLCG). In this system, the experiment’s raw data flows from CERN to different Tier-1 centers, where it is staged and reconstructed from raw “pixels” into a higher-level description that includes particles like electrons or muons. The reconstructed data is then staged at different Tier-2 centers across the globe, where physicists run their daily analysis workflows.
Given the sheer volume and complexity of the data, most analysis work starts with some fundamental data-related questions like:
- Which pieces of the data do I need?
- Do we have this data?
- Where is the data physically located (imagine having to download a few terabytes over the Internet)?
- Is this data certified for publishing scientific results?
- Are there any issues reported concerning these data elements?
Answering these questions can be a painstaking process. However, CMS overcame this issue by constructing its data catalog. This tool formed the central body of knowledge on the data. The physicists used the data catalog to find out which Grid centers were hosting which parts of the data, and which ones contained additional information on data-taking conditions. It also housed information on data quality and certification, the software framework version used for reconstruction, applied calibrations, and much more. All of this information was crucial to handling the data correctly and producing high-quality physics results. And with the data catalog, this information was available at the physicist’s fingertips, hassle-free.
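To make the idea concrete, the questions listed above can be sketched as lookups against a small in-memory catalog. This is only an illustration, not the CMS catalog or any Collibra API: the class names, fields, dataset name, and site names below are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata about one dataset (all fields are illustrative assumptions)."""
    name: str
    sites: list[str]              # where the data is physically hosted
    size_tb: float                # volume, useful before attempting a download
    certified: bool               # approved for publishing results?
    software_version: str         # framework version used for reconstruction
    issues: list[str] = field(default_factory=list)  # reported problems


class DataCatalog:
    """A toy catalog that stores metadata only, never the data itself."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Which pieces of the data do I need?"""
        return sorted(n for n in self._entries if keyword.lower() in n.lower())

    def has(self, name: str) -> bool:
        """Do we have this data?"""
        return name in self._entries

    def locate(self, name: str) -> list[str]:
        """Where is the data physically located?"""
        return list(self._entries[name].sites)

    def is_certified(self, name: str) -> bool:
        """Is this data certified for publishing scientific results?"""
        return self._entries[name].certified

    def open_issues(self, name: str) -> list[str]:
        """Are there any issues reported concerning these data elements?"""
        return list(self._entries[name].issues)


# Hypothetical example entry; none of these names refer to real datasets or sites.
catalog = DataCatalog()
catalog.register(
    DatasetEntry(
        name="muon-sample-a",
        sites=["tier2-site-x", "tier2-site-y"],
        size_tb=4.2,
        certified=True,
        software_version="v1.0",
        issues=["calibration pending for a subset of files"],
    )
)

print(catalog.search("muon"))
print(catalog.locate("muon-sample-a"))
print(catalog.is_certified("muon-sample-a"))
```

The point of the design is that every question is answered from metadata alone, so nobody has to move terabytes of data just to find out whether it is the right data.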
What amuses me is that, although my work environment changed after I graduated and left academia, I still face the same fundamental data questions. And I believe that I am not alone. Alongside proper data governance, the need for a data catalog will grow with the tremendous growth of data assets. That’s why Collibra is releasing Collibra Catalog. By reducing the time spent on questions like “where can I find my data,” we can unlock more insights from our data. The data catalog will certainly become an indispensable part of our data toolbox.