In this day and age, our digital society produces more data than ever. So it should not come as a surprise that data is becoming one of the most important assets of an organization. To mine for the hidden gems in our vast amounts of data, we have data stewards, data engineers, data scientists, and many others. While all of these people perform different tasks, they are all data citizens, and they face the same set of challenges that come with big data.
Two important data challenges are volume and distribution. Distributed computing and storage systems like Hadoop rule the big data scene, but they introduce a certain amount of chaos. It is easy to end up in a situation where some data is stored in Hive, other data lives in files on HDFS, some applications keep their data in HBase or other NoSQL stores, and yet more sits in a collection of relational databases. On top of that, you probably still have some more traditional data warehouses, in addition to the data on employees’ workstations. So how can we manage all this data scattered across different systems? And how can we make sure that the right people get access to the right data assets, while respecting the relevant data protection rules?
My big data career started when I joined the high energy physics research group at the Vrije Universiteit Brussel (VUB) as a PhD student in 2009. I was very fortunate to be part of the Compact Muon Solenoid (CMS) collaboration, which includes roughly 3,000 scientists and engineers. The enormous CMS detector is one of four technological marvels installed on the Large Hadron Collider (LHC), the world’s biggest and most powerful particle accelerator, built at the European Organization for Nuclear Research (CERN) near Geneva, Switzerland. The LHC accelerates protons to nearly the speed of light and collides them head-on within the detector. These big machines can be thought of as huge digital cameras that record pictures of the collisions that take place inside them.
If you consider that CMS collects about 1 terabyte of data every hour at peak performance, we quickly end up with multiple petabytes of data each year. Think about the size of the mop you would need if you had to clean all this data…
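A quick back-of-the-envelope calculation shows how fast those petabytes accumulate. The number of data-taking hours per year below is purely an illustrative assumption, not an official CMS figure; only the 1 TB/hour peak rate comes from the text:

```python
# Back-of-the-envelope estimate of yearly CMS data volume.
TB_PER_HOUR = 1                       # peak rate, from the text
DATA_TAKING_HOURS_PER_YEAR = 4_000    # assumed duty cycle (illustrative)

tb_per_year = TB_PER_HOUR * DATA_TAKING_HOURS_PER_YEAR
pb_per_year = tb_per_year / 1_000     # 1 PB = 1,000 TB

print(f"~{pb_per_year:.0f} PB of raw data per year")  # ~4 PB
```

Even under this conservative assumption, a single year of running lands in the petabyte range, and several years of operation multiply it further.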
The only way to deal with these kinds of volumes is through a worldwide distributed computing and storage system, organized in tiers: the Worldwide LHC Computing Grid (WLCG). In this system, the experiment’s raw data flows from CERN to the Tier 1 centers, where it is staged and reconstructed from raw “pixels” into a higher-level description in terms of particles such as electrons or muons. This reconstructed data is then staged at Tier 2 centers across the globe, where physicists run their daily analysis workflows.
Given the sheer volume and complexity of the data, most analysis work starts with fundamental data-related questions: Where is the data hosted? Under which conditions was it recorded? Has it been certified as good for analysis?
Answering these questions can be a painstaking process. CMS overcame this by constructing its data catalog, which became the central body of knowledge about the data. Physicists used the catalog to find out which Grid centers were hosting which parts of the data, and which centers held additional information on data-taking conditions. It also housed information on data quality and certification, the software framework version used for reconstruction, applied calibrations, and much more. All of this information was crucial for handling the data correctly and producing high-quality physics results. With the data catalog, it was all available at the physicist’s fingertips, hassle-free.
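The kind of bookkeeping a data catalog does can be sketched as one metadata record per dataset plus a simple lookup. Everything below is a toy illustration: the field names, site names, dataset name, and version string are hypothetical, not the actual CMS catalog schema.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """Illustrative catalog entry; fields mirror the kinds of
    metadata described in the text, not a real CMS schema."""
    name: str
    hosting_sites: list           # which Grid centers host the data
    conditions: str               # data-taking conditions
    certified_good: bool          # data quality / certification flag
    software_version: str         # framework version used for reconstruction
    calibrations: list = field(default_factory=list)

# A tiny in-memory "catalog" keyed by dataset name (all values made up).
catalog = {
    "/Run2010A/Muons": DatasetRecord(
        name="/Run2010A/Muons",
        hosting_sites=["T2_BE_IIHE", "T2_CH_CERN"],
        conditions="7 TeV proton-proton collisions",
        certified_good=True,
        software_version="CMSSW_3_8_4",
        calibrations=["muon-alignment-v2"],
    ),
}

def sites_hosting(dataset_name: str) -> list:
    """Answer 'where can I find my data?' with a single lookup."""
    return catalog[dataset_name].hosting_sites

print(sites_hosting("/Run2010A/Muons"))  # ['T2_BE_IIHE', 'T2_CH_CERN']
```

The point of the sketch is that once this metadata is gathered in one place, "where is my data and can I trust it?" turns from a painstaking investigation into a dictionary lookup.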
What amuses me is that, although my work environment changed after I graduated and left academia, I still face the same fundamental data questions. And I believe I am not alone. Alongside proper data governance, the need for a data catalog will grow with the tremendous growth of data assets. That’s why Collibra is releasing Collibra Catalog. By reducing the time spent on questions like “where can I find my data,” it lets us unlock more insights from our data. The data catalog will certainly become an indispensable part of our data toolbox.
Michael is a Big Data Specialist at Collibra. He is fascinated by big data technologies like Hadoop and Spark, as well as data science at scale. Before joining Collibra, Michael was a data mining applications developer at a direct marketing company; he holds a PhD in experimental high energy physics from the Vrije Universiteit Brussel.