Is Your Data Lake like the Library of Babel?
In 1941, renowned Argentine author Jorge Luis Borges published a short story entitled “The Library of Babel.” The story describes a universe consisting of an unimaginable stretch of hexagonal rooms, each of which holds the bare necessities for human survival and four walls of bookshelves. Though the order and content of the books are arbitrary and seemingly meaningless, the inhabitants believe that the books contain every possible ordering of just 25 basic characters (22 letters, the period, the comma, and the space). Imagine how many books that would be! Some of the books are pure gibberish, while others are highly relevant and useful. The latter may include predictions of the future and biographies of any person, along with slightly different or erroneous versions and translations into all languages.
Surely a reader entering the library would find the sheer volume of books unmanageable. It’s a pure glut of information, with no way to distinguish the meaningful books from the useless ones. But as the story progresses, the librarians begin to take matters into their own hands. In a desperate attempt to make sense of the mass of information available, they adopt extreme behaviors. Some become “Purifiers,” librarians who arbitrarily purge books they deem nonsense. They define the criteria for what is good – and what is not – with little to no input from others. Others, in contrast, believe that somewhere, hidden in the vast realm of chaos, there is a book that catalogs all the library’s contents. They also believe in a “man of the book” who has found – and read – this index and translated it into something useful for people entering the library. Clearly, this index would be helpful to people trying to find or understand the library’s contents. In both cases, the goal is to gain control over the library and the vast number of books it contains so that readers can find what they need. But the approaches are, indeed, very different.
Now, you’re probably wondering what “The Library of Babel” has to do with data. Well, the parallels are actually quite striking. Think about your data lake. In theory, it contains nearly every piece of data in your organization. Some of the data is meaningful, understood, and trusted. Other data is gibberish because it lacks meaning and trust. Both types of data live together in the data lake, and distinguishing the good from the bad is no simple task.
Moreover, organizations must also look outward, as there are many more hexagonal rooms to scour. IDC, a market-research firm, predicts that the “digital universe” (the data created and copied every year) will reach 180 zettabytes (i.e., 180 followed by 21 zeros) in 2025 (see chart). Pumping it all through a broadband internet connection would take over 450 million years. (This paragraph draws on The Economist.) In fact, I believe that the real era of big data is still to come.
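To get a feel for those numbers, here is a rough back-of-the-envelope check. It is only a sketch: the 100 Mbit/s link speed is my own assumption (the quoted figure does not say which broadband speed it uses), and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the "450 million years" figure.
# ASSUMPTION: a 100 Mbit/s broadband link; the source does not state the speed.

ZETTABYTE = 10**21                      # bytes (decimal definition)
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # average year, including leap days


def transfer_years(total_bytes: float, link_bits_per_second: float) -> float:
    """Years needed to push total_bytes through a link of the given speed."""
    seconds = total_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_YEAR


def implied_link_speed(total_bytes: float, years: float) -> float:
    """Link speed (bits per second) that would move total_bytes in `years`."""
    return total_bytes * 8 / (years * SECONDS_PER_YEAR)


digital_universe = 180 * ZETTABYTE

years_at_100_mbit = transfer_years(digital_universe, 100e6)
speed_for_450m_years = implied_link_speed(digital_universe, 450e6)

print(f"At 100 Mbit/s: {years_at_100_mbit / 1e6:.0f} million years")
print(f"Speed implied by 450 million years: {speed_for_450m_years / 1e6:.0f} Mbit/s")
```

Under that assumption the transfer comes out at roughly 456 million years, which is consistent with "over 450 million years" and suggests the quoted figure had a link of about 100 Mbit/s in mind.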
The nature of data has changed as well. It is no longer just blocks of structured information, such as databases, data warehouses, and well-defined master customer records with age, sex, and home address. It is more about finding and rapidly understanding real-time streams of data: social media updates, mass transit movements, and the hundreds of sensors in jet engines and public places.
Now think about the people in your organization who manage and use the data. Surely there are “Purifiers” – the people who purge data at will in an effort to control the data chaos that exists within the lake. They are the data authority – the ones who decide what data is right – and what data is not. They define their own standards for quality without engaging with others across the business. And they refuse to compromise when data fails to comply. Like the “Purifiers” in Borges’ story, they purge data deemed unworthy. To me, they are not the best people to manage your data. Why? Because even though their intentions are pure, they do not collaborate. That means others have no say in which data stays and which data goes. And it’s possible that the data the Purifiers expel is critical to a certain area of the business.
There are others who manage data who are collaborative in nature – the people who embrace data citizenship. They believe it’s possible to get a grip on the data by working together across the organization to understand the data’s meaning and use. These people believe there is a way to control the data before it enters the data lake. They are advocates for defining rules and operating models about which data enters the data lake. And they work hard so that all users can find the data, understand what it means, and trust that it is right. They are searching for the mythical “man of the book” so that they, too, can uncover an index of all the data hidden in the depths of the data lake.
In the world of data, we know that no such book – nor “man of the book” – exists. However, many organizations are using a data catalog to help them gain control over the glut of information stuffed into their data lakes. A data catalog helps organizations index the data and link it to agreed-upon definitions about quality, trustworthiness, and use. It helps users to determine which data is fit to use – and which they should discard because it’s incomplete or irrelevant to the analysis at hand. It provides the collaboration that is lacking when a “Purifier” takes control. And it helps all data users to find, understand, and trust their data.
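To make the idea a little more concrete, here is a minimal sketch of what such an index might look like. It is a toy illustration, not how any particular catalog product works: the class names, fields (`definition`, `steward`, `completeness`, `certified`), sample data sets, and the 0.9 threshold are all assumptions of mine.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One indexed data set, linked to its agreed-upon meaning (hypothetical fields)."""
    name: str
    definition: str        # business meaning agreed across teams
    steward: str           # who answers questions about this data
    completeness: float    # share of required fields populated, 0.0 to 1.0
    certified: bool = False  # has the business signed off on it?
    tags: tuple = ()


class DataCatalog:
    """A toy index over the data lake: find, understand, and trust."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def find(self, tag: str) -> list:
        """Everything in the lake that carries the given tag."""
        return [e for e in self._entries.values() if tag in e.tags]

    def fit_for_use(self, tag: str, min_completeness: float = 0.9) -> list:
        """Only the tagged entries that are certified and complete enough."""
        return [
            e for e in self.find(tag)
            if e.certified and e.completeness >= min_completeness
        ]


catalog = DataCatalog()
catalog.register(CatalogEntry("crm_customers", "Active customers, one row per person",
                              "sales-ops", 0.97, certified=True, tags=("customer",)))
catalog.register(CatalogEntry("web_signups_raw", "Unvalidated signup form dumps",
                              "marketing", 0.55, certified=False, tags=("customer",)))
catalog.register(CatalogEntry("engine_sensors", "Jet engine telemetry stream",
                              "engineering", 0.99, certified=True, tags=("sensor",)))

found = [e.name for e in catalog.find("customer")]
usable = [e.name for e in catalog.fit_for_use("customer")]
print("Found:", found)
print("Fit for use:", usable)
```

Notice that nothing is purged here. The raw signup data stays in the lake and remains findable; the catalog simply tells the analyst that it is not yet fit for this use, and who to talk to about it.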
How do you manage your data lake? With a “Purifier” or a “Man of the Book?”
Pieter De Leenheer is cofounder and Chief Science Officer of Collibra. He leads the company’s Research & Education group, including Collibra University, an online learning platform for data governance and data science education. Prior to cofounding the company, De Leenheer was a professor at VU University of Amsterdam. Today he serves as adjunct professor at Columbia University and as visiting scholar at several universities across the globe, including UC San Diego and Stanford.