Are You Getting Value From Your Data Lake? (But are you, really?)
At a time when data is growing exponentially, when data sources are multiplying quickly, and when machines are connecting data points at speeds that outstrip our own human capacity for thought, we need a new way to store vast amounts of data and make it accessible to analysts and other data users in near real time.
Data lakes, often built on Hadoop, seemed like the perfect answer. Scalability seemed infinite. Lots of data from just about anywhere could be ingested without time-consuming controls. And it could be stored as unstructured data, with schema-on-read capabilities for ultimate flexibility.
It was the promise of more data, more flexibility, and, ultimately, better insights.
You know the next chapter of this story: lots of data was ingested, because scalability! Very little data was tagged or identified, because flexibility! And the data lake, with all of its potential, became harder to manage and more opaque to the data scientists who needed to work in its depths.
But there’s good news. Since those early warning signs, many organizations have been working hard to clean up their data lakes, and they’ve made progress. Still, expectations for the data lake will only continue to grow. Without better processes in place to help your data scientists find, understand, and trust the data they need, driving value from your data lake as it grows will remain a challenge, and so will discovering those bold new insights your organization is counting on.
Today, organizations are more inclined to think about the data lake in terms of business value. What will the data lake be used for? How will it align with business goals? Is all data equal, or is some data more equal than other data? Aligning your data lake with business priorities is the first step in driving value.
The data lake must also serve the needs of the people fishing in it: your data scientists. If they aren’t able to find the data in the lake and make sense of it, their work is going to be hindered. That’s where governance, overlaid with a data catalog, can help. Governance, as we here at Collibra understand it, isn’t about locking down data. It’s about making data discoverable and meaningful. It’s about helping everyone understand which data is the right data in any given context. And it’s about connecting people to the data they need to do their jobs.
Maria Spanicciati is Content Manager and Editor of the Collibra blog.