Whenever I speak with data scientists, the words “model” and “data” pop up all the time.
When I challenge these smart model builders about the importance of good data assets, they wholeheartedly agree. They say, “Of course, data is very important.”
Yet, when I push further, they often say it’s someone else’s job to look after the data. They say, “Bob or Mary is ensuring good data management with governance, quality, lineage. Data is their responsibility.”
So what comes first: The data or the model?
Let’s explore this fascinating debate a bit more in the context of 2023’s very hot summer topic of AI.
AI is dead! Long live AI!
Prior to November 2022, most organizations struggled with these key challenges building AI applications:
- It was hard to find good, experienced data scientists
- There were issues finding good quality data to build models
- Getting (and keeping) good models into production was daunting
Ultimately, these businesses did not adjust their approach, treating AI development like just another application development project…let’s call this “old AI.”
In November 2022, AI leaped into mainstream awareness, crossing the chasm of product adoption with OpenAI’s transformative offerings. Nearly six months have passed, and many other companies — including Microsoft, Google, and Hugging Face — have jumped into the market with generative AI products. Let’s call this “new AI.”
In the “old AI” world, you would carefully curate a good data asset, and then train that into a model that could churn out smart, automated business decisions.
In other words, the data came first. But the model would only know what it was told, and the data domain would typically be a specialized, niche area of your business.
In a recent survey, a majority of businesses said they now have AI initiatives on their roadmap. In fact, 78% say that scaling AI and ML use cases to create business value is their top priority over the next three years (1).
With the “new AI,” the rules have changed. A handful of organizations have now trained very large models, over very long training runs, on public data from the Internet. That data includes online conversations, news, documentation, blogs, social media — everything and anything that’s publicly available.
This broader, wider data domain led to a pretrained foundational model that now seems to know a little bit about everything, and can talk about anything eloquently, in any written style you can imagine, from analytical to poetic.
For businesses adopting “new AI,” the model came first (and has probably been wrapped in an easy API or hidden behind a chat interface), and it reached further and wider than any AI has ever reached.
However, the “new AI” foundational model is the same for everyone. You still need to teach it about your specific business: your customer conversations, your products and services, your organization’s knowledge graph — whatever domain you want to apply AI to, you still need to train it on. Without your own data to tune the foundational model, you’d have no way to differentiate.
Overview: The data- vs model-centric debate
At its essence, AI is made up of software and data.
The software — or code — is used to build AI models that data scientists ‘feed’ with data during the training stage of AI application development.
Until very recently, only large consumer technology companies had the capital, the data scientist teams, the massive datasets, and extensive compute capabilities required to develop generative AI applications of ChatGPT’s magnitude.
These efforts could afford to focus on model-centric AI development because they had both the data and the expertise to do so.
However, at least since Google Brain co-founder Andrew Ng’s July 2021 Harvard Business Review article, ‘AI Doesn’t Have to Be Too Complicated or Expensive for Your Business,’ data and enterprise leaders have been aware of the challenges with a model-centric approach for the majority of companies that want to use AI.
… the bottleneck for many [AI] applications is getting the right data to feed to the software. We’ve heard about the benefits of big data, but we now know that for many applications, it is more fruitful to focus on making sure we have good data — data that clearly illustrates the concepts we need the AI to learn. This means, for example, the data should be reasonably comprehensive in its coverage of important cases and labeled consistently. Data is food for AI, and modern AI systems need not only calories, but also high-quality nutrition.
While model-centricity is a strategy that focuses on improving performance by optimizing the model, the data-centric approach recognizes that most organizations need to optimize AI applications by focusing on the data.
Why data-centric AI development is right for most businesses
For AI practitioners and business leaders seeking efficient pathways to development, there is an active debate about whether to focus on a model-centric or a data-centric approach.
At Collibra, we know data is at the heart of any data product, and nowhere is this more true than with AI models.
Today, the AI community is moving toward a consensus that data quality and consistency improve AI accuracy more efficiently for most businesses than tweaking models. And especially when you are using one of the mainstream foundational models, you’ll need unique, high-quality data to differentiate yourself from all the API copycats.
Gartner recognizes this as well. Their January 2023 research note shows that organizations are actively focusing on data usage, data governance, data curation, training data and several other data-centric capabilities to improve the quality of models, mitigate bias, and augment the data science and machine learning (DSML) workflow (3).
While large, data-rich technology companies can rely on massive datasets and one-size-fits-all models to serve customers, everyone else should be focusing on curating their unique data assets and making sure they are of high quality.
After all, AI applications are only as good as the data that informs them. If the data is bad, the models trained or tuned on it will produce human-sounding language that looks good but remains fundamentally flawed.
Is AI an automated highway to value or problems?
As much as AI can bring value to your organization, it needs controls to harvest that value safely. We’ve covered the real risks — legal, financial, reputational — when generative AI goes wrong in our inaugural AI Governance blog.
The quality of predictions of AI models depends strongly on the data used to train the models. Poor data quality can result in inaccurate results and inconsistent model behavior, leading to lack of trust from customers and internal stakeholders.
From the attorney who presented legal precedents that turned out to be ChatGPT “hallucinations,” to the hospitals that wanted to apply ML algorithms to help patients, only to discover that the algorithms often misdiagnosed them (4) — the demand for AI is spreading across every segment of our society faster than the guardrails that protect organizations from catastrophic missteps.
In the UK, the national center for data science and AI — the Turing Institute — studied the capacity of ML algorithms to improve and accelerate patient diagnosis and triage during the pandemic.
According to the Institute, the predictive tools made little to no difference. “Problems around data availability, access and standardization spanned the entire spectrum of data science activity during the pandemic. The message was clear: better data would enable a better response.” (5)
As we stated in our AI governance overview, you must remember that your organization remains responsible for the outcomes, not just for the algorithm or its data.
AI governance for your data-centric development
How do you mitigate the risks inherent in AI application development? How does your organization ensure the quality, integrity, and ethical handling of the data used to train and operate AI systems?
What you need is a governance model for AI. You need AI governance.
How we define AI governance:
AI governance is the application of rules, processes and responsibilities to drive maximum value from your automated data products by ensuring applicable, streamlined and ethical AI practices that mitigate risk and protect privacy.
And whether you tackle models first or data first, your controls will need to wrap around both: from input data, through pipelines and models, all the way to your outcomes.
If your organization is leveraging (or planning to use) generative AI technologies, then it’s a good time to start thinking about AI governance.