AI and Statistics: Perfect Together

Many companies develop AI models without a solid foundation on which to base predictions — leading to mistrust and failures. Here’s how statistics can help improve results.

    People are often unsure why artificial intelligence and machine learning algorithms work. More importantly, people can’t always anticipate when they won’t work. Ali Rahimi, an AI researcher at Google, received a standing ovation at a 2017 conference when he referred to much of what is done in AI as “alchemy,” meaning that developers don’t have solid grounds for predicting which algorithms will work and which won’t, or for choosing one AI architecture over another. To put it succinctly, AI lacks a basis for inference: a solid foundation on which to base predictions and decisions.

    This makes AI decisions tough (or impossible) to explain and hurts trust in AI models and technologies — trust that is necessary for AI to reach its potential. As noted by Rahimi, this is an unsolved problem in AI and machine learning that keeps tech and business leaders up at night because it dooms many AI models to fail in deployment.

    Fortunately, help for AI teams and projects is available from an unlikely source: classical statistics. This article will explore how business leaders can draw on statistical methods and statistics experts to address the problem.

    Holdout Data: A Tempting but Flawed Approach

    Some AI teams view a trained AI model as the basis for inference, especially when that model predicts well on a holdout set of the original data. It’s tempting to make such an argument, but it’s a stretch. Holdout data is nothing more than a sample of the data collected at the same time, and under the same circumstances, as the training data. Thus, a trained AI model, in and of itself, does not provide a trusted basis for inference for predictions on future data observed under different circumstances.
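
    To make the gap concrete, here is a minimal sketch, using simulated data rather than anything from a real project, that contrasts accuracy on a random holdout set with accuracy on data generated under shifted circumstances. A model can look excellent on the former and still stumble on the latter.

    ```python
    # Minimal sketch with simulated data: a random holdout looks reassuring,
    # but data collected under different circumstances tells another story.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # "Current" data: the outcome is driven almost entirely by feature 0.
    X_now = rng.normal(size=(5000, 3))
    y_now = (X_now[:, 0] + 0.1 * rng.normal(size=5000) > 0).astype(int)

    # "Future" data: circumstances shift and feature 1 now matters more.
    X_later = rng.normal(size=(5000, 3))
    y_later = (0.3 * X_later[:, 0] + X_later[:, 1]
               + 0.1 * rng.normal(size=5000) > 0).astype(int)

    X_train, X_hold, y_train, y_hold = train_test_split(
        X_now, y_now, test_size=0.2, random_state=0)
    model = GradientBoostingClassifier().fit(X_train, y_train)

    print("holdout accuracy:", accuracy_score(y_hold, model.predict(X_hold)))
    print("future accuracy: ", accuracy_score(y_later, model.predict(X_later)))
    ```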

    What’s worse, many teams working on AI models fail to clearly define the business problem to be solved. This means that the team members are hard-pressed to tell business leaders whether the training data is the right data. Any one of these three issues (bad foundation, wrong problem, or wrong data) can prove disastrous in deployment — and statistics experts on AI teams can help prevent them.

    Many IT leaders and data scientists feel that statistics is an old technology that is no longer needed in the era of big data and AI. Some business leaders even recall stats as their least favorite course in school and try to avoid it. Hiring often focuses instead on skills associated with machine learning, cloud computing, text analysis, and deep learning — skills that are seen as both sexy and powerful.

    While there is a grain of truth to statistics being an old technology, organizations still need it. The discipline of statistics, with its roots in natural sciences and mathematics, teaches statisticians to think of models as approximating some scientific “truth” in a population beyond the data in hand, and to quantify the potential error in doing so. This is the sort of basis for inference we seek for AI.

    How Teams Can Apply Statistics to AI Work

    So how can leaders bring statistics experts — and their methods — to bear during AI projects to improve the odds of successful AI deployments? Let’s look at four examples.

    1. The Right Data

    Identifying AI bias is a top challenge for 83% of machine learning professionals, according to a September 2023 survey conducted by Aporia. Fortunately, survey sampling, a discipline within statistics, has developed a deep theory of potential biases in data, including sampling bias, nonresponse bias, biased questions, and many others. These considerations can help AI teams better understand potential biases and limitations in their data sets.

    As we have previously written, machine learning models are too often built upon the data sets that are available rather than on the right data, meaning the data most appropriate to solving the problem at hand. The sole focus of statistically designed experiments is to obtain the right data to address a particular problem or question.

    2. Randomization

    Experimental design also produced one of the greatest breakthroughs in the history of data science: the randomized trial. Randomized clinical trials remain the gold standard in pharmaceutical development, and the A/B tests regularly used by Google, Meta, and other tech companies are basic randomized trials.

    It is noteworthy that Abhijit Banerjee, Esther Duflo, and Michael Kremer won the Nobel Memorial Prize in Economic Sciences for their application of randomized experiments to poverty alleviation. Randomization provides an excellent basis for inference because it prevents variables not in the data set (lurking variables or dark data) from confounding the results. This allows for the determination of causal relationships, not just correlation. An understanding of causal relationships is arguably the best possible basis for inference.
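
    The logic of a basic A/B test can be written in a few lines. The sketch below uses simulated visitors and made-up conversion rates; it randomly assigns each visitor to one of two versions and applies a standard two-proportion z-test to judge whether the observed difference is more than noise. Because assignment is random, lurking variables are balanced across the two groups on average.

    ```python
    # Minimal randomized A/B test sketch (simulated visitors, made-up rates).
    import numpy as np
    from statsmodels.stats.proportion import proportions_ztest

    rng = np.random.default_rng(1)
    n = 10_000

    # Randomization: each visitor gets version A or B by a coin flip.
    assignment = rng.integers(0, 2, size=n)  # 0 = version A, 1 = version B

    # Simulated outcomes: version B truly converts a bit better.
    converted = rng.random(n) < np.where(assignment == 1, 0.12, 0.10)

    counts = [converted[assignment == 0].sum(), converted[assignment == 1].sum()]
    nobs = [(assignment == 0).sum(), (assignment == 1).sum()]
    stat, p_value = proportions_ztest(counts, nobs)
    print(f"conversion A: {counts[0] / nobs[0]:.3f}, "
          f"conversion B: {counts[1] / nobs[1]:.3f}, p-value: {p_value:.4f}")
    ```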

    3. Model Testing

    A third example involves using statistics to design tests of already deployed AI models. Consider a credit scoring model. The company obtains performance data on credit that it issues but is left in the dark on the flip side: Were its decisions to deny credit correct? The company may never know. The only remedy is to grant credit in some cases where the company would ordinarily not do so, just to test the AI model. Designing and evaluating the experiments to grant credit in this type of test falls into the domain of statistics. Some companies we are working with are already doing this.
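
    A minimal sketch of that kind of test appears below, with hypothetical scores, an arbitrary cutoff, and a made-up test rate: a small random slice of would-be denials is approved so that repayment outcomes are eventually observed on both sides of the decision boundary.

    ```python
    # Sketch of a designed test for a deployed credit model (hypothetical
    # scores, cutoff, and test rate): approve a small random slice of denials
    # so outcome data eventually covers the region the model never sees.
    import numpy as np

    rng = np.random.default_rng(2)
    scores = rng.uniform(300, 850, size=1000)  # model's scores for applicants
    cutoff = 650                               # business rule: approve at or above
    test_rate = 0.05                           # fraction of denials approved as a test

    approve = scores >= cutoff
    test_grants = ~approve & (rng.random(scores.size) < test_rate)
    approve = approve | test_grants

    print(f"standard approvals: {(scores >= cutoff).sum()}, "
          f"test approvals: {test_grants.sum()}")
    # Repayment outcomes on the test approvals later show whether denials were correct.
    ```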

    4. Statistical Process Control

    Finally, statistical process control (SPC) provides methods for monitoring processes over time to rapidly detect changes in performance. SPC can be applied to monitoring the performance of AI models after deployment, but few machine learning developers have studied it. When models maintain performance over time, especially on new data, we have another basis for inference: The prediction accuracy is stable over time.
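
    A basic Shewhart-style control chart illustrates the idea. The sketch below uses simulated weekly accuracy figures: control limits are computed from a stable baseline period, and any later week that falls outside them is flagged for investigation.

    ```python
    # Shewhart-style control chart on a model's weekly accuracy (simulated).
    import numpy as np

    rng = np.random.default_rng(3)
    baseline = rng.normal(loc=0.92, scale=0.01, size=20)  # 20 stable weeks
    recent = np.concatenate([rng.normal(0.92, 0.01, 8),   # 8 more stable weeks ...
                             rng.normal(0.86, 0.01, 4)])  # ... then performance drifts

    center = baseline.mean()
    sigma = baseline.std(ddof=1)
    upper, lower = center + 3 * sigma, center - 3 * sigma  # 3-sigma control limits

    for week, acc in enumerate(recent, start=1):
        flag = "" if lower <= acc <= upper else "  <- out of control, investigate"
        print(f"week {week:2d}: accuracy {acc:.3f}{flag}")
    ```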

    Why Your Team May Need Statistical Twins

    Thinking even bigger, statistics can assist teams that are developing AI models via a statistical twin, which is analogous to a digital twin of physical systems. Teams can pair a machine learning model with a more traditional statistical model and develop both at the same time. The machine learning model will almost always provide better prediction accuracy on data in the holdout set, but the statistical model offers some advantages: It features parameters and coefficients with real-world interpretations that are explainable and can be compared to current subject matter knowledge. Statistical models come fully equipped with goodness-of-fit and uncertainty measures, which are great aids in determining whether the right data has been employed and in extrapolating beyond the holdout set.
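
    The pairing is straightforward to prototype. In the sketch below, on simulated data, a gradient boosting model is fit alongside a logistic regression twin; the code compares their holdout accuracy and reports the twin's coefficients and the share of cases on which the two models agree. The data and model choices here are illustrative, not a prescription.

    ```python
    # "Statistical twin" sketch (simulated data): an ML model and a traditional
    # statistical model fit side by side on the same training data.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(4)
    X = rng.normal(size=(4000, 4))
    y = (1.5 * X[:, 0] - X[:, 1] + X[:, 0] * X[:, 1]
         + 0.5 * rng.normal(size=4000) > 0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    ml_model = GradientBoostingClassifier().fit(X_train, y_train)  # the ML model
    twin = LogisticRegression().fit(X_train, y_train)              # its statistical twin

    print("ML accuracy:   ", accuracy_score(y_test, ml_model.predict(X_test)))
    print("twin accuracy: ", accuracy_score(y_test, twin.predict(X_test)))
    print("twin coefficients:", np.round(twin.coef_[0], 2))        # interpretable, checkable
    agree = (ml_model.predict(X_test) == twin.predict(X_test)).mean()
    print("prediction agreement:", round(agree, 3))
    ```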

    Bottom line: Developers can calibrate the models against one another and test how closely the predictions match. This is especially important in determining areas (or subpopulations) in which one model or the other performs poorly. There is absolutely nothing wrong with using the “usually superior” model most of the time and the “supplemental model” where it provides superior performance or when explainability, as noted earlier, is critical.

    Further, machine learning developers tend to focus narrowly on finding the single best model, typically based on one metric: out-of-sample prediction accuracy. Statisticians are trained to think about models more broadly: Is the model appropriately simple? Where is the model weak and improvable? Are extraneous variables included? Can we pair the model with domain knowledge to suggest a causal relationship?

    Using a pair of models developed using different approaches lets data science teams have their analytic cake and eat it too. While statistical methods may not match the power of machine learning models, they certainly augment them by helping fill in the basis-of-inference gap.

    More Changes Business Leaders Can Make Now

    Business leaders, not the people building AI models, are responsible for the ultimate performance of AI systems and the related machine learning models. Managers must ask the right questions to ensure that modelers employ the right data to solve the right problems. Managers can view statistics as a way of thinking about problems that is used in conjunction with a collection of tools that can help determine the right data — and provide models with firmer bases for inference. Once statistics is seen in this light, integrating statistics methods and talent with AI teams seems like an obvious approach. Doing so represents a huge opportunity to address quality issues that have slowed progress in AI deployments.

    As an immediate first step, managers should pose the following additional questions to AI and machine learning model developers:

    1. Does the model match expectations based on domain knowledge? For example, a machine learning model developed to predict adverse outcomes among patients with pneumonia suggested that patients with asthma were less likely to experience negative outcomes. This contradicted known medical science; upon further investigation, it turned out that asthma patients typically receive emergency treatment for pneumonia precisely because their risk is so high, and that treatment, which the model had no way of capturing, is what drove their better outcomes in the data. What you want to hear: The people with the best domain knowledge have vetted the model.
    2. Could the model be simplified? What you want to hear: Extensive testing has shown that models with any fewer features result in significantly worse prediction accuracy.
    3. What is your basis for inference? In other words, on what basis are you confident that the model will perform well going forward? What you want to hear: Answers that refer to the right data being utilized, randomization, or knowledge of causal relationships. Watch out for the claim that out-of-sample prediction accuracy ensures accuracy on future data collected under different circumstances.

    Next, as soon as possible, managers should bring to bear as much statistics expertise as they or their teams currently have on AI project work. Many teams can, for example, apply statistical techniques to test models, as we described in the credit scoring example above.

    Numerous statisticians in tech companies have consistently told us that they feel valued for their machine learning skills, not their statistical skills. The implication is that while many companies already have a critical mass of statistical skills, they are not deploying statistics experts in the right ways. Put them to work doing statistics.

    Third, managers must rethink the composition of their data science teams with an eye toward adding technical diversity. Too many teams are made up of data scientists with similar skills, which results in a team that is technically an inch wide and a mile deep. To illustrate, an American football team consisting of all quarterbacks is destined to lose every game. Statisticians, and others with statistical skills, can broaden data science and AI teams.

    Further, managers should work with HR departments to take an organizational skills inventory relative to AI, machine learning, and statistics. Many applied statisticians have experience working with nontechnical people and have developed collaboration skills to help define the fundamental business problem at hand. Statisticians learned long ago that the stated problem often turns out not to be the real problem. In AI work, these skills prove critical to properly framing the business problem and obtaining the right data. The skills inventory should delve into this area. Longer term, managers must recruit people with these skills in order to avoid blind spots.

    Both tech and business leaders have spent too much time nervously pacing the floor, hoping for the best in AI deployments. Statistical methods can augment current machine learning methods and help provide a basis for inference that instills confidence — one based not on hope but on science.
