Machine learning and artificial intelligence wouldn’t be possible without the statistical models that underpin their analytic capabilities. A Cornell statistician and his colleague have developed a new method for analyzing complex datasets that is more flexible, more accurate and easier to use.
Dan Kowal, associate professor of statistics and data science, a shared department in the College of Agriculture and Life Sciences, the School of Industrial and Labor Relations and the Cornell Ann S. Bowers College of Computing and Information Science, is lead author of “Monte Carlo Inference for Semiparametric Bayesian Regression,” which was published Oct. 1 in the Journal of the American Statistical Association. The co-author is Bohan Wu, now a Ph.D. student at Columbia University.
“This method gives people more power when they’re working with messy data and trying to untangle the complexities of various effects,” Kowal said. “I want people to be using reliable models so that they can really tease out the signal from the noise. We’ve found empirically that this method can do that across a broad array of different data types, distributions and settings. It’s exactly the kind of contribution that excites me as a statistician.”
Bayesian regression analysis enables researchers to predict a range of outcomes instead of a single estimate. Kowal’s model is specifically designed to analyze “messier data” that doesn’t fit nicely into a bell curve, he said. It can analyze and make predictions on a huge variety of topics, including health care utilization, family incomes, financial markets and climate events. For example, doctors sometimes ask their patients to self-report on their mental health with questions like, “How many days in the last 30 days was your mental health not good?” A large number of people answer “0,” and another large number answer “30,” and the rest generally estimate by answering in increments of 5 or 7, Kowal said.
“With data like this, you get these spikes in the response that are more about the self-reporting than they are about the data itself,” he said. “If I’m trying to plan for health care capacity, I shouldn’t make decisions based on whether people are answering 14 vs. 15 vs. 16. But having models that can appropriately stretch out or compress these clumped data points enables your analysis to make more sense and ultimately be more useful.”
Kowal’s new method is also easier for researchers to use. Bayesian regression analyses typically rely on a complex algorithm, called Markov chain Monte Carlo, that demands a huge amount of computing power and multiple diagnostics to confirm that the algorithm itself is working properly. Kowal’s method avoids that algorithm.
“When people use Markov chain Monte Carlo, they have to do all types of diagnostics to make sure things are working well. The algorithm requires its own effort, independent of the model and the data you really care about,” he said. “In this paper, we actually completely circumvent that but still retain model flexibility and accuracy in predicting outcomes.”
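For readers who want a feel for the difference, the sketch below is a toy illustration only, not Kowal and Wu’s semiparametric method. It assumes a plain linear regression with a Gaussian prior and a known noise level, a textbook case in which the posterior has a closed form and can be sampled directly, with no Markov chain and therefore no convergence diagnostics. The simulated data, the prior scale `tau` and the new input `x_new` are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: y = 1 + 2x + noise
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
sigma = 0.5  # noise standard deviation, assumed known to keep the example simple
y = X @ beta_true + sigma * rng.normal(size=n)

# A Gaussian prior beta ~ N(0, tau^2 I) yields a Gaussian posterior in closed form
tau = 10.0
precision = X.T @ X / sigma**2 + np.eye(2) / tau**2
cov = np.linalg.inv(precision)
mean = cov @ X.T @ y / sigma**2

# Independent Monte Carlo draws from the posterior:
# no Markov chain, so no burn-in and no convergence checks
draws = rng.multivariate_normal(mean, cov, size=5000)

# Posterior predictive draws at a new input x = 1.0 (first entry is the intercept)
x_new = np.array([1.0, 1.0])
y_pred = draws @ x_new + sigma * rng.normal(size=draws.shape[0])

# A range of plausible outcomes rather than a single estimate
lower, upper = np.percentile(y_pred, [2.5, 97.5])
print(f"95% predictive interval: [{lower:.2f}, {upper:.2f}]")
```

Because every draw is independent, the only tuning decision is how many draws to take. The paper’s contribution, as Kowal describes it, is keeping that kind of simplicity for far more flexible models than this one.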
Kowal has built a website with documentation and examples of how to use his new method, and he’s published free, downloadable software on CRAN, the Comprehensive R Archive Network, the main repository for open-source packages for the R statistical computing language.
Krisy Gashler is a writer for the College of Agriculture and Life Sciences.