About the Automatic Statistician project

Making sense of data is one of the great challenges of the information age we live in. While it is becoming easier to collect and store all kinds of data, from personal medical data, to scientific data, to public data, and commercial data, there are relatively few people trained in the statistical and machine learning methods required to test hypotheses, make predictions, and otherwise create interpretable knowledge from this data. The Automatic Statistician project aims to build an artificial intelligence for data science, helping people make sense of their data.

The current version of the Automatic Statistician is a system which explores an open-ended space of possible statistical models to discover a good explanation of the data, and then produces a detailed report with figures and natural-language text. While at Cambridge, James Lloyd, David Duvenaud and Zoubin Ghahramani, in collaboration with Roger Grosse and Joshua Tenenbaum at MIT, developed an early version of this system which not only automatically produces a 10-15 page report describing patterns discovered in data, but returns a statistical model with state-of-the-art extrapolation performance evaluated over real time series data sets from various domains. The system is based on reasoning over an open-ended language of nonparametric models using Bayesian inference.

Kevin P. Murphy, Senior Research Scientist at Google says: "In recent years, machine learning has made tremendous progress in developing models that can accurately predict future data. However, there are still several obstacles in the way of its more widespread use in the data sciences. The first problem is that current Machine Learning (ML) methods still require considerable human expertise in devising appropriate features and models. The second problem is that the output of current methods, while accurate, is often hard to understand, which makes it hard to trust. The "automatic statistician" project from Cambridge aims to address both problems, by using Bayesian model selection strategies to automatically choose good models / features, and to interpret the resulting fit in easy-to-understand ways, in terms of human readable, automatically generated reports. This is a very promising direction for ML research, which is likely to find many applications at Google and beyond."

The project has only just begun but we're excited for its future. Check out our example analyses to get a feel for what our work is about.