Friday 26 February 2016

Hierarchical Bayesian: Automatic Occam's Razor

There is a crisis of "reproducibility" right now in science (Prinz et al 2011; Simmons et al 2011; Begley et al 2012): conservative estimates are that more than 50% of published scientific findings are false or cannot be reproduced. Bad estimates are rampant, and false positives are the norm! This new 'digital dark age' has many culprits. One contributor is, I think, a misuse and misunderstanding of the dominant statistical paradigm called "Frequentism". Frequentism is a great intellectual achievement, with lofty ideals like 'objectivity': let the data speak for itself, and don't inject our own hopes and desires into our scientific research. Sounds great, doesn't it? Why would we want otherwise?

Well, for one thing, researchers (especially dolphin researchers) rarely have enough data to use Frequentist tools properly. Thus, we violate the core principle of Frequentism, from which its name is derived:

In Frequentism, it is the observed data that is the random variable: had we repeated the experiment in a parallel universe, we would have gotten different data. We don't care about the actual data values (like an individual dolphin's measurements); instead, we care about the true population-level process that gives rise to this data. We try to make inferences about this true process by sampling from the data distribution, and then try to generalize over the entire data distribution.

The core principle is that the Frequentist is trying to use a small sample to characterize the entire data distribution. This requires a lot of data. And when we lack such data (because, heck, dolphins are difficult to find!), Frequentist estimates start to take on ridiculous values. For example, I can't tell you how many times a dolphin researcher using a Mark-Recapture model has estimated a survival of 100%. I am serious! There is a big problem with our supposedly 'objective' statistics when they lead us to believe that dolphins are immortal. But here is the catch: my knowledge that dolphins age and die is a type of "prior information", and using this prior biological knowledge leads us to a rival paradigm called Bayesianism.
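To see how easily this happens, here is a toy simulation in Python (just a binomial sketch, not a real Mark-Recapture model; the "true" survival of 0.85 and the sample of 6 animals are made-up numbers for illustration):

```python
# Toy illustration: how often does the plain maximum-likelihood estimate of a
# survival-like probability hit 100% when the sample is small?
import numpy as np

rng = np.random.default_rng(2016)
true_survival = 0.85    # hypothetical "true" annual survival
n_animals = 6           # tiny sample, as is common in dolphin studies
n_studies = 10_000      # repeat the small study many times

survivors = rng.binomial(n_animals, true_survival, size=n_studies)
mle = survivors / n_animals                # MLE of a binomial proportion
print("Share of studies where the MLE says survival is 100%:",
      np.mean(mle == 1.0))                 # roughly 0.38 with these settings
```

Even though no animal is immortal, roughly one small study in three would report a "perfect" survival estimate.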

I won't get into the philosophy of Bayesianism (partially because I don't quite understand it). The best I can do is imagine a Bayesian as a poker player: they don't consider the data in their hand (e.g., their current cards) to be random: they consider it a given, and they want to make the best predictions possible about the next cards to be drawn, conditional on the evidence in their hand. Seems reasonable enough, but it turns out this is in direct opposition to Frequentism.

Parsimony and Hierarchical Bayes


But one nice feature of Bayesianism is that we can encode into our statistical models an old ideal from Occam: "parsimony". Basically, we say "Don't tell me a dolphin is immortal unless you have a lot of data!" Technically, we can use "hyper-prior" distributions on parameters like survival to force simplicity in the absence of strong evidence, while allowing more complex models as the data accumulate and the sample size grows. This is demonstrated nicely in the following animation.


Look at the red Maximum Likelihood Estimates on the right (a Frequentist tool). They estimate a probability p from a series of 0-1 outcomes (like the number of dolphins captured out of N available), and this is done in parallel for 10 different "groups" (like 10 independent experiments). Notice that the ML estimates stay roughly the same over a wide range of data/evidence, scattering from p=0.2 to p=1 even though the "true" mean is around 0.5. The Hierarchical Bayesian posteriors (in purple) behave very differently. When the sample size is small (5 observations per group), each group's estimate is shrunk towards the overall mean at 0.5. Only as the sample size increases is each group allowed to take on its own individual value; and eventually, with a lot of data, each group's posterior expectation becomes nearly the same as the MLE.

This is a type of "model parsimony": with only a little bit of data, our model is simple (all groups have nearly the same estimate, close to the overall mean at 0.5). With more data, our model becomes more complex (all groups still share a mean, but they can also show individual variation, each with its own parameter p).
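Here is a minimal sketch of that shrinkage effect in Python. It is not the full hierarchical model behind the animation: instead of a hyper-prior, it moment-matches a single shared Beta prior to the group proportions (a crude empirical-Bayes stand-in), and all the numbers (10 groups, 5 observations each, the "true" probabilities) are invented for illustration:

```python
# Empirical-Bayes sketch of hierarchical shrinkage: per-group MLEs vs
# posterior means under a shared Beta prior fitted from the data itself.
import numpy as np

rng = np.random.default_rng(1)
true_p = rng.uniform(0.2, 0.8, size=10)     # each group's "true" probability
n_per_group = 5                             # try 5, 50, 500 to see the effect

successes = rng.binomial(n_per_group, true_p)
mle = successes / n_per_group               # the Frequentist per-group estimate

# Moment-match a common Beta(alpha, beta) prior to the observed proportions
m, v = mle.mean(), mle.var() + 1e-9
strength = max(m * (1 - m) / v - 1, 1.0)    # crude estimate of alpha + beta
alpha, beta = m * strength, (1 - m) * strength

# Conjugate update: small groups get pulled towards the overall mean
posterior_mean = (alpha + successes) / (alpha + beta + n_per_group)

print("MLE             :", np.round(mle, 2))
print("Shrunk estimates:", np.round(posterior_mean, 2))
```

With 5 observations per group the shrunk estimates are pulled towards the overall mean; rerun it with 500 observations per group and they become almost indistinguishable from the MLEs, just like in the animation.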

The abuse of MLE is nicely illustrated with the following experiment: imagine you want to estimate the fairness of a coin, but flip it only 3 times. Let's say you observed TTT (a rare event, but not impossible). The Maximum Likelihood Estimate would be that the probability of a tail is 100%, and the coin is definitely not fair. Of course, we know this is a terrible estimate and a stupid example, because we have solid prior information about the behaviour of coins. However, the sad fact is that many dolphin Mark-Recapture researchers are doing something similarly stupid, using small sample sizes and overly complex Frequentist models which estimate 100% detection probability, or 100% survival, or 0% movement. These terrible estimates are polluting our scientific literature.

The MLEs are indeed the most 'objective', but they make bad predictions for new data and give us ridiculous results at small sample sizes. In contrast, a little bit of skepticism encoded in a Bayesian prior (like a Beta(1,1)) can temper our estimates. Returning to the coin-flipping example, the posterior mean for the probability of a tail would be just 80%, given the observed sequence TTT. That's still not the true 0.5, but it is a better prediction than the Frequentist's 100%.
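The arithmetic is easy to check; here is the coin example in a few lines of Python (the flat Beta(1,1) prior is the only assumption):

```python
# Three flips, three tails: MLE vs the posterior mean under a flat Beta(1,1) prior.
tails, flips = 3, 3

mle = tails / flips              # 1.0 -> "tails is certain"
alpha, beta = 1, 1               # flat prior; for the posterior mean it acts
                                 # like one extra tail and one extra head
posterior_mean = (alpha + tails) / (alpha + beta + flips)   # (1+3)/(2+3) = 0.8

print(f"MLE: {mle:.2f}, posterior mean: {posterior_mean:.2f}")
```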

Unified Under Hierarchical Bayes


The poor performance of MLE was noticed a long time ago by non-statisticians in the Machine Learning and predictive-analytics communities. They obsessed over "regularization": finding ways to constrain the complexity of models and deliberately bias the estimates towards simplicity. Techniques like the LASSO, Ridge Regression, the Elastic Net, Boosting, AIC optimization, Bayesian model averaging, etc., were devised to make better predictions on new data, rather than to satisfy philosophical notions of probability and objectivity.

The eerie thing is that most of these techniques have a Bayesian interpretation. The famous LASSO is just a Bayesian model with a double-exponential (Laplace) prior on the regression coefficients. The idea is explored in an accessible article by Hooten and Hobbs (2015).
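You can verify the LASSO-as-prior claim directly: under a Gaussian likelihood, the LASSO's penalized sum of squares and the negative log-posterior with independent Laplace priors on the coefficients differ only by constants that don't involve the coefficients. A small Python check (the simulated data, noise variance sigma2, and penalty weight lam are arbitrary choices for illustration):

```python
# Check: the LASSO objective equals the negative log-posterior of a Gaussian
# likelihood with independent Laplace (double-exponential) priors, up to
# additive constants that do not depend on the coefficients.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -2.0, 0.0]) + rng.normal(size=n)
lam, sigma2 = 1.0, 1.0          # penalty weight and noise variance (arbitrary)

def lasso_objective(beta):
    # penalized least squares: 1/(2*sigma2) * SSE + lam * L1-norm
    return 0.5 / sigma2 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

def neg_log_posterior(beta):
    # Gaussian likelihood + Laplace prior, dropping beta-free constants
    log_lik = -0.5 / sigma2 * np.sum((y - X @ beta) ** 2)
    log_prior = -lam * np.sum(np.abs(beta))
    return -(log_lik + log_prior)

beta_try = rng.normal(size=p)
print(lasso_objective(beta_try), neg_log_posterior(beta_try))   # identical values
```

Minimizing one is minimizing the other, so the LASSO estimate is just the posterior mode (MAP) of that Bayesian model.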

Subjective Pseudo-Bayesianism and the Degradation of Science


If ecologists are willing to inject a little bit of bias into their analyses in order to get better predictions and sensible results, then what of objectivity? What is to stop someone from 'cooking the books'? Writing about cavalier 'pseudo'-Bayesian analysis, the Bayesian heavyweight James Berger warned in 2004:
Statistics owes its central presence in science and life to the facts that (i) it is enormously useful for prediction; (ii) it is viewed as providing an objective validation for science. Discarding the latter would lead to a considerable downgrading of the value placed on statistics.

At the time of his warning, few non-statisticians were using subjective Bayesian models, and even then, they seemed to want "objective" reference priors which conveyed no prior information and yielded nearly identical results to Frequentist models. It is a bit ironic that the opposite seems to have happened: the crisis of reproducibility was unveiled under the watch of the so-called objective school of statistics, not under subjectively biased Bayesian models. Now that the tables have turned, and more and more researchers are turning to Bayesian models, Berger's 2004 warning sits, waiting to be realized. The solution is probably for everyone to adopt a prediction mentality. Dolphin ecologists have already half-adopted this practice by using the AIC for model selection, but have yet to grapple with its philosophical underpinnings, or with rival ideas.
