Bayesian Convergence: What it Will and Won't Do
In my continuing quest for the proper prior, I happened upon a nice result: under very general conditions, the effect that the prior has on the current belief vanishes as evidence accumulates. Nice! This means that a Bayesian learner will do well regardless of the choice of prior: the learned beliefs will fit the evidence.
Does this mean the search for the correct prior is needless?
To answer that question, I should first give the convergence result in a bit more detail. (To get it in full detail, see this paper.)
The key assumption is that no model of the world has zero probability. Bayesian learning will never increase the probability of such a model, so this makes sense. The convergence result is most easily understood in the language of likelihood ratios. In this version of Bayesian learning (which gives the same end results as other versions, just by different intermediate steps) we start out with the "prior odds," rather than the prior probability, of a model. The prior odds of a model is just the probability for that model divided by the probability against. For each new bit of evidence that comes in, we multiply the current odds by the "likelihood ratio". The likelihood ratio is the probability of the evidence given the model, divided by the probability of that evidence given its negation. (The probability of the evidence given the negation is actually a weighted sum of its probability given each of the other possible models, with weights proportional to the current probabilities of those models.) Now, as we observe more and more evidence, we multiply again and again to update the odds for each model we're considering. Yet the prior odds remain a single constant factor at the beginning of that long line of multiplication. The odds of a model, then, can become as large as they like regardless of prior, and likewise can become as small as they like. The evidence is what matters, not the prior.
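To make the odds update concrete, here is a minimal sketch in Python. It assumes the simplest possible setting, which is my own choice for illustration: only two models of a coin (one saying heads comes up 80% of the time, one saying 50%), so that the "negation" of a model is just the single alternative. The function names and the data are hypothetical.

```python
def odds_from_prob(p):
    """Convert a probability into odds (for : against)."""
    return p / (1.0 - p)


def prob_from_odds(odds):
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)


def update_odds(prior_prob, data, p_heads_model, p_heads_alt):
    """Start from the prior odds, then multiply by one likelihood ratio per flip."""
    odds = odds_from_prob(prior_prob)
    for flip in data:
        like_model = p_heads_model if flip == 1 else 1.0 - p_heads_model
        like_alt = p_heads_alt if flip == 1 else 1.0 - p_heads_alt
        odds *= like_model / like_alt  # the likelihood ratio for this flip
    return odds


# Made-up evidence: 200 flips, 80% of them heads (1 = heads, 0 = tails).
data = [1, 1, 0, 1, 1] * 40

# Two learners with very different priors for the "80% heads" model.
skeptic = prob_from_odds(update_odds(0.01, data, 0.8, 0.5))
believer = prob_from_odds(update_odds(0.99, data, 0.8, 0.5))

print(skeptic, believer)
```

Both learners end up all but certain of the 80% model. The prior odds appear only once, as the first factor, while the evidence contributes 200 factors.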
In any case, I am not about to drop my concern about which prior a rational entity should choose. The main reason for this is that the convergence result leaves open the question of the class of models to be considered, which is my primary concern. Even if this were settled, however, the convergence theorem would not convince me that ensuring nonzero probability for each model is sufficient. The reason has to do with predictions.
To make the case as extreme as possible, I'm going to ignore probabilistic models, and only consider deterministic ones. This is actually no restriction at all; any prior over probabilistic models could be seen as a fancy way of specifying a prior over totally deterministic ones. A probabilistic model gives a probability to each possible dataset, and a prior over many probabilistic models can be seen as just a weighted sum of these, giving us a new (possibly more complicated) distribution over the possible datasets. This can be used as a prior. In fact, since it gives the same overall probability for each dataset, it is for practical purposes the same prior; yet the models it considers are deterministic.
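As a small illustration of that collapse, the following Python sketch takes a prior over two probabilistic coin models and turns it into a single distribution over datasets (here, binary sequences of length 3). The two models, their weights, and the function names are assumptions chosen for the example; exact fractions are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction
from itertools import product


def bernoulli_prob(seq, p_one):
    """Probability that a coin with P(1) = p_one produces exactly this sequence."""
    prob = Fraction(1)
    for bit in seq:
        prob *= p_one if bit == 1 else 1 - p_one
    return prob


def mixture_over_datasets(weighted_models, length):
    """Weighted sum of the models' distributions: one probability per dataset."""
    return {
        seq: sum(weight * bernoulli_prob(seq, p_one) for p_one, weight in weighted_models)
        for seq in product((0, 1), repeat=length)
    }


# Hypothetical prior: 1/4 on a fair coin, 3/4 on a coin that lands 1 nine times in ten.
models = [(Fraction(1, 2), Fraction(1, 4)), (Fraction(9, 10), Fraction(3, 4))]
dataset_prior = mixture_over_datasets(models, 3)

for seq, prob in sorted(dataset_prior.items()):
    print(seq, prob)
```

Each key of `dataset_prior` can be read as a deterministic model ("the data will be exactly this sequence"), and the value is its prior probability.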
Now that we're working with completely deterministic models, the data will either fit or not fit with each model. When it doesn't fit, we throw that model out. The convergence theorem still holds, because the set of models we're considering will keep shrinking as we throw more out; whenever this happens, the probability that belonged to the discredited model will be redistributed among the still-valid ones. Thus the probability of the correct model will continue to increase (since it's never thrown out).
However, this is not much comfort! The relative probabilities of the models still in consideration will not be based on the evidence at all; they will still be based purely on the prior. (The probability from a model that gets thrown out is redistributed, but not evenly.) This means that when we make predictions, the prior is (in a loose sense) the only thing that determines our prediction.
In fact, if the prior assigns nonzero probability to every possible dataset, then the set of models not yet ruled out will contain all possible futures. The only thing that can narrow this down to make a useful prediction is the prior, which may or may not do so in a way dependent on the evidence so far.
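Here is a rough Python sketch of this point, under assumptions of my own choosing: the deterministic models are all binary sequences of length 6, and two priors are compared, both giving nonzero probability to every sequence. One is uniform. The other is the prior you get by collapsing a uniform mixture over coin biases into a distribution over sequences (it reproduces Laplace's rule of succession). All names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def condition(prior, observed):
    """Throw out sequences that contradict the observed prefix, then renormalize."""
    survivors = {seq: p for seq, p in prior.items() if seq[:len(observed)] == observed}
    total = sum(survivors.values())
    return {seq: p / total for seq, p in survivors.items()}


def prob_next_is_one(posterior, n_seen):
    """Total posterior probability of the surviving sequences whose next bit is 1."""
    return sum(p for seq, p in posterior.items() if seq[n_seen] == 1)


def laplace_prior(sequences):
    """Weight k!(n-k)!/(n+1)! for a length-n sequence with k ones."""
    n = len(sequences[0])
    return {
        seq: Fraction(factorial(sum(seq)) * factorial(n - sum(seq)), factorial(n + 1))
        for seq in sequences
    }


sequences = list(product((0, 1), repeat=6))
uniform = {seq: Fraction(1, len(sequences)) for seq in sequences}
laplace = laplace_prior(sequences)

observed = (1, 1, 1, 1)  # four ones in a row
post_uniform = condition(uniform, observed)
post_laplace = condition(laplace, observed)

p_uniform = prob_next_is_one(post_uniform, len(observed))
p_laplace = prob_next_is_one(post_laplace, len(observed))
print(p_uniform, p_laplace)  # 1/2 5/6
```

Both learners have thrown out exactly the same models, and among the survivors the probability ratios are the same as they were in the prior. Yet the uniform prior still predicts 1/2 for the next bit after any evidence whatsoever, while the other prior predicts 5/6 after four ones. The difference in prediction comes entirely from the prior.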
Perhaps someone objects: "But then, can't we just require that a prior's predictions do depend on the evidence? Isn't it an obviously silly mistake to construct a prior that violates this?" Unfortunately, simply ruling out these cases doesn't tell us what prior to use. What kind of dependence do we want? I want a prior that can in theory "notice any sort of regularity"; but this includes noticing that the data is just completely random (predictably unpredictable).
In a way, allowing probabilistic models is a very strange move. It's very similar to allowing models that are infinitely large; a probabilistic model, in effect, includes information about an infinite number of coin flips, which are used in a well-specified (deterministic) way to decide on predictions. Of course, when we specify a probabilistic model, we don't specify this infinite table of heads and tails; in fact, that's where probability theory gets its power. This is reminiscent of the idea of a "random sequence" being a more fundamental notion than "probability", as discussed in the previous post... but that's enough speculation for today.