Thursday, December 13, 2007

Let's take another look at the two statements I wanted to make about probability.

1. A probability asserted by a human is really a believed frequency.

2. A probability is a statement of uncertainty that could always be turned into certainty given more information.

If the second statement is true, then any probabilistic model is inherently incomplete. This means that there are no single-event probabilities; any "believed frequency" I assert is wrong unless it's 0 or 1, because a single event can only happen or not, and it's meaningless to give a probability.

What is meaningful, however, is giving a probability based on the limited information at hand. In doing so, we give our belief about the frequency that would result from looking at all situations that have the same givens: situations that would look the same from our perspective, but might turn out differently. I'd argue that this is basically what people intend when they give such probabilities.

However, this also has some snags: I can't seriously assert that people believe such parallel situations always literally exist. People could give probability estimates in unique situations that never occurred before and may never occur again.

To get past this, I reformulate my statement:

People give probability estimates (1) based on the limited information at hand, (2) only using relevant information, and (3) ignoring potentially relevant information if it doesn't match any previous experience and so doesn't help predict.

The first addition, that only information thought to be relevant is used, helps somewhat by reducing the previously crippling number of unique situations. Now, situations can be considered the same if they vary only in ways irrelevant to the event being predicted. The other addition, however, that potentially relevant information be ignored if it turns the situation into a unique situation, is the real clincher. It guarantees that the probability estimate is meaningful.
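To make the three clauses concrete, here is a minimal sketch of one way they might be read as a procedure. Everything in it is my own illustration, not a claim about how people actually compute: the function name `believed_frequency`, the representation of a situation as a dictionary of features, and the rule that a feature is dropped when its current value never appeared in past experience are all assumptions.

```python
def believed_frequency(history, situation, relevant):
    """Estimate a probability as a frequency over past cases with the same givens.

    history   -- list of (features, outcome) pairs; outcome is True or False
    situation -- dict of features describing the case at hand (clause 1)
    relevant  -- names of the features thought to bear on the event (clause 2)
    """
    # Clause 2: compare only on features thought to be relevant.
    keys = [k for k in relevant if k in situation]

    # Clause 3: ignore a relevant feature whose current value matches no
    # previous experience, since it cannot help predict.
    usable = [k for k in keys
              if any(case.get(k) == situation[k] for case, _ in history)]
    ignored = [k for k in keys if k not in usable]

    # Past situations that look the same from our perspective.
    outcomes = [outcome for case, outcome in history
                if all(case.get(k) == situation[k] for k in usable)]
    if not outcomes:
        # Each value was seen before, but never in this combination.
        return None, ignored
    return sum(outcomes) / len(outcomes), ignored


# Hypothetical data: did it snow?
history = [
    ({"sky": "cloudy", "season": "winter"}, True),
    ({"sky": "cloudy", "season": "winter"}, True),
    ({"sky": "cloudy", "season": "winter"}, False),
    ({"sky": "cloudy", "season": "winter"}, True),
    ({"sky": "clear", "season": "winter"}, False),
    ({"sky": "cloudy", "season": "summer"}, False),
]
situation = {"sky": "cloudy", "season": "winter",
             "comet": "visible", "shirt": "red"}
relevant = ["sky", "season", "comet"]

estimate, ignored = believed_frequency(history, situation, relevant)
print(estimate, ignored)
```

The shirt color never enters because it isn't thought relevant; the comet is thought potentially relevant but is set aside because nothing like it has been seen before. The sketch still returns nothing when the usable features were each seen before but never together, so it doesn't fully capture the guarantee claimed for clause 3.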

But there are still problems.

Clause 3 above may fix everything, but it's pretty problematic from the point of view of machine learning. A prediction made by ignoring some evidence should be given lower certainty. The math there is straightforward: we have some probability of the variable being relevant, and we have no idea how it affects things if it is. We therefore weight the two possibilities, adding our prediction of what happens if the variable isn't relevant to an even wash if it is. (This is an inexact statement, but whatever.) So the result is weaker for each ignored item.
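The weighting just described can be sketched in a few lines. I'm assuming here that "an even wash" means a probability of 0.5 for a yes-or-no event, and that ignored items are handled one at a time and independently; the names `weaken` and `weaken_all` are made up for the illustration.

```python
def weaken(p, relevance, wash=0.5):
    """Mix a prediction with an even wash.

    p         -- prediction made while ignoring the variable
    relevance -- probability that the ignored variable is relevant
    wash      -- what we fall back to if it is relevant (assumed 0.5)
    """
    return (1 - relevance) * p + relevance * wash


def weaken_all(p, relevances):
    """Apply the weakening once per ignored item (assumed independent)."""
    for r in relevances:
        p = weaken(p, r)
    return p


print(weaken(0.9, 0.2))            # one ignored variable
print(weaken_all(0.9, [0.2, 0.2])) # two ignored variables
```

Each application pulls the estimate toward 0.5 without crossing it, which is the sense in which the result gets weaker for each ignored item.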

The probabilities of relevance here are necessary, but to fit into the interpretation, they must themselves be given the same sort of interpretation; in other words, they've got to be estimates based on the limited amount of relevant information and so on. The "so on" includes a potential infinite regress: we need to again weaken our statements based on any potentially relevant but new variables, this again involves a probability of relevance, which again must be estimated in the same way, and so on. However, I'm not sure this is a problem. The reason I say this is that I see it as a series of progressively better estimates; in practice, we can cut it off at some point if we need to, and just use the coarsest possible estimates of the next-down level of probabilities. This could be reflected in a further modification:
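Here is a rough sketch of how the cutoff might work, under heavy assumptions of my own: there is one ignored variable per level, `raw[0]` is the unweakened prediction, `raw[1]` is the unweakened probability that the ignored variable is relevant, `raw[2]` plays the same role for `raw[1]`, and so on. The fallback value `coarse_relevance` and the 0.5 "even wash" are placeholders, not anything I can justify.

```python
def estimate(raw, max_depth, level=0, coarse_relevance=0.1):
    """Weaken raw[level] by the estimated relevance of what it ignored.

    The relevance is itself an estimate from the next level down. Once
    max_depth is reached (or raw runs out), the regress is cut off and a
    coarse, fixed relevance is used instead.
    """
    if level >= max_depth or level + 1 >= len(raw):
        relevance = coarse_relevance  # cutoff: coarsest possible estimate
    else:
        relevance = estimate(raw, max_depth, level + 1, coarse_relevance)
    # Same weighting as before: keep the raw estimate if the ignored
    # variable is irrelevant, fall back to an even wash (0.5) if it is not.
    return (1 - relevance) * raw[level] + relevance * 0.5


raw = [0.9, 0.2, 0.1]  # hypothetical unweakened estimates, one per level
for depth in range(3):
    print(depth, estimate(raw, depth))
```

Going one level deeper changes the answer, but every level only feeds into the one above it as a weight between 0 and 1, so stopping early gives a cruder estimate rather than a meaningless one.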

People give probability estimates based on as much of the relevant information at hand as they can quickly decide the consequences of.

In other words, we don't compute instantaneously, so we may not use all the relevant information at our disposal, or may use some only partially (using it in the most important estimates, but making less important estimates more quickly by ignoring more information).

This basically seems to reconcile the statements I began with, (1) and (2) at the top of the page. However, I'm still not completely sure about (2). I still may want to assert that there are actual alternatives in the world.