Let's take another look at the two statements I wanted to make about probability.

1. A probability asserted by a human is really a believed frequency.

2. A probability is a statement of uncertainty that could always be turned into certainty given more information.

If the second statement is true, then any probabilistic model is inherently incomplete. This means that there are no single-event probabilities; any "believed frequency" I assert is wrong unless it's 0 or 1, because a single event can only happen or not, and it's meaningless to give a probability.

What is meaningful however, is giving a probability based on the limited information at hand. In doing so, we give our belief about the frequency that would result from looking at all situations that have the same givens; situations that would look the same from our perspective, but might turn out differently. I'd argue that this is basically what people intend when they give such probabilities.

However, this also has some snags: I can't seriously assert that people believe such parallel situations always literally exist. People could give probability estimates in unique situations that never occurred before and may never occur again.

To get past this, I reformulate my statement:

People give probability estimates (1) based on the limited information at hand, (2) only using relevant information, and (3) ignoring potentially relevant information if it doesn't match any previous experience and so doesn't help predict.

The first addition, that only information thought to be relevant is used, helps somewhat by reducing the previously crippling number of unique situations. Now, situations can be considered the same if they vary only in ways irrelevant to the event being predicted. The other addition, however, that potentially relevant information be ignored if it turns the situation into a unique situation, is the real clincher. It guarantees that the probability estimate is meaningful.

But there are still problems.

Clause 3 above may fix everything, but it's pretty problematic from the point of view of machine learning. A prediction made by ignoring some evidence should be given lower certainty. The math there is straightforward; we have some probability of the variable being relevant, and we have no idea how it effects things if it is. We therefore weight the two possibilities, adding our prediction of what happens if the variable isn't relevant to an even wash if it is. (This is an inexact statement, but whatever.) So the result is weaker for each ignored item.

The probabilities-of-relevance here are necessary, but to fit them in to the interpretation, must be given the same sort of interpretation; in other words, they've got to be estimates based on the limited amount of relevant information and so on. The "so on" includes a potential infinite regress because we need to again weaken our statements based on any potentially relevant but new variables, and again this involves a probability of relevance, which again must be estimated in the same way, and so on. However, I'm not sure this is a problem. The reason I say this is that I see it as a series of progressively better estimates; in practice, we can cut it off at some point if we need to, and just use the coarsest possible estimates of the next-down level of probabilities. This could be reflected in a further modification:

People give probability estimates based on as much of the relevant information at hand as they can quickly decide the consequences of.

In other words, we don't compute instantaneously, so we may not use all the relevant information at our disposal, or may use some only partially (using it in the most important estimates, but making less important estimates more quickly by ignoring more information).

This basically seems to reconcile the statements I began with, (1) and (2) at the top of the page. However, I'm still not completely sure about (2). I still may want to assert that there are actual alternatives in the world.

## Thursday, December 13, 2007

## Saturday, December 08, 2007

Anyway, I've realized that in the past months, I drifted further and further into bayesianism with my thoughts on AI. That meant not only assuming that the bayesian math was the ideal that an implementation should approximate, but also putting more and more work onto Bayes Nets at the base level of the AI (and therefore spending most of my time figuring out how to extend bayes nets to get them to do what I wanted, rather than thinking about the entire system *around* the bayes nets).

So, realizing this, I put the base-level statistic on the back burner for the moment and went back to theorizing about the system as a whole. (Coming up with a great base-level statistic is still a good idea, but not of ultimate importance, to some extent.)

Here's where I was before I got distracted:

0. (base level) The system uses some set of base-level patterns; these can be almost anything, or a combination of several things, but we must require that there aren't any "cheaters": patterns self-report how well they fit particular data, so some pattern could just claim to be perfect. In other words, the base-level statistic needs to be somewhat smart to start out. We apply the base-level statistic to the data, recording where each possible pattern succeeds and fails. (For example, if we're using neural nets, what we're doing is going through every possible neural net and recording how well it fits each spot in the data. In practice, we can't do that, so we focus on neural nets that we get via standard training algorithms on local data subsets.)

1. (first wraparound) The base level statistic is applied back onto itself: it is used to predict when particular patterns work well and when they work badly. It then is applied back on itself again, to see where these second-order patterns work well and badly. And so on. This forms a hierarchy of patterns, which is good for two things: (1) it can increase the expressive ability of the base-level statistic; for example, if the base-level only looks for correlations between two variables, the hierarchy represents larger correlation sets; (2) it can increase the compressive ability of the models; for example, if the base-level simply records common sequences that can get arbitrarily long, there's no increase in expressive ability, but there is a great potential increase in compression of models; essentially, we're adding the ability to represent repetition in these sequences. A "hierarchical pattern", as opposed to a "base-level pattern", is a pattern that can be expressed using this method.

2. (second wraparound) The system also records a global average success for each pattern. (To be more bayesian, we *could* instead keep a density function representing current belief about each pattern's probability.) These records are examined for patterns, particularly patterns in the hierarchies formed by the 1st wraparound. What are the common features of successful hierarchical patterns? For example, do several hierarchical patterns look similar to each other except for isolated difference? Can we make a fill-in-the-blank type form that summarizes them? What typically goes in the blanks, with what probabilities? I typically call this new sort of pattern a "metapattern".

3. (third wraparound) When the system learns a fill-in-the-blank style metapattern, then the possible blank-fillers should be regarded as now being "the same" in some respect. The third wraparound specifies that whenever this happens, a new property should be invented for this purpose and attached to the elements in the data that satisfy it. This new property can then be used to form new base-level patterns. This allows those things to be grouped together for future purposes, so that the system can learn new patterns that reuse the same type of variable elements without having to go through the whole process of learning a metapattern again. Think of madlibs that reuse blank types such as "noun" and "verb" over and over again. These new types of pattern are what I call "variable patterns".

This system has a few problems, but the one I'm most concerned about is the difference in style of the 3rd wraparound. To me, it doesn't seem to fit. It's not really a "wraparound", it's just a statement of availability of a particular fact to the base level. In fact, the entire series of wraparounds is a statement of where we can look for patterns. Instead, why not go from the opposite direction: let the system look everywhere, except where we explicitly state that it can't?

"Anywhere" means: when we record how well a pattern fits on a particular location, we add that as part of the data. The global record of the strength of different patterns counts as part of the data. The definition of each pattern counts as part of the data.

From this, the third wraparound emerges by itself: so long as we can draw a connection between a pattern-instance in the data and the pattern's abstract record of success, metapatterns will be seen as directly related to the base-level data, and so participation in a metapattern can be seen as a base-level property.

The first problem that occurs is that the system could find may useless correlations here, for example the obvious correlation that will exist between the definition of a pattern and what sort of data it has a high score at. So we add our first restriction:

1. Anything that we can already logically deduce, we are not interested in learning probabilistically.

This restriction cannot be enforced completely, because we'd have to completely determine everything that we can or can't logically deduce; but that's fine. Instead, the system should ignore relationships that are logically obvious: things that it can quickly deduce. (this will be dependent on many things, but the sensitivity is a good thing, not a bad thing. The fact that we must draw an arbitrary line for how long a thing takes to deduce is also forgivable in my opinion.)

So far, I haven't seen the need for any additional restrictions.

The problem now is that I'm nto sure of the generality of this method. What class of patterns is it capable of learning? Since I abandoned the Turing machine as my most general class of patterns, I don't even know how to define the class of patterns I'm concerned with. I know "statements in predicate calculus" basically covers it, but there are several concerns:

1. Is 1st-order predicate calculus good enough? Or do I need higher-order predicate calculus, in which case I'd (personally) just go to using axiomatic set theory?

2. What atomic statements do I need? Do I need equality? Probably. Do I need arbitrary predicate operators that can be defined by partially (by making statements concerning their behavior), or can all needed predicates be composed of existing properties in the data (embedded in quantified boolean-logic statements)?

Of course, I'm not considering literally using predicate logic; what I mean is not "do I need X in predicate logic?" but "do I need something in my system that would translate to X if I translated everything to predicate logic?".

Before I can even address those concerns, however, another problem jumps out: how can I represent quantifiers to begin with? First, the quantifying I'm doing is probabilistic where predicate logic is not. That seems fine, though. The real problem is that the system sort of implicitly quantifies. Patterns state what variables they quantify over, but are silent about what variables they might be holding still. The system then takes the set of variables being quantified over and sort of roughly quantifies over them all at once; but in predicate logic, the order in which we quantify is important, so the two approaches are at odds.

Subscribe to:
Posts (Atom)