## Saturday, December 08, 2007

Why is there no way to give these posts proper titles?

Anyway, I've realized that in the past months, I drifted further and further into bayesianism with my thoughts on AI. That meant not only assuming that the bayesian math was the ideal that an implementation should approximate, but also putting more and more work onto Bayes Nets at the base level of the AI (and therefore spending most of my time figuring out how to extend bayes nets to get them to do what I wanted, rather than thinking about the entire system *around* the bayes nets).

So, realizing this, I put the base-level statistic on the back burner for the moment and went back to theorizing about the system as a whole. (Coming up with a great base-level statistic is still a good idea, but not of ultimate importance, to some extent.)

Here's where I was before I got distracted:

0. (base level) The system uses some set of base-level patterns; these can be almost anything, or a combination of several things, but we must require that there aren't any "cheaters": patterns self-report how well they fit particular data, so some pattern could just claim to be perfect. In other words, the base-level statistic needs to be somewhat smart to start out. We apply the base-level statistic to the data, recording where each possible pattern succeeds and fails. (For example, if we're using neural nets, what we're doing is going through every possible neural net and recording how well it fits each spot in the data. In practice, we can't do that, so we focus on neural nets that we get via standard training algorithms on local data subsets.)

1. (first wraparound) The base level statistic is applied back onto itself: it is used to predict when particular patterns work well and when they work badly. It then is applied back on itself again, to see where these second-order patterns work well and badly. And so on. This forms a hierarchy of patterns, which is good for two things: (1) it can increase the expressive ability of the base-level statistic; for example, if the base-level only looks for correlations between two variables, the hierarchy represents larger correlation sets; (2) it can increase the compressive ability of the models; for example, if the base-level simply records common sequences that can get arbitrarily long, there's no increase in expressive ability, but there is a great potential increase in compression of models; essentially, we're adding the ability to represent repetition in these sequences. A "hierarchical pattern", as opposed to a "base-level pattern", is a pattern that can be expressed using this method.

2. (second wraparound) The system also records a global average success for each pattern. (To be more bayesian, we *could* instead keep a density function representing current belief about each pattern's probability.) These records are examined for patterns, particularly patterns in the hierarchies formed by the 1st wraparound. What are the common features of successful hierarchical patterns? For example, do several hierarchical patterns look similar to each other except for isolated difference? Can we make a fill-in-the-blank type form that summarizes them? What typically goes in the blanks, with what probabilities? I typically call this new sort of pattern a "metapattern".

3. (third wraparound) When the system learns a fill-in-the-blank style metapattern, then the possible blank-fillers should be regarded as now being "the same" in some respect. The third wraparound specifies that whenever this happens, a new property should be invented for this purpose and attached to the elements in the data that satisfy it. This new property can then be used to form new base-level patterns. This allows those things to be grouped together for future purposes, so that the system can learn new patterns that reuse the same type of variable elements without having to go through the whole process of learning a metapattern again. Think of madlibs that reuse blank types such as "noun" and "verb" over and over again. These new types of pattern are what I call "variable patterns".

This system has a few problems, but the one I'm most concerned about is the difference in style of the 3rd wraparound. To me, it doesn't seem to fit. It's not really a "wraparound", it's just a statement of availability of a particular fact to the base level. In fact, the entire series of wraparounds is a statement of where we can look for patterns. Instead, why not go from the opposite direction: let the system look everywhere, except where we explicitly state that it can't?

"Anywhere" means: when we record how well a pattern fits on a particular location, we add that as part of the data. The global record of the strength of different patterns counts as part of the data. The definition of each pattern counts as part of the data.

From this, the third wraparound emerges by itself: so long as we can draw a connection between a pattern-instance in the data and the pattern's abstract record of success, metapatterns will be seen as directly related to the base-level data, and so participation in a metapattern can be seen as a base-level property.

The first problem that occurs is that the system could find may useless correlations here, for example the obvious correlation that will exist between the definition of a pattern and what sort of data it has a high score at. So we add our first restriction:

1. Anything that we can already logically deduce, we are not interested in learning probabilistically.

This restriction cannot be enforced completely, because we'd have to completely determine everything that we can or can't logically deduce; but that's fine. Instead, the system should ignore relationships that are logically obvious: things that it can quickly deduce. (this will be dependent on many things, but the sensitivity is a good thing, not a bad thing. The fact that we must draw an arbitrary line for how long a thing takes to deduce is also forgivable in my opinion.)

So far, I haven't seen the need for any additional restrictions.

The problem now is that I'm nto sure of the generality of this method. What class of patterns is it capable of learning? Since I abandoned the Turing machine as my most general class of patterns, I don't even know how to define the class of patterns I'm concerned with. I know "statements in predicate calculus" basically covers it, but there are several concerns:

1. Is 1st-order predicate calculus good enough? Or do I need higher-order predicate calculus, in which case I'd (personally) just go to using axiomatic set theory?

2. What atomic statements do I need? Do I need equality? Probably. Do I need arbitrary predicate operators that can be defined by partially (by making statements concerning their behavior), or can all needed predicates be composed of existing properties in the data (embedded in quantified boolean-logic statements)?

Of course, I'm not considering literally using predicate logic; what I mean is not "do I need X in predicate logic?" but "do I need something in my system that would translate to X if I translated everything to predicate logic?".

Before I can even address those concerns, however, another problem jumps out: how can I represent quantifiers to begin with? First, the quantifying I'm doing is probabilistic where predicate logic is not. That seems fine, though. The real problem is that the system sort of implicitly quantifies. Patterns state what variables they quantify over, but are silent about what variables they might be holding still. The system then takes the set of variables being quantified over and sort of roughly quantifies over them all at once; but in predicate logic, the order in which we quantify is important, so the two approaches are at odds.