Friday, October 13, 2006

More Metamodels

The last post discussed why we can't build an ultimate solution to AI-- an ultimate model of how to model the world. I've got to admit that I don't particularly like the fact. I want every element of my AI to follow logically from the basic purpose-- when I describe it, I want it to sound like the obvious solution, the right way of doing things. So, although we can't say what model for modelmaking is best, let's discuss some basic elements that model-models should have.

Obviously, the more types of models we can generate with our modelmaking model, the better. That's why AIXI and OOPS use turing-machines; technically, they can do anything. So a good modelmodel should be universal in the sense that it can come up with any particular model.

Second, we need some rating for how good each model is. Since we can come up with any model, there will be a large number of possibilities for each set of data. There are several ways to rate models. For genetic algorithms and neural nets, a common concern is how closely we match the data-- we give each model a score for percent correct on a number of problems. As I explained in the previous post, AIXI and OOPS both use the criteria of 100% matching; no errors are to be tolerated. But this criteria alone did not limit them to only one model. Similarly, with genetic algorithms and neural nets we run into trouble if we only use percent correct. The model with the highest score may be overmatched: it may have memorized the answers for our particular test, but have no idea what to do for any other. The equivalent for AIXI and OOPS would be a program that generates the desired output by quoting it in it's program-- if the data is "1001011010010101", then the program is "say 1001011010010101". No structure has actually been found in the data, or perhaps only an extremely small amount of structure ("say 100101101001 and then stick 01 on the end twice"). In some sense, these models are too complicated. AIXI and OOPS differ in how they define this notion of "complicated", of course. Neural nets and genetic algorithms have a different often-used solution: instead of giving the entire dataset to the program for it to train itself on, we give it only some of the data. Then once it's trained, we test it using the data that we didn't give it-- if it's just memorized the first dataset, it won't do too well on the new data.

I call these two different measures, the percent correct and the measure of memorization, "utility" and "elegance". "Utility" measures how well the model fits the data, and "elegance" measures how likely the model is to work well for other data in addition to the original training set. The term elegance makes more sense when applied to the AIXI length prior and the OOPS speed prior than for the method of withholding some data until the end, and it's true that I'm biased towards the second sort of method-- a measure of elegance based on what the model looks like, how 'nice' a model it is, not based on withholding some data.

Thursday, October 12, 2006


The basic problem in AI is to create a model for how the mind works.

The basic problem for the mind is to create models of the world.

Therefore, the basic problem in AI is to create basic models of how models work. In other words, we've got to figure out how to predict the world based on our limited knowledge of it. The basic problem is: "How do we make models?"

If there is a universal grand solution to AI, then it will be a universal way of making models and applying them to form predictions and to act on those predictions. There have been claims to this accomplishment already. Two such claims are AIXI (see "A Gentle Introduction to the Universal Algorithmic Agent AIXI", Marcus Hutter, 17 January 2003) and OOPS (see "The New AI: General and Sound and Relevant for Physics", Jurgen Schmidhuber, November 2003). Both make the following assumptions: a model is essentially a turing machine program, and a model should reproduce the data perfectly. Apart from that, the two are very different.

AIXI uses the "length prior", which assumes that the shortest turing machine program that can reproduce the data exactly is the best model. Because of this, AIXI requires an infinite amount of time to find this program; it cannot know if there exists some very short program that merely takes a very long time to calculate the output unless it waits for all possible programs (each possible combinations of ones and zeros, each being executed as if they were programs) to run their course.

OOPS uses the "speed prior" instead, which means that it assumes that the shortest program based on calculation time, not code length, is the best model. It is easy to see that there is a sort of loose correspondence between the two; shorter code will generally take less time to execute. But they are vastly different. The speed prior makes the optimal model much easier to find: we run all possible programs in parallel, starting with "1" and "0", then branching "1" to "10" and "11" and branching "0" to "01" and "00", et cetera. As we grow models in this way, some branches fall behind in creating data and some speed ahead; this is because, as we execute the branches, the programs jump around their instruction sets (the ones and zeros) and create data sort of at random, whenever the instruction set says to. Whenever one of these branches makes an output that conflicts with the data, the branch gets pruned. (remember, both AIXI and OOPS require models to conform exactly to the data.) The branch that gets to the finish line first, having reconstructed the entire dataset from it's program, wins. It is selected out of the bunch as the proper model. This may take a while if the dataset is large, but we are still much better off than with AIXI. (Note: there is more to OOPS than the speed prior. It is also a framework for creating smaller programs that get integrated into larger programs that get integrated into larger programs... et cetera. "OOPS" stands for "Optimal Ordered Problem Solver", which is meant to reflect the idea of using the speed prior first on small problems, then on larger and larger problems, integrating the smaller solution-programs as possible substructures of the larger programs, which (to abbreviate the actual methods extremely) are considered as additional options other than "1" and "0". So the more OOPS has learned, the larger and more sophisticated it's programs. But we will ignore this additional structure, because it is irrelevant to the discussion.)

Before I try to point out problems in these methods in particular, let's examine some general ideas. Both of these methods are general models of the model-making process-- not models of the world, but rather models of models of the world. Which one is right? Does it make any sense to ask the question?

Basically, what the question asks is "Which one produces better models"? The one that produces better models is obviously the better one, and if it actually produces better models than any other possible model of models, it is the best, and can be called the True Model of Models-- the universal solution to AI.

Also, we've got to ignore the fact that AIXI takes an extremely long time to calculate. This only proves that if we don't have sufficient resources, the best we can do is OOPS. AIXI might still be the Ultimate Model of Models; OOPS would merely be the easy-to-calculate alternative.

So, can we figure out which is better? Which matches the way the world works most closely? Which will make better models of our world?

This is actually a very strange question. Which one approaches the "real" model of the world most quickly? Which one will more easily happen upon the Ultimate Truth Behind Everything? Both AIXI and OOPS look for the program that will generate the universe-- the first using the length prior, the second using the speed prior.

The length prior assumes that the Ultimate Truth is a short program. The speed prior assumes that it's a fast program. Or, to put it another way, the length prior assumes that the universe was generated randomly, but favoring the short; the speed prior assumes it was generated randomly, but favoring the quick.

Now, we don't know the ultimate program for the universe. But we know a few things about it-- essentially, we know that it generally seems to follow some laws of physics that we've accumulated. So from that, we can start to look at which prior is better. What we've got to do is think of these laws as generated randomly, and try to figure out how they were generated-- we've got to try to come up with a pattern behind the physical laws.

Now, to do this in an objective way, we've got to have a scientific way of distinguishing between good and bad patterns. We can't go around suggesting models for the physical laws at random; we've got to have some criteria... a model that tells us which patterns are right and wrong.... a model to help us make models...

But wait! This means that to judge between different models of how to make models, we've got to already have a model for making models!! We can't decide which one is better unless we decide which one is better!

And that ends my narrative. For now.