Sunday, December 21, 2008

Back to AI

OK, here we go.

AI is full of many fairly well-developed methods of "propositional" learning: probabilistic, fuzzy, and otherwise fancified versions of stuff we can do with boolean algebra but no quantifiers. In other words, current AI is fairly good at manipulating systems that have a fixed number of variables, but not so good at arbitrary logical structures.

A simple illustration of the two categories:

Propositional Version:
I give the AI data on 3,000 patients, which includes information about which patients had a list of 20 common symptoms, and what of 10 diseases those patients were ultumately determined to be suffering from. The AI learns to predict disease from symptoms.

Predicate Version:
I give the AI data on 3,000 patients, with logical descriptions of symptoms and logical descriptions of diseases. Again, the AI is to learn to predict diseases from symptoms.

In the first case, the AI is learning a mapping from 20 variables (the twenty symptoms) to 10 variables (the diseases). In the second case, the AI is learning a mapping from one infinite space (possible logical descriptions of symptoms) to another (possible logical descriptions of diseases).

There are some borderline cases. For example, if we are learning to map arbitrary real numbers to arbitrary real numbers, the space is infinite, and we *could* treat the real numbers as logical entities rather than merely continuous variables, but this is almost never done; so, it is propositional. (Treating the real numbers as logical entities would mean worrying about interesting exact numbers like 1, 2, 3, pi, e, 1/2, and so on. For propositional methods, 1 is not any more special than 1.002.) It gets fuzzier... suppose we are learning about integers rather than real numbers. We could adapt the same propositional strategies we used in the real-number case, restricting them to integers. It might work well. But with integers there is a greater chance that the exact number matters, and a greater chance that the number should be treated as a logical entity. Perhaps a pattern relies on one of the integers being even. A propositional method will not see this, and will need to learn from the data which integers the pattern applies to and which it doesn't. So, it will only be able to extend the pattern to cases it has seen before. If this sort of pattern is likely, more sophisticated methods become critical. Yet, we're still working with a fixed number of variables. The more sophisticated methods become really critical when a given problem instance can contain a totally new variable, yet one that patterns from old variables might apply to. And then they become really really critical when problem instances can't even be totally separated, because they all fit into one big structure of logical relationships...

The transition from the first type of AI to the second is now taking place, largely under the banner of "probabilistic relational models". Hooray!

(I've labeled the second type "predicate" to distinguish it from "propositional". This way of naming the two types is not uncommon, but many people use "relational" to describe the second class, instead. Another convenient term is "logical".)

Anyway, I'm going to outline one way of carrying over the progress that's been made with propositional models to the higher level... this is an idea I've been sort of thinking about on the back burner, partly as a way of making an inherently parallel learning method. By no means do I think this is the only way of going forward, and in fact I'll probably be working in a different direction myself for at least a little while... maybe I'll write about that later.

So, how can we use propositional algorithms in the wider domain (whatever we choose to call it)?

I described propositional models as models having no quantifiers. Really, though, a universal quantifier over all the variables is taken for granted in the propositional setting. If a propositional method learns that a particular variable is "2 or 3", it doesn't mean that for some one case it is either two or three; it means for any case it is either two or three. This single quantifier gives us some ability to do relational-style learning.

Suppose we want to learn a probabilistic model for pictures. We could treat every pixel as a variable, and learn correlations between them from the dataset. This approach would require a very large dataset before any interesting patterns could be found. It would need to re-learn every object in every possible location, because it would have no way of generalizing from one location to another. A smarter way to apply propositional methods to the problem would be to treat a small square, say 10 by 10, as a single "instance"; we learn correlations between variables in such squares. With this approach, we might be able to get something interesting from even one image. There is a cost, of course: where before it was impossible to generalize from one location in the picture to another, now it is impossible to not do so. For example, the method would not notice that an egg appears in the top right corner of each picture; it will simply notice that an egg is a common object. This can be helped, by adding the location information as an extra variable.

More generally, for any structured data, this approach can be used to learn local structure. For linear data such as text, we can use a fixed number of letters rather than a fixed number of pixels. For tree-structured data (such as computer program code, HTML...), we can use a branching segment. But, we can go even further: there is a totally general mapping that will handle any logical structure. Any data can be represented in the form of predicate statements: a series of entities, each of which can have properties and relationships with other entities. Just as we can select a random 10 by 10 area of a picture and ask what the colors are, we can select 10 entities at random and ask what their properties and relationships are. Let's call this the "universal mapping".

The universal mapping allows us to learn logical structure, but it does have some serious limitations. Suppose once again that we're looking at linearly structured data such as text, but in predicate calculus form, and with the universal mappping. We have a bunch of entities ordered by a next/previous relation, and with a "letter" property that distinguishes each entity as 'a', 'b', 'c', ... 'A', 'B', 'C', ... 'space', 'comma', 'period', et cetera. Now suppose that we sample entities at random. If we've got much text, it will be very unlikely that we'll pick letters that occur next to eachother. We're learning a correct distribution, technically, but we're not looking at the "interesting" part of it (the line). The algorithm could eventually learn what it was supposed to, but it would take too much time, and besides that it would not be in a directly usable form (since the distribution we wnat is embedded in a larger, useless distribution). If we used a special-purpose linear mapping that did not sample so randomly, we'd be much better off.

OK. Anyway, that is the "state of the art" so to speak. These are all basically a type of markov model. So, what is my new suggestion?

Well, first, let's look at one more relevant detail of the current propositional algorithms. Many of them are hierarchical in form. This means that rather than finding complex patterns directly in the set of variables, the algorithm creates additional variables, and finds simpler relationships. Simpler relationships that include additional entities amount to the more complex relationships that we could have looked for in the first place; but the process is easier to manage by abstracting it in this way. This abstraction is iterated: second and third levels of hidden variables can be created to manage increasingly complex relationships, which treat the lower levels exactly as the first level treats the visible variables.

Two recent sucessful examples are Hinton's deep belief nets and Numenta's Hierarchical Temporal Memory.

So in the propositional domain, we can talk about hidden variables. In the predicate domain, we can also talk about hidden variables, but we can add hidden relations and even hidden entities to the list. For propositional methods, hidden variables are just an efficiency issue. In the predicate domain, it is very different: hidden stuff dramatically increases the representational power (and therefore the model search space).

Starting simple with hidden variables: if each entity is allowed to posses hidden variables, then the modeling power has changed from markov-model to hidden-markov-model.

Adding hidden entities and relations is enough to boost the representational power up to turing-completeness: any computable pattern can be represented.

So, OK, enough background. Now, the question is, how can we best use the power of existing propositional methods to search a turing-complete model space? Here is the idea I've been pondering recently...

For the moment, let's say we're working with linearly structured data, for simplicity. One turing-complete formalism is the cellular automaton. These bear some structural similarity to a hierarchical network of variables. The major difference is that all cells are the same, meaning that all variables are interacting with their neighbors in exactly the same manner. My basic idea is to semi-smoothly integrate deep belief network learning with cellular automaton learning. For the linearly structured data, then, the system would be searching for two-dimensional automata. Intuitively, learning a cellular automaton of hidden variables is similar to learning the physical laws that hold sway equally at any point in space. We can see directly how those physical laws work for the visible space, and we generalize, assuming that there are nearby hidden spaces that partially explain the visible space, and farther-off hidden spaces, and so on, but that all follow the same physical laws.

Well, that was nearly 20 paragraphs of introduction for 1 paragraph of an idea... but, that introductory material was interesting in itself, and it will be useful to refer to in future posts. So, more later...

Friday, December 05, 2008

A General Theory

Well, I have been climbing the ladder of infinities for a while now on this blog. Soon I must get back to how this all applies to AI... but not quite yet.

Here is a theory that almost works.

Start with first-order logic. This can be your favorite non-classical variation if you wish; intuitionistic, paraconsistent, relevant, whatever you want. There is only one requirement: it needs to be strong enough to be Turing-complete (more specifically, logical consequence should be completely enumerable but not co-enumerable, thanks to the good old halting problem). Call this the base language.

Add to this your favorite theory of truth. For maximum effectiveness, it should include the infinite hierarchy of truth that I suggested in this post. This is not a difficult requirement: either a revision theory of truth or a fixed-point theory will do, as well as many less-well-known theories, I'm sure... and anyway, the requirement is not totally necessary, as I'll attempt to make clear. Anyway, call this language the language of truth. The theory of truth that you choose will assign actual truth-values to the statements in this language.

Now we have what we consider a solid theory of truth. But, as I pointed out in the message I copied in the previous post, all such theories appear to have referential gaps: some statements will have a status not nameable within the theory, which we can name and reason about as people standing outside of the theory. The type of referential gap will depend on the type of theory of truth that was used. In the most common case, the gap will be sentences that are not assigned either "true" or "false". The theory of truth will be able to state that a sentences is true, or false, but not that it is neither. Defenders of such theories of truth will attempt to claim that we really can't say that; for example, one way of arguing is saying that such sentences have undefined truth values, but we could later add logical conventions that ascribe true or false to them. However, such arguments are self-defeating: the argument needs to refer to the class of sentences that are in this intermediate state, so it generally must invent a label for them (such as "undefined"). This name is precisely the reference gap of the theory, and cannot be stated inside the theory.

So, the next step is to add to the language whatever labels we need in order to fill the gap(s) that exist in the theory of truth. I call this stage a theory of meaning, because I think it is important to point out that the theory of truth is not incomplete just because of the gaps; it may be a complete and valid theory of truth, it just is not a complete theory of logical/mathematical reference. Call the new gap-filling language the first language of meaning. I will generally pretend that there is only one gap, as in the simple case. Call this gap 1-meaningles. (The idea doesn't seem hard to modify to approaches that create multiple gaps.)

Assigning truth values to this language can generally be accomplished by relying on the original theory of truth that we chose. First, label the 1-meaningless sentences as such. Second, modify the theory of truth to act upon 3 truth values rather than the usual 2. This will involve some decisions, but as you can see I am not too concerned with details in this post. Generally, we'll have to decide things like whether it is 1-meaningless or just false to claim that a 1-meaningless sentences is true. Once we've made these decisions, we simply apply the method.

This, of course, creates another gap; having 3 truth values rather than 2 is not so radical as to change the result there. Call the new gap 2-meaningless, and call the language that includes it the second language of meaning. Assign truth-values to this language in the same way, by expanding the method to include 4 truth values.

By now you get the idea. We define 5-meaningless, 6-meaningless, and so on. And if you read the first post I mentioned (this post), then you'll probably also realize that I want to similarly define infinity-meaningless, inf+1-meaningless, inf+inf-meaningless, and so on. More specifically, I want to define a type of meaninglessness corresponding to every ordinal number. As I hinted at the beginning, this rather large hierarchy should smooth out most differences in referential power of the methods involved; so, a really weak initial theory of truth should still do the trick in the end, gaining maximal referential power after unstateably many iterations.

Now for the pleasant surprise. Once I've done this, I can prove that there is no referential gap left. If I had a final gap, it would correspond to an ordinal number larger than all other ordinal numbers (including itself)! This cannot be, so the theory is safe, almost as if it were protected by a magic charm.

For a few weeks now I've been satisfied with this conclusion. But, early on, I realized that I didn't know what the logic should say about a statement like "This sentence is either false or some type of meaningless". A day or so ago I realized what was going on. Each new language can refer to any combination of the truth-states from the levels under it, but obviously there is no top level (since there is no ordinal larger than all others), so we don't have permission to refer to any combination of values from any level we want; we are only allowed to refer to combinations that have some upper bound. The phrase "some type of meaningless" has no upper bound; it attempts to refer to the entire unbounded list.

There is some legitimate mathematical tradition that could be used to justify this limitation of the logic. One is not supposed to tamper with all ordinals at once. So, I could simply say that it's OK to stop here. Of course, I'm using the same sort of self-defeating argument that is so common in this area... in order to say that we can't refer to the list of all ordinals, I am doing exactly that.

Another way out would be to allow the newly-problematic statements to be both true and false, using some paraconsistent logic. This is similarly justified by the motto "one is not supposed to tamper with all ordinals at once". The taboo surrounding the list of all ordinals arises, of course, from the fact that contradictions follow quite quickly from reasoning about it. So, I could somewhat legitimately argue that it is literally an inconsistent yet well-defined mathematical entity.

However, this does not give me maximal expressive power... I will end up wanting to invent new notation to fill reference-gaps in the paraconsistent theory, and on we go.

So, a more direct approach would be to allow unrestricted statements about ordinals, and then do the same thing we've been doing... assign these statements truth values in the obvious ways, then apply an expanded theory of truth to add a truth predicate to that language, then call anything that isn't assigned a value "super-meaningless", expand the theory of truth to give that a predicate, invent 2-super-meaningless, 3-super-meaningless, inf-, and the whole ordinal hierarchy again. Then what? Well, we'll have the same sort of gap for this second ordinal hierarchy. So by doing the whole thing again, we can create super-super-meaningless, super-super-super-meaningless, and infinite versions with any ordinal number of supers . What next? Well, we'll have the same problem again...

But notice what I'm doing...

All of this is quite seriously illegal if we take the notion of ordinal number seriously, because I'm simply constructing a hierarchy of ordinals larger than all ordinals. An ordinal cannot be larger than all ordinals! And even less can there be a whole hierarchy up there...

This demonstrates the force of the two limitative arguments (either "no, you seriously cannot refer to those things" or "sure, go ahead, but you'll derive contradictions"). Even though there really really seems to be a solid next step that can be taken, it is either a brick wall or a gaping ravine...

So, like I said, it seems to be a theory that almost works.