Back to AI
OK, here we go.
AI is full of many fairly well-developed methods of "propositional" learning: probabilistic, fuzzy, and otherwise fancified versions of stuff we can do with boolean algebra but no quantifiers. In other words, current AI is fairly good at manipulating systems that have a fixed number of variables, but not so good at arbitrary logical structures.
A simple illustration of the two categories:
I give the AI data on 3,000 patients, which includes information about which patients had a list of 20 common symptoms, and what of 10 diseases those patients were ultumately determined to be suffering from. The AI learns to predict disease from symptoms.
I give the AI data on 3,000 patients, with logical descriptions of symptoms and logical descriptions of diseases. Again, the AI is to learn to predict diseases from symptoms.
In the first case, the AI is learning a mapping from 20 variables (the twenty symptoms) to 10 variables (the diseases). In the second case, the AI is learning a mapping from one infinite space (possible logical descriptions of symptoms) to another (possible logical descriptions of diseases).
There are some borderline cases. For example, if we are learning to map arbitrary real numbers to arbitrary real numbers, the space is infinite, and we *could* treat the real numbers as logical entities rather than merely continuous variables, but this is almost never done; so, it is propositional. (Treating the real numbers as logical entities would mean worrying about interesting exact numbers like 1, 2, 3, pi, e, 1/2, and so on. For propositional methods, 1 is not any more special than 1.002.) It gets fuzzier... suppose we are learning about integers rather than real numbers. We could adapt the same propositional strategies we used in the real-number case, restricting them to integers. It might work well. But with integers there is a greater chance that the exact number matters, and a greater chance that the number should be treated as a logical entity. Perhaps a pattern relies on one of the integers being even. A propositional method will not see this, and will need to learn from the data which integers the pattern applies to and which it doesn't. So, it will only be able to extend the pattern to cases it has seen before. If this sort of pattern is likely, more sophisticated methods become critical. Yet, we're still working with a fixed number of variables. The more sophisticated methods become really critical when a given problem instance can contain a totally new variable, yet one that patterns from old variables might apply to. And then they become really really critical when problem instances can't even be totally separated, because they all fit into one big structure of logical relationships...
The transition from the first type of AI to the second is now taking place, largely under the banner of "probabilistic relational models". Hooray!
(I've labeled the second type "predicate" to distinguish it from "propositional". This way of naming the two types is not uncommon, but many people use "relational" to describe the second class, instead. Another convenient term is "logical".)
Anyway, I'm going to outline one way of carrying over the progress that's been made with propositional models to the higher level... this is an idea I've been sort of thinking about on the back burner, partly as a way of making an inherently parallel learning method. By no means do I think this is the only way of going forward, and in fact I'll probably be working in a different direction myself for at least a little while... maybe I'll write about that later.
So, how can we use propositional algorithms in the wider domain (whatever we choose to call it)?
I described propositional models as models having no quantifiers. Really, though, a universal quantifier over all the variables is taken for granted in the propositional setting. If a propositional method learns that a particular variable is "2 or 3", it doesn't mean that for some one case it is either two or three; it means for any case it is either two or three. This single quantifier gives us some ability to do relational-style learning.
Suppose we want to learn a probabilistic model for pictures. We could treat every pixel as a variable, and learn correlations between them from the dataset. This approach would require a very large dataset before any interesting patterns could be found. It would need to re-learn every object in every possible location, because it would have no way of generalizing from one location to another. A smarter way to apply propositional methods to the problem would be to treat a small square, say 10 by 10, as a single "instance"; we learn correlations between variables in such squares. With this approach, we might be able to get something interesting from even one image. There is a cost, of course: where before it was impossible to generalize from one location in the picture to another, now it is impossible to not do so. For example, the method would not notice that an egg appears in the top right corner of each picture; it will simply notice that an egg is a common object. This can be helped, by adding the location information as an extra variable.
More generally, for any structured data, this approach can be used to learn local structure. For linear data such as text, we can use a fixed number of letters rather than a fixed number of pixels. For tree-structured data (such as computer program code, HTML...), we can use a branching segment. But, we can go even further: there is a totally general mapping that will handle any logical structure. Any data can be represented in the form of predicate statements: a series of entities, each of which can have properties and relationships with other entities. Just as we can select a random 10 by 10 area of a picture and ask what the colors are, we can select 10 entities at random and ask what their properties and relationships are. Let's call this the "universal mapping".
The universal mapping allows us to learn logical structure, but it does have some serious limitations. Suppose once again that we're looking at linearly structured data such as text, but in predicate calculus form, and with the universal mappping. We have a bunch of entities ordered by a next/previous relation, and with a "letter" property that distinguishes each entity as 'a', 'b', 'c', ... 'A', 'B', 'C', ... 'space', 'comma', 'period', et cetera. Now suppose that we sample entities at random. If we've got much text, it will be very unlikely that we'll pick letters that occur next to eachother. We're learning a correct distribution, technically, but we're not looking at the "interesting" part of it (the line). The algorithm could eventually learn what it was supposed to, but it would take too much time, and besides that it would not be in a directly usable form (since the distribution we wnat is embedded in a larger, useless distribution). If we used a special-purpose linear mapping that did not sample so randomly, we'd be much better off.
OK. Anyway, that is the "state of the art" so to speak. These are all basically a type of markov model. So, what is my new suggestion?
Well, first, let's look at one more relevant detail of the current propositional algorithms. Many of them are hierarchical in form. This means that rather than finding complex patterns directly in the set of variables, the algorithm creates additional variables, and finds simpler relationships. Simpler relationships that include additional entities amount to the more complex relationships that we could have looked for in the first place; but the process is easier to manage by abstracting it in this way. This abstraction is iterated: second and third levels of hidden variables can be created to manage increasingly complex relationships, which treat the lower levels exactly as the first level treats the visible variables.
Two recent sucessful examples are Hinton's deep belief nets and Numenta's Hierarchical Temporal Memory.
So in the propositional domain, we can talk about hidden variables. In the predicate domain, we can also talk about hidden variables, but we can add hidden relations and even hidden entities to the list. For propositional methods, hidden variables are just an efficiency issue. In the predicate domain, it is very different: hidden stuff dramatically increases the representational power (and therefore the model search space).
Starting simple with hidden variables: if each entity is allowed to posses hidden variables, then the modeling power has changed from markov-model to hidden-markov-model.
Adding hidden entities and relations is enough to boost the representational power up to turing-completeness: any computable pattern can be represented.
So, OK, enough background. Now, the question is, how can we best use the power of existing propositional methods to search a turing-complete model space? Here is the idea I've been pondering recently...
For the moment, let's say we're working with linearly structured data, for simplicity. One turing-complete formalism is the cellular automaton. These bear some structural similarity to a hierarchical network of variables. The major difference is that all cells are the same, meaning that all variables are interacting with their neighbors in exactly the same manner. My basic idea is to semi-smoothly integrate deep belief network learning with cellular automaton learning. For the linearly structured data, then, the system would be searching for two-dimensional automata. Intuitively, learning a cellular automaton of hidden variables is similar to learning the physical laws that hold sway equally at any point in space. We can see directly how those physical laws work for the visible space, and we generalize, assuming that there are nearby hidden spaces that partially explain the visible space, and farther-off hidden spaces, and so on, but that all follow the same physical laws.
Well, that was nearly 20 paragraphs of introduction for 1 paragraph of an idea... but, that introductory material was interesting in itself, and it will be useful to refer to in future posts. So, more later...