Wednesday, May 02, 2007

Having read more, the argument I gave in the last post obviously applies to what's called "context-dependent grammars" even better than it applies to context-free ones. This is nice, because context-dependent grammars are more powerful than context-free ones. Also, I'm not sure there are any other attempts to learn context-dependent grammars. But the question still remains: can the method learn unrestricted grammars?

In a way, it seems silly to try to learn unrestricted grammars. What does it mean? The simplest characterization is: sort through the space of all possible programs in some manner, looking for the ones that output the data so far (but which don't stop there). Use the further output of these programs to predict the future. (Either choose one that's somehow best, or somehow assign different weights to each, or treat them all equally.)
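Just to make that characterization concrete, here's a minimal sketch. The little pool of candidate generators stands in for the (infinite) space of all programs, and all the function names are made up for illustration -- nothing here is the actual method, just the shape of it.

```python
# A toy version of: enumerate programs, keep the ones that reproduce the
# data so far, and use their continuations to predict the future.
# CANDIDATES is a tiny stand-in for the space of all possible programs.

def constant_zeros(n):   # candidate 1: 0, 0, 0, ...
    return [0] * n

def alternating(n):      # candidate 2: 0, 1, 0, 1, ...
    return [i % 2 for i in range(n)]

def squares_mod_3(n):    # candidate 3: i^2 mod 3
    return [(i * i) % 3 for i in range(n)]

CANDIDATES = [constant_zeros, alternating, squares_mod_3]

def predict_next(data):
    """Weighted predictions for the next symbol, treating every surviving
    candidate equally (one of the options mentioned above)."""
    survivors = [p for p in CANDIDATES if p(len(data)) == list(data)]
    if not survivors:
        return {}
    votes = {}
    for p in survivors:
        nxt = p(len(data) + 1)[-1]       # the candidate's continuation
        votes[nxt] = votes.get(nxt, 0) + 1
    total = sum(votes.values())
    return {symbol: count / total for symbol, count in votes.items()}

print(predict_next([0, 1, 0, 1]))   # -> {0: 1.0}; only 'alternating' survives
```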

This sort of pattern-finding is on a level we wouldn't even expect of a human. The output of an arbitrary computer program can be pretty hideous-- completely tangled, with every value dependent on every other. Yet, at the same time, scientists can go further than the class of "unrestricted" patterns, into the incalculable. The class of unrestricted patterns is actually defined as any calculable pattern. However, not all equations in physics are calculable. We suspect that the equations hold in all cases simply because they seem to hold in the cases we can calculate. So perhaps the "unrestricted" class of patterns isn't exactly the goal I should be after-- a general learning algorithm shouldn't be expected to learn all patterns of the unrestricted class, but it should be able to learn some that don't fall within it; patterns that are calculable only in some cases. (Here's an image of what it means for a pattern to be "incalculable": there is some sort of exact definition for it, but this definition does not imply any sort of algorithm for calculating the pattern; the only way to work out the pattern is with brute-force theorem-proving to get algorithms for any special cases you need to know, and no such theorem-proving can ever cover all cases. Neat, huh?)

Actually, I suppose it's possible (in principle) to construct computer programs that compute uncomputable functions, in a partial sense: they'll simply run forever if you ask for an uncomputable case. This suggests, then, that a learner of computable unrestricted patterns may very well learn some partially uncomputable patterns by accident, because it can't go through and check that a function is computable for every case. Neat.
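As a concrete (if cheeky) illustration of that partial sense of "computing" the uncomputable: the function below answers the halting question correctly whenever the answer is "yes", and simply never returns otherwise. No total program can do this, but the partial one is trivial. (The exec trick is just for illustration, of course.)

```python
# A partial "solution" to the halting problem: returns 1 whenever the given
# Python source halts, and runs forever when it doesn't.

def halts_with_answer(source: str) -> int:
    try:
        exec(source, {})   # run the program; if it never stops, neither do we
    except Exception:
        pass               # crashing still counts as halting
    return 1               # reached only in the cases we can actually decide

print(halts_with_answer("x = sum(range(10))"))   # halts, so this prints 1
# halts_with_answer("while True: pass")          # would never return
```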

But back to the more important issue: what would learning unrestricted patterns look like? Searching through the space of all programs that produce the data doesn't provide a very good image, because it's not very dissectable. For example, it doesn't allow us to ask questions like "if this past history had occurred instead, what future would you predict using the same patterns?" I think a slight improvement, without loss of generality, would be to search for programs that make fairly good predictions of the next moment when fed the last. This can be thought of as a "computed Markov model"-- rather than using a table of probabilities, use an algorithm. Obviously, stopping at first-order computed Markov models would be a bad idea (because in real life, all the sensory input from one second ago will not tell you everything you need to know about the next moment, not by a long shot)-- the higher the order, the better. In fact, it might be desirable to search for programs that make fair predictions when fed arbitrarily-sized chunks of the past leading up to various sample moments.
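To pin down the "computed Markov model" idea, here's a tiny sketch. Both predictors are illustrative stand-ins (the parity rule in particular is arbitrary); the only point is the difference between looking a context up in a table and running code on it.

```python
# A "computed Markov model": instead of a lookup table of transition
# probabilities, the model is an arbitrary program mapping the last k
# observations to a prediction.

def table_markov(history, table, k=1):
    """Classic k-th order model: look the context up in a table."""
    context = tuple(history[-k:])
    return table.get(context)           # None if the context was never seen

def computed_markov(history, k=3):
    """'Computed' k-th order model: run code on the context instead."""
    context = history[-k:]
    return sum(context) % 2             # e.g. predict the parity of the window

history = [1, 0, 1, 1, 0, 1]
print(table_markov(history, {(1,): 0, (0,): 1}))  # -> 0
print(computed_markov(history))                   # -> (1 + 0 + 1) % 2 = 0
```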

A penalty for complexity, however, is also to be desired. Otherwise the program can just memorize all of the answers.
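One rough way to picture the penalty, MDL-style: charge each candidate program for its own length as well as its prediction errors, so that pure memorization pays for its size. The particular weights below are arbitrary assumptions, not anything I've worked out.

```python
# Score = description length of the program + cost of its prediction errors.

def mdl_score(program_source: str, errors: int, bits_per_char: float = 8.0) -> float:
    description_length = len(program_source) * bits_per_char
    return description_length + errors * 16.0   # error cost is an assumption

# A memorizer vs. a small rule, both fitting the same data perfectly:
print(mdl_score("return [0,1,0,1,0,1,0,1]", errors=0))  # big program, no errors
print(mdl_score("return i % 2", errors=0))              # small program wins
```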

Anyway, this is an easier question to ask about the system I've been discussing: given enough data, can the system learn arbitrary algorithms for calculating the next moment from some portion of the past?

It is interesting to note that, simply because the system can learn regular grammars, it can actually learn to predict the next step in a Turing machine, possible next steps in a proof, et cetera. That is, it can predict arbitrary algorithms if it is allowed to see all of the "work" in between input and output. However, the world obviously hides its work from the observer. So can the system learn to calculate "hidden work"? This would mean something like treating the past as an algebraic statement, and performing manipulations on that statement "off to the side" until some reduction to a particular form is achieved. Then, that form must be recognized as a sign to stop, and translated (via regular means) into the answer.
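Here's a small illustration of why seeing all the "work" makes this easy: a single Turing machine step only rewrites the cell under the head, so a learner shown every intermediate configuration only has to pick up local rewrite rules. The machine below (a unary incrementer) is just something I made up for the example.

```python
# One step of a Turing machine is a purely local transformation of the
# visible configuration (tape, head position, state).

RULES = {                      # (state, symbol) -> (write, move, new_state)
    ("scan", "1"): ("1", +1, "scan"),
    ("scan", "_"): ("1", 0, "halt"),
}

def step(tape, head, state):
    write, move, new_state = RULES[(state, tape[head])]
    tape = tape[:head] + [write] + tape[head + 1:]
    return tape, head + move, new_state

tape, head, state = list("111_"), 0, "scan"
while state != "halt":
    tape, head, state = step(tape, head, state)
    print("".join(tape), state)    # every intermediate configuration is visible
```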

Since the system could in theory learn any such "algebraic manipulations" as a new type of relation (not found in the dataset), it could perform the off-to-the-side calculations. The question is: could it learn to do so? Also, could it recognize the halt-state properly? If the answer to these two questions is yes, then it could learn the arbitrary transforms.

To recognize the halting state, and utilize the result of the performed calculation, I think it would only need context-free means. All that's necessary is to recognize the arbitrarily long calculation as a single object with an answer at the end, and store the learned fact that this answer corresponds in some simple way to the future. To recognize all the scratchwork as a single entity requires only a simple recursively defined entity: scratchwork = valid-step -> scratchwork. So can particular sorts of scratchwork be learned?
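Here's a sketch of what recognizing scratchwork by that recursive rule could look like. The notion of "valid step" used here (each line rewrites a sum of integers without changing its value) is a made-up stand-in for whatever manipulations would actually be learned.

```python
# scratchwork = valid-step -> scratchwork: consume valid steps one at a
# time until the halt form is reached, then read off the answer.

def is_valid_step(before: str, after: str) -> bool:
    # toy check: both lines are sums of integers with the same total,
    # and the later line has fewer terms
    same_value = sum(map(int, before.split("+"))) == sum(map(int, after.split("+")))
    return same_value and after.count("+") < before.count("+")

def read_scratchwork(lines):
    """Recognize lines[0] -> lines[1] -> ... as one scratchwork object and
    return the final answer, or None if some step is invalid."""
    for before, after in zip(lines, lines[1:]):
        if not is_valid_step(before, after):
            return None
    return lines[-1]      # the halt form: a single term, i.e. the answer

print(read_scratchwork(["2+3+4", "5+4", "9"]))   # -> "9"
```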

For the current system to learn some sort of scratchwork without me realizing it, it would need to support some sort of "classification cascade": a few separate classifications can be combined to form a new classification, which can combine with others to form yet more classifications, resulting in an arbitrarily long series of additional classifications resembling a proof. Each new fact may aid in deriving yet more facts, and the cascade can go on for indeterminate amounts of time.
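A toy picture of such a cascade, with hand-written rules standing in for learned classifications: each rule gets to look at the object plus the facts derived so far, so new classifications keep enabling further ones until nothing more fires.

```python
# Forward-chaining cascade of classifications; the rules are placeholders.

RULES = [
    (lambda obj, facts: obj % 2 == 0,                "even"),
    (lambda obj, facts: "even" in facts and obj > 2, "even-and-big"),
    (lambda obj, facts: "even-and-big" in facts,     "composite"),
]

def cascade(obj):
    facts = set()
    changed = True
    while changed:                      # keep going as long as new facts appear
        changed = False
        for test, label in RULES:
            if label not in facts and test(obj, facts):
                facts.add(label)
                changed = True
    return facts

print(cascade(10))   # -> {'even', 'even-and-big', 'composite'}
print(cascade(7))    # -> set()
```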

For this to be the case, there needs to be an infinite number of possible classifications. (Otherwise, a cascade couldn't go on for very long.) This is workable, because I allow for classifications of classifications-- and such a meta-classification can potentially be a catch-all for an infinite number of classifications. However, in its present state, the system can't fully utilize all such classifications; the meta-classifier can be used to identify new classifiers as instances of the infinite class, but still each such new classifier must be created individually; they cannot be created on the fly. An interesting problem. To solve it, we need "anonymous classifiers" (classifiers that we don't bother to give names) defined on-the-spot for entities; all possible first-level classifiers that would say "yes" to the given object are searched through, but without actually bothering to instantiate them, and we test the meta-classifiers on all of them. If one satisfies a meta-classifier, then we know that the object falls into a particular infinite class. (This could be recorded in various ways, I suppose.) A similar procedure must be applied for meta-classes once we've got meta-meta-classes, and so on.
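To make the anonymous-classifier idea a little less abstract, here's a sketch. The family of first-level classifiers ("accepts multiples of n") and the meta-classifier are both invented for the example; the point is only that we enumerate descriptions of classifiers that would accept the object, without instantiating or naming any of them, and test the meta-classifier against those descriptions.

```python
# Decide whether an object falls under a meta-class by searching through
# descriptions of first-level classifiers that would accept it.

def accepting_classifiers(obj, limit=100):
    """Lazily yield descriptions of first-level classifiers that say 'yes'
    to obj; here a description is just the modulus n of 'multiple of n'."""
    for n in range(2, limit):
        if obj % n == 0:
            yield n                    # anonymous: never instantiated or named

def falls_under_meta_class(obj, meta_classifier):
    return any(meta_classifier(desc) for desc in accepting_classifiers(obj))

# meta-class: "accepted by some classifier whose modulus is a perfect square"
is_square = lambda n: int(n ** 0.5) ** 2 == n
print(falls_under_meta_class(12, is_square))   # True: 12 is a multiple of 4
print(falls_under_meta_class(10, is_square))   # False: accepted by 2, 5, 10 only
```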

This sounds a bit like gobbledygook, because I didn't fill in any details at all, but I think the general idea is sound.

Anyway, all of that still fails to guarantee that arbitrary scratchwork could be learned. The fact that I can have some infinite classes of classifiers doesn't even quite guarantee the sort of cascades I want. I've basically got to prove that the classifiers act like statements in predicate calculus. Each can be arbitrarily long, being (in a proper sense) composed of smaller statements, and interesting rules for deriving one from another can occur (meaning arbitrary production rules). After that, I've got to prove that all of this can occur through learning, and that the rules learned are in some sense optimal! Sounds difficult.

So how complicated can classifications get? In classifying classifiers, the system can use any methods it would use for classifying ordinary data. So assuming my previous arguments are correct, that means a class of classifiers can be at least context-dependent in complexity. But this is concerning the sequence associated with the data it represents; it shouldn't be confused with the complexity of the "logical statement" it represents. Can a class represent the intersection of two other classes? (Yes.) Can it represent the union of two other classes? (Yes, provided there is some reason to form this union, such as a common context or property of any kind. That restriction may be fatal, I don't know. It seems reasonable given that this stuff has got to be learned.) Can it represent the negation of another class? (Hmm. Sounds like a good question. It depends on how I generalize Markov models to arbitrary graphs. Is a subgraph defined solely by a set of relations it does contain, or can it be defined also by a lack of particular relations? I don't think allowing definition by a lack of relations would cause any trouble...) What about quantifiers? (Very good question. But what would it mean? "For all classes X, this class such-and-such"? No. It wouldn't be quantification over classes, it would be quantification over... properties? Quantification in predicate statements isn't over smaller statements, it's over variables that make up statements... but it can be interpreted equivalently either way... so yes?? Joint denial over a class of statements (classes) formed by a variable-substitution scheme is all that's necessary? So if I can say such things as "for all classes A, if A satisfies (stuff including the variable A) then this class satisfies (other stuff including the variable A)" I'm good? Hmm...)
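Just to pin down what those operations would mean (which says nothing about whether the system could learn them), here's a sketch with classes modeled as plain predicates, including the joint-denial-over-substitutions move from the last question.

```python
# Classes modeled as predicates; all the examples are made up for illustration.

def intersection(a, b):  return lambda x: a(x) and b(x)
def union(a, b):         return lambda x: a(x) or b(x)
def negation(a):         return lambda x: not a(x)

def joint_denial(template, instances):
    """'None of the substituted statements hold' -- joint denial over a
    class of statements formed by substituting each A into the template."""
    return lambda x: not any(template(A)(x) for A in instances)

def forall(template, instances):
    """'Every substituted statement holds' -- recoverable from joint denial
    of the negations, but written directly here for readability."""
    return lambda x: all(template(A)(x) for A in instances)

even        = lambda x: x % 2 == 0
positive    = lambda x: x > 0
multiple_of = lambda A: (lambda x: x % A == 0)

print(intersection(even, positive)(4))          # True
print(negation(even)(4))                        # False
print(forall(multiple_of, [2, 3])(12))          # True: 12 is a multiple of both
print(joint_denial(multiple_of, [7, 11])(12))   # True: 12 is a multiple of neither
```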

Of course, predicate calculus is only one example. If the classes can arrange themselves to imitate any Turing-complete system, and learn such arrangements based on data, then I've got what I'm looking for.