As my ideas about AI get more detailed, I come up with more and more subdivisions of the same single concept. Currently, there's two halves to it: the basic datastructure that supports everything (allowing a computer to represent "fluid" thoughts that don't have definite borders, but rather form an interconnected web), and the overall architecture that any intelligent system needs to follow (or, at least, could benefit from) regardless of basic makeup. The purpose of this document is to present the overall scheme, disregarding the datastructure. Hopefully this will make the explanation more understandable (less technical). But, while reading, keep in mind that I'm only ignoring the problem of implementing these ideas on a computer to make them simple. It is a problem, not to be ignored, and I do have something of a solution.

There are two ways to present all of these ideas: starting with the internal (subjective reality inside the mind) and moving outwards, and starting with the external (behavior of an intelligent system) and moving inwards. Since the first follows more closely the order in which I came up with these ideas, I'll present them that way; but at the end I'll explain why it's perhaps more logical if you look at it the other way.

I started with the thought that all intelligence follows the basic design of looking for repeated patterns. I won't argue this point too far, since I'm just going to go and refute it later (thus flows the progression of ideas), but imagine an animal learning to see. Many repeated structures are already inborn; certainly the brain already knows how to pick out lines, and also the animal has some instinctual idea of what members of it's own species look like, and what predators look like. But from the basic building blocks of colors and lines, it must construct man more complicated repeated patterns, such as the various types of plant and animal, various types of terrain, places it recognizes, et cetera. Larger repeated patterns are constructed out of smaller ones; once the animal knows many of the patterns at a given level, it can easily start to look at patterns at the next level up. Each level is treated about the same by the mind; patterns of colors and lines are not fundamentally different than patterns of landscape. The basic parts are just more complicated. So every repeated pattern we find simplifies the picture, making it even easier to find another repeat.

Now, there are some obvious problems with treating all of this as just "finding repeated patterns". Foremost, all of the repeated patterns seem sloppy, so that it's hard to decide exactly what qualifies as (say) a line. But instead of abandoning this approach as hopeless, let's see how far it gets us, and then try to fix the problems by adding to the basic idea instead of by giving up and starting over.

The nice thing about treating everything as finding repeated patterns is that it's a fairly simple thing for simple-minded computers to do. All that's necessary is to count the number of times various sequences of small sizes show up, and when one shows up more than the rest, remember it as a single unit, give it a name ("X"), and then abbreviate the data by sticking "X" anywhere that the pattern shows up. (Sequences of bigger sizes are constructed of multiple smaller ones, so we don't have to look for sequences that are too large.) This is especially easy with one-dimensional data, since we can just replace the sequence with the name. With more complicated structures, like a two-dimensional picture, or a video (effectively 3-dimensional), we can have irregularly-shaped patterns in our data, so we can't just cut out squares (or cubes). But this issue strays into the realm of my datastructure, so I won't go into it.

Now, let's develop this a little further. We can come up with some mathematics for when to consider a repeat significant. Basically, we just want to know when a particular pattern is likely enough that, given the amount of data we've seen, we should expect it to crop up randomly the number of times we've counted it. If so, we can ignore it. But if we count it more times than we should have already, we pay attention.

The expected number of repeats for a given pattern is approximately the probability of that pattern occurring in data exactly it's size, (1/number of possibilities)^(size of pattern), multiplied by the size of the dataset. This is an overestimate, because the pattern couldn't actually fit in that many times; however, multiplying the maximum number that could fit would be an underestimate, and farther off the real number (I suspect). I'd like to figure out the exact formula, but it seems likely that it would be complicated enough to merit abbreviation anyway. But, to continue: this math will give us (approximately) the number of times a pattern should repeat. If the pattern repeats more, it's worth noticing. But it would also be nice if we could get a number by which we can compare the merits of different possible repeats, so that we can select the one that is most significant. The simplest would be (total number of repeats)/(expected repeats), which would equal (total repeats * size of data)/(number of possibilities ^ size of pattern). If this number is above 1, then the pattern is worth noticing; the higher, the more worth noticing. (I would like it if this number were a probability, instead of a number that wanders above/below 1, but again I don't know the proper math, if it exists.)

In addition to being only approximating, this method also ignores the number of patterns we're looking for. If we're only checking a single sequence to see if it works, then the math is (about) right. But if we're checking all possible sequences to see if they're useful repeaters, then a good number (about half?) are likely to be validated just by chance. So, technically, we should take into account the number of patterns we're checking for. This is another area where my math could use improving.

The point of doing all of this math is simply to come up with ideal behaviors we'd like our system too approach; I'd like to have better mathematics because it would set up a more accurate theoretical ideal, but in practice, the system may do approximate math, or may follow the math nearly without "doing math" at all (perhaps following emergent/biological models).

Comparison to other systems: At this point, the system matches up with certain techniques of data compression, in which sequences that repeat sufficiently often are given a shorter name. One major difference, however, is that those systems will recognize a repeating sequence whenever it makes the size of the file smaller, while I'm trying to use more "scientific" math that only recognizes sequences that are unlikely to have repeated so many times.

Now to address the problems to this approach. As mentioned, the major difficulty is simply one of flexibility. We'll look at this first, and our attempts at fixing it will uncover some other less obvious problems.

The basic idea is that, if recognizing only one possibility causes problems, we should recognize several.

If we call the old type of repeated pattern a "static pattern", we can call the new type a "variable pattern". A variable pattern is a set of static patterns that are thought of as "the same" in some sense. The way in which they are "the same" involves how they "act"-- where they tend to pop up. Essentially, we want to look at the context of each pattern; if two patterns have the same or similar enough recurring contexts, then they are merged into a single variable pattern. The likelihood that a group of patterns becomes merged depends on how well that group differentiates itself from the rest of the patterns. (I haven't developed math specifically for this, and though I'm not against it, it will become clear that I don't have to; the idea of a variable pattern will emerge all by itself from a more general system.)

Since a variable pattern is viewed as "the same thing", we can build static patterns that have variable patterns as pieces. For example, if a variable pattern were "vowel" (standing for the static pattern a, e, i, o, or u) then the static pattern "s, w, vowel, m" would match the three words "swim", "swam", and "swum" (along with the two non-words "swom" and "swem"). These new, more flexible static patterns can be combined into more variable patterns, which can be combined into even more complex static patterns, et cetera. We can nest the two in eachother as far as we like.

An aside: This handles "discrete" messiness, not "continuous" messiness. For example, it doesn't help in the least with finding patterns in digital photographs, in which we'd be hard-put to find two pixels with the exact same color value. Variable patterns can't help us too much because they can't get a grip on the situation; we can't easily say that two colors act "the same" based on what contexts they turn up in, because there's just so many possible contexts. Perhaps the variable pattern approach would eventually yield fruit simply because similar colors tend to be next to one another. But it's far more important to look at the relationships between colors; the contrasts; the gradients. I still argue that this has to come down to discrete ideas in oder for a mind to be able to handle it well; for example, in classifying a color, we compare it to other colors, and make greater-than/less-than type comparisons (which are discrete). This relates to datastructure again, but the system should be able to "see" the relation between adjacent colors about as easily as it "sees" the colors themselves. This greatly increases the applicability of discrete methods. Discrete features are essentially various types of discontinuity in the data.

Comparison to other systems: the system I've been describing thus far is notably very similar to a system called "upWrite", which I learned of after developing these ideas myself. This is described in the paper The UpWrite Predictor: A General Grammatical Inference Engine for Symbolic Time Series with Applications in Natural Language Acquisition and Data Compression by Jason L. Hutchens (1999).

Continuing: We now have an idea of a system that would take in data and build up larger and larger variable and static patterns. These patterns can then be used to generate random data that will be somewhat realistic, based on the data we've fed into it. But (of course) there's something missing. The system assumes that the choice between the various possibilities is totally random. If it were trying to construct a text, say, it might capitalize letters randomly, because it decided that it was useful to see "R" and "r" as the same thing. Also, variable patterns should be able to share elements; one variable pattern might be (cat, dog, hamster, fish) while another might be (dog, canine, mutt, hound). Both variable patterns are useful for different purposes.

We might make individual solutions for these two particular problems (and it certainly might be useful for certain purposes), but there is a larger problem looming. The idea of defining sensory objects can only take us so far. Given data, our ability to predict is only based on object-completion; if we see part of an object, we predict that the rest can be found. This can take us a certain distance-- given part of a jaguar, we predict the rest of a jaguar, and run. Given the first event in a sequence of events we've seen before, we predict the future. But it's not really a logically complete way of looking at the world. Don't get me wrong-- humans aren't logical. The idea that we prefer object-completion to rule-completion can explain a good number of "illogical" things we do. (In particular, "confirmation bias" comes to mind. When asked to check the truth of a logical rule "if A then B" (where A and B each stand for something that could be true or false in particular situations), a person will tend look only at situations in which A or B is true, and check to see if the other is true. For example, if a person is presented with a set of face-up playing cards, and told to test the statement "If a card has a mark on the back, then it is a diamond", that person would tend to check the diamond cards to see if they have marks on their backs. Logically, it's just as important to check non-diamonds-- if one of them has a mark on the back, the rule is false. But people check the diamonds first, and will often forget to check the others at all. This is illogical, but it makes sense if people are thinking in terms of objects instead of rules; the people are testing whether their object-completions based on the hypothetical object are correct.) Nonetheless, it's also useful to "conditionalize" the world-- break it up into particular situations. This is very similar to what we need to do to look at the "context" of a pattern when looking for variable patterns; we say "if there's this pattern here, what can I expect to be around it?". Furthermore, it addresses the problem of needing different variable patterns in different contexts-- replacing it with the idea of learning different sets of possibilities for those different contexts.

So: How do we learn logical rules? Well, what we have to deal with is objects with variables. Even if we just use the raw unprocessed data, that's what we've got; it could be pixels with variable color, it could be touch sensors with variable pressure. A logical rule is a relation between variables.

The "logic" I use is actually inspired more by a system of logic developed by Lewis Carroll before his death in 1898, based on term logic, then on modern-day propositional logic or predicate calculus. He first published this in his book "Game of Logic" in 1887, then in a more complete version as "Symbolic Logic" in 1897. The basic system is to represent a Universe of Things using a grid with one dimension for each quality which may be true or false of a particular Thing. (He had a rather clever way of drawing out up to 10 true/false dimensions on a flat sheet of paper, which he gave as an alternative to Venn Diagrams.) A blank grid communicates to us no information about the Universe under discussion. (He capitalized Universe, Thing, and several other important terms in his book.) Each spot on the grid represents all Things under discussion that have the given combination of properties. Statements are made using markers on the grid: a red marker meant "There exists at least one Thing with these properties", and a grey marker meant "There does not exist any Thing with these properties". He gives an example of the universe of books, examining the properties "new" and "english". A grid with four locations results-- new and english, new and not english, not new and not english, and not new, but english. If we give the letter N to "new" and the letter E to "english", using the symbol ~ for "not", we get much shorter names: N E, ~N E, ~N ~E, and N ~E. Now we can store some logical information. For example, if we learn no new books are English, then we would place a grey marker in the N E square. If we learn that there are some new non-English books, we would place a red marker in the N ~E square.

What I like about this scheme, partly, is that it is oriented towards making mental models of the situation, rather than oriented towards making statements in some formal language (the model for all modern logic). This same issue is addressed also by my datastructure, which is meant to reflect structures in a non-lingual (or at least less lingual) way.

The only rule of deduction within Lewis Carroll's system is elimination-- for example, if I know that some old books exist, but know that no old English books exist, then I can conclude that some old non-english books exist. (In order to store "some old books exist", Lewis Carroll would put a red marker between the two squares ~N E and ~N ~E. For my purposes, it seems more convenient to store it in a separate grid with only the squares N and ~N. Every grid of a certain number of variables would have up to two to that power (minus one) such smaller grids for storing more general pieces of information; but only if needed, of course.) Of course, since (unlike the system developed by Lewis Carroll) we will actually need to deal with instances of the Things, we have another sort of deduction: we take anything we know about a particular object, and use the information stored in the grids to try to deduce more. There are two strategies we can use when doing this: either "everything is possible unless we know otherwise", or "everything is impossible until proven otherwise". The first pays attention only to possibilities that have been ruled out, using them to eliminate possibilities for the particular object. The second pays attention only to possibilities known to have examples, admitting those possibilities for the example case. Thus, we have three truth-values for deduced facts: known impossible, known possible, and unknown.

With many variables that we'd like to examine, we can do better than true/false representations. For example, representing a pixel's color in true/false format would mean a huge number of dimensions: one yes/no question in the form "Is this pixel ___?" for every single color. In such cases, a "yes" for one of the question means a "no" for all of the others. This means the many variables can be combined to one (or, in the case of color, the customary three may be more convenient).

A logical rule, then, is represented as a conditional grid: limitations on existence and nonexistence that occur in given situations. Another way to think of this is as a partial grid; we store the information about just a particular slice of the grid, defined by limiting our interest to certain values of certain variables, and storing information about what happens to the other variables.

Now to get back to our central topic, the issue of learning. How does the system decide on it's own where to place those red and grey markers?

When deciding what's possible (trying to place red markers), we already have direct evidence of what exists and what doesn't exist. Given any number of variables, we can fill in combinations that have occurred from memory. This can be thought of as a sort of long-term memory; the short-term is the actual data, in which we can do as much looking around for patterns as we like, while the long-term only holds specific combinations of variables that have occurred. The more variables we record the memory under, the more details we will have. Once the short-term version has been deleted, we will only have access to what's been recorded. So the size of the grid that the mind generally stores memories on determines the detail of the memories. (Certain memories deemed more important might be stored on larger grids, of course.)

In concluding that certain combinations of variables are impossible (placing grey markers), we look at combinations that haven't appeared yet, and ask if they should have appeared statistically. The more variables we're looking at, the more evidence we need; we can conclude more quickly, for example, that a particular flower doesn't open when it's cloudy then that the flower never opens when it's cloudy, the ground is dry, and the temperature is below 60 degrees fahrenheit. Of course, if we can conclude the first, then the second is simply obvious. But otherwise, the second takes longer for us to conclude because the total combined event (flower opens + temperature is below 60 + ground is dry + it's cloudy) is less likely in the first place; we wouldn't expect to see it for some time even given random data.

Again, I'd like to develop the math behind this more fully. At the most basic level, we should expect to wait a number of instances equal to the number of possibilities to see one particular possibility given random data. However, this again fails to take into account the fact that we're searching for multiple such patterns; about half of the patterns we look for will be validated randomly with such an approach.

The type of complete grids we look at are generally the relation between adjacent pieces of data. For example, if our data is organized by time, we might want to look at the relationship between two adjacent moments, or possibly three or four. This means that we construct an abstract space representing the "universe" of all neighboring-moment pairs, and make statements about them using Lewis Carroll's method. For spatially organized data, we can look at more complex arrangements of neighbors; any shape that fits in the given number of dimensions. For example, we might look at circles (or spheres), squares (cubes), or nebulous and disconnected shapes, as we see fit. We could even look at situation-dependant shapes; for example, on a map with attached data, we might check for a relationship between towns that are "on the opposite side of a mountain". Some towns will have such a neighbor, and some will not; and which town is which town's neighbor will depend on the geographical context, not on a preset grid. Also, we don't need to look at all the stats for the neighbors we're examining; we might be looking at the relationship between one point's elevation and the moisture levels in surrounding points, for example.

The idea is that the computer should look through these different possible relations, starting with the simplest, and moving to the more complex as time allows. In practice, though, the number of neighbors that we look at at once is very limited, due to the expensiveness of tabulating the exploding number of points on the record-keeping grids. Each new variable we add in multiplies the grid size by it's number of possibilities, so we can only really look at a few.

But the idea of conditional grids, and our previously-formed object system, come to our aid. Rather then look at all possible situations, limiting ourselves to only a relatively small number of variables, we can use our previously created set of static patterns as a set of "important situations"-- situations that pop up again and again. We then look at any important relational effects that spring up in these situations. We can look for these effects at multiple levels; we might look for effects in the raw data surrounding the pattern, or we could look instead for effects in more abstract interpretations of the data as objects. If a static pattern has any variable patterns in it's makeup, we can look at these as if they were actual variables of basic data items, looking for relations between them or between them.

Also, this additional framework for logical relations allows for the concept of variable patterns without them being pre-existing in the system. As stated, variable patterns are defined by having similar contexts. The ability to conditionalize contexts lets us examine the context of an individual pattern. What we want to do is call two patterns "the same" if they have contexts that are similar in some way. If we view the relational grids as just more data, then we can find static patterns (and variable patterns and rules) in just the way we would in the basic data. Two patterns are "the same" if their grids pattern similarly. This also allows for overlapping classifications (like the "dog" earlier that could be classified as "word for pet" or "word for canine"); the various patternings in the grids can be complicated, and may have different sorts of similarities and differences.

This idea of treating the grids as data, I think, is quite powerful. If the grids formed from the basic data can be treated as data, then they have grids of their own, which are data and have data, and so on, as far as we please. This reaching upwards into greater abstraction should be not only a result of the search for pattern, but a driving force; for example, of such a higher pattern suggests that some particular combination of variables is impossible, despite it not being proven impossible in the conventional data-drivewn way, the system should still decide that that combination is probably not possible. Patterns developed at this level could perhaps initiate a search for lower-level proof.

Now we have two (heavily interacting) layers: objects (static and variable patterns) and rules (relational patterns). This allows us to take in data and build up predictive models of the data. However, it doesn't account for acting on these predictions. Some AI applications may not need this particular aspect to be developed, but it is needed in order to build a full, autonomous mind to create animal-like behavior.

This third layer of the system, I admit, is the least developed. Rather than describing a complete working architecture (even in the general way I've described the previous layers), I'll only be sketching a vague outline of some issues I'd like to address.

First, the third layer should be just as strongly intertwined with the previous two as they are with eachother. Mental life needs just as much directing as physical life. If left to themselves, the systems I've described have the essential behavior of a random search for patterns, moving from smaller to larger, and developing from more concrete to ever-higher levels of abstraction as the system builds on itself. The third layer should direct this patternfinding based on experience of what search strategies are most effective, and also based on what questions actually need answering in practice (what variables need predicting). Also, I've mainly been describing the learning side of the two lower layers. The application side (using the learned models to predict) is potentially just as complicated, but most of it's complication comes from interaction with the planning aspect of the third layer. The logic behind the predictions is fairly simple: predict the completion of partial objects, predict the fulfillment of learned relations. But the fulfillment of this can develop itself into complicated mental operations as more complicated objects and relations are learned. It should be able to develop the full complexity of, say, a theorem-proving program, but with a more flexible foundation.

Second, this layer has several basic tasks. It must (1) direct the two lower layers to try to come up with patterns for the consequences of actions, and experiment (act solely to find out what happens) to aid this search, (2) direct the lower layers to try to predict the best possible action in a given situation, and (3) direct the lower layers to find patterns concerning what actions achieve particular states (that we desire to achieve or avoid). These are action-based, start-state-based, and end-state-based searches. It must then (4) try to decide what actions to take at the moment based on minimizing pain/maximizing reward. In doing this, it must balance between time spent planning and time spent acting, based on how urgently action is needed.

Third, the top-level intelligent actor should not bear the burden of guiding every little action. Using the same essential mechanism that creates objects out of the data, it should create habits out of the actions. These habits should fill the role of lower-level actors in a chain of command at the top of which sits the central actor of the mind. They are semi-autonomous; they can act on their own, but only at the bidding of a higher actor. They may or may not be able to learn on their own. The top level first tackles each task itself, then finds regularities and creates habits to handle lower-level tasks for it. In this way, the highest level is always trying to move up in the level of abstraction it thinks in, boosted by more and more habits underneath.

Fourth, and finally, if the goal really is a "full, autonomous mind", then this chain of command should allow for mutiny. With a fixed top level, the system is chained to whatever definition of pain and pleasure it's first given. But if subgoals are allowed to overthrow the goals that they were intended to accomplish, then the system can define it's own concept of good and bad, independent of pain and pleasure; it likes things originally because those things help it accomplish it's goals, but then it can sometimes give up those goals and decide that the things that were only helpers are actually more important. This mechanism is very useful in humans because it allows us to go beyond our biological desires and causes us to be more likely to benefit the group; for example, we form attachments to people who help us, and are likely to help them later even when there is no "logical reason" to do so (based on self-motvated goals). (This is really a personal theory of ethics, but it's conceivable that it could be implemented in an AI.) This final feature would probably be excluded from many applications of the system, because it's preferable to have the intelligence stick to the orders it's given.

As I said in the beginning, these three layers may be presented in the opposite order. There are several ways that it could make sense in that order.

As a behavioral view: the first thing that becomes obvious about an intelligent system upon observing it is that it avoids pain and seeks pleasure. This is the goal-oriented level, and leads to ideas such as operant conditioning. Second, we realize that it is predicting the world around it (in order to avoid pain and seek pleasure). This is the rule-based layer, leading to research based in logic. Third, we realize that it has an internal representation of the world (that it uses to predict (in order to avoid pain and seek pleasure)). This is the object-level, associated with research into pattern recognition.

As an evolutionary view: the first level to emerge is direct behavior of an organism. Generally, the organism moves towards pleasurable things and away from painful things. After much evolution, and the invention of multicellular organisms and nerves, we can do some more interesting learning. At first, we don't have enough senses to deal with sensory objects, so all we can learn are rules of behavior for the world. But, eventually, we get senses complicated enough to benefit from objects. (The object level essentially relies on the fact that one bit of sensory input "maps onto" another with little change; for example, we can recognize a cat fairly well regardless of if it's image is in the center of the eye, to the left, to the right, above, or below. Similarly, we can tell a texture regardless of where on our skin we feel it.)

There are many modifications that may be made to this model. Because each part is quasi-independent, modifications can be made without compromising the whole. The object level could be extended in various ways. For example, the upWrite model admits a wider definition of "sub-objects" (the parallell of static objects) in which any mapping of a larger number of units into one unit can qualify; this includes (in particular) using gaussian spreads as objects that represent a number of points. The rules level could be altered quite extensively, perhaps admitting probabilistic methods such as bayesian networks or markov models. (As non-discrete objects, these would be harder to find patterns in, weakening the power of admitting grids as data; however, predictive performance would be improved by storing more exact probabilistic information about the data.) The behavior level needs to be further fleshed out and made exact, and could be pushed in many different directions. Overall, the framework could admit many different techniques of AI into one overall pattern.

## Thursday, December 21, 2006

Subscribe to:
Posts (Atom)