Thursday, December 21, 2006

As my ideas about AI get more detailed, I come up with more and more subdivisions of the same single concept. Currently, there are two halves to it: the basic datastructure that supports everything (allowing a computer to represent "fluid" thoughts that don't have definite borders, but rather form an interconnected web), and the overall architecture that any intelligent system needs to follow (or, at least, could benefit from) regardless of basic makeup. The purpose of this document is to present the overall scheme, disregarding the datastructure. Hopefully this will make the explanation more understandable (less technical). But, while reading, keep in mind that I'm only ignoring the problem of implementing these ideas on a computer to make them simple. It is a problem, not to be ignored, and I do have something of a solution.

There are two ways to present all of these ideas: starting with the internal (subjective reality inside the mind) and moving outwards, and starting with the external (behavior of an intelligent system) and moving inwards. Since the first follows more closely the order in which I came up with these ideas, I'll present them that way; but at the end I'll explain why it's perhaps more logical if you look at it the other way.

I started with the thought that all intelligence follows the basic design of looking for repeated patterns. I won't argue this point too far, since I'm just going to go and refute it later (thus flows the progression of ideas), but imagine an animal learning to see. Many repeated structures are already inborn; certainly the brain already knows how to pick out lines, and also the animal has some instinctual idea of what members of its own species look like, and what predators look like. But from the basic building blocks of colors and lines, it must construct many more complicated repeated patterns, such as the various types of plant and animal, various types of terrain, places it recognizes, et cetera. Larger repeated patterns are constructed out of smaller ones; once the animal knows many of the patterns at a given level, it can easily start to look at patterns at the next level up. Each level is treated about the same by the mind; patterns of colors and lines are not fundamentally different than patterns of landscape. The basic parts are just more complicated. So every repeated pattern we find simplifies the picture, making it even easier to find another repeat.

Now, there are some obvious problems with treating all of this as just "finding repeated patterns". Foremost, all of the repeated patterns seem sloppy, so that it's hard to decide exactly what qualifies as (say) a line. But instead of abandoning this approach as hopeless, let's see how far it gets us, and then try to fix the problems by adding to the basic idea instead of by giving up and starting over.

The nice thing about treating everything as finding repeated patterns is that it's a fairly simple thing for simple-minded computers to do. All that's necessary is to count the number of times various sequences of small sizes show up, and when one shows up more than the rest, remember it as a single unit, give it a name ("X"), and then abbreviate the data by sticking "X" anywhere that the pattern shows up. (Sequences of bigger sizes are constructed of multiple smaller ones, so we don't have to look for sequences that are too large.) This is especially easy with one-dimensional data, since we can just replace the sequence with the name. With more complicated structures, like a two-dimensional picture, or a video (effectively 3-dimensional), we can have irregularly-shaped patterns in our data, so we can't just cut out squares (or cubes). But this issue strays into the realm of my datastructure, so I won't go into it.
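
As a rough sketch (the function name and the tie-breaking rule are my own, purely illustrative), the counting-and-abbreviating loop for one-dimensional data might look like this:

```python
from collections import Counter

def find_best_repeat(data, max_len=4):
    """Count every short subsequence; return the most frequent one,
    preferring longer patterns when counts tie."""
    counts = Counter()
    for size in range(2, max_len + 1):
        for i in range(len(data) - size + 1):
            counts[data[i:i + size]] += 1
    pattern = max(counts, key=lambda p: (counts[p], len(p)))
    return pattern, counts[pattern]

data = "abcabcabcxyabc"
pattern, n = find_best_repeat(data)
print(pattern, n)                  # abc 4
print(data.replace(pattern, "X"))  # XXXxyX -- abbreviate under the name "X"
```

With one-dimensional data, abbreviation really is just string substitution, as above; the irregular shapes of higher-dimensional patterns are what push the problem into datastructure territory.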

Now, let's develop this a little further. We can come up with some mathematics for when to consider a repeat significant. Basically, we just want to know when a particular pattern is likely enough that, given the amount of data we've seen, we should expect it to crop up randomly the number of times we've counted it. If so, we can ignore it. But if we count it more times than we should have already, we pay attention.

The expected number of repeats for a given pattern is approximately the probability of that pattern occurring in data exactly its size, (1/number of possibilities)^(size of pattern), multiplied by the size of the dataset. This is an overestimate, because the pattern couldn't actually fit in that many times; however, multiplying by the maximum number of times the pattern could fit would be an underestimate, and farther off the real number (I suspect). I'd like to figure out the exact formula, but it seems likely that it would be complicated enough to merit abbreviation anyway. But, to continue: this math will give us (approximately) the number of times a pattern should repeat. If the pattern repeats more, it's worth noticing. But it would also be nice if we could get a number by which we can compare the merits of different possible repeats, so that we can select the one that is most significant. The simplest would be (total number of repeats)/(expected repeats), which works out to (total repeats * number of possibilities ^ size of pattern)/(size of data). If this number is above 1, then the pattern is worth noticing; the higher, the more worth noticing. (I would like it if this number were a probability, instead of a number that wanders above/below 1, but again I don't know the proper math, if it exists.)

In addition to being only approximate, this method also ignores the number of patterns we're looking for. If we're only checking a single sequence to see if it works, then the math is (about) right. But if we're checking all possible sequences to see if they're useful repeaters, then a good number (about half?) are likely to be validated just by chance. So, technically, we should take into account the number of patterns we're checking for. This is another area where my math could use improving.

The point of doing all of this math is simply to come up with ideal behaviors we'd like our system to approach; I'd like to have better mathematics because it would set up a more accurate theoretical ideal, but in practice, the system may do approximate math, or may follow the math nearly without "doing math" at all (perhaps following emergent/biological models).

Comparison to other systems: At this point, the system matches up with certain techniques of data compression, in which sequences that repeat sufficiently often are given a shorter name. One major difference, however, is that those systems will recognize a repeating sequence whenever it makes the size of the file smaller, while I'm trying to use more "scientific" math that only recognizes sequences that are unlikely to have repeated so many times.

Now to address the problems to this approach. As mentioned, the major difficulty is simply one of flexibility. We'll look at this first, and our attempts at fixing it will uncover some other less obvious problems.

The basic idea is that, if recognizing only one possibility causes problems, we should recognize several.

If we call the old type of repeated pattern a "static pattern", we can call the new type a "variable pattern". A variable pattern is a set of static patterns that are thought of as "the same" in some sense. The way in which they are "the same" involves how they "act"-- where they tend to pop up. Essentially, we want to look at the context of each pattern; if two patterns have the same or similar enough recurring contexts, then they are merged into a single variable pattern. The likelihood that a group of patterns becomes merged depends on how well that group differentiates itself from the rest of the patterns. (I haven't developed math specifically for this, and though I'm not against it, it will become clear that I don't have to; the idea of a variable pattern will emerge all by itself from a more general system.)

Since a variable pattern is viewed as "the same thing", we can build static patterns that have variable patterns as pieces. For example, if a variable pattern were "vowel" (standing for the static pattern a, e, i, o, or u) then the static pattern "s, w, vowel, m" would match the three words "swim", "swam", and "swum" (along with the two non-words "swom" and "swem"). These new, more flexible static patterns can be combined into more variable patterns, which can be combined into even more complex static patterns, et cetera. We can nest the two in each other as far as we like.
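
A toy matcher makes the nesting concrete (the names `VOWEL` and `SWIM` are mine, purely illustrative):

```python
# A variable pattern is a set of alternatives treated as "the same thing";
# a static pattern is a fixed sequence whose pieces may themselves be
# variable patterns.

VOWEL = {"a", "e", "i", "o", "u"}     # a variable pattern
SWIM = ["s", "w", VOWEL, "m"]         # a static pattern built on it

def matches(word, pattern):
    if len(word) != len(pattern):
        return False
    for ch, piece in zip(word, pattern):
        allowed = piece if isinstance(piece, set) else {piece}
        if ch not in allowed:
            return False
    return True

print([w for w in ["swim", "swam", "swum", "slim"] if matches(w, SWIM)])
# ['swim', 'swam', 'swum']
```

Since a variable pattern like `VOWEL` is just another piece, sets of these nested patterns could themselves be grouped into further variable patterns, as deep as we like.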

An aside: This handles "discrete" messiness, not "continuous" messiness. For example, it doesn't help in the least with finding patterns in digital photographs, in which we'd be hard-put to find two pixels with the exact same color value. Variable patterns can't help us too much because they can't get a grip on the situation; we can't easily say that two colors act "the same" based on what contexts they turn up in, because there's just so many possible contexts. Perhaps the variable pattern approach would eventually yield fruit simply because similar colors tend to be next to one another. But it's far more important to look at the relationships between colors; the contrasts; the gradients. I still argue that this has to come down to discrete ideas in order for a mind to be able to handle it well; for example, in classifying a color, we compare it to other colors, and make greater-than/less-than type comparisons (which are discrete). This relates to datastructure again, but the system should be able to "see" the relation between adjacent colors about as easily as it "sees" the colors themselves. This greatly increases the applicability of discrete methods. Discrete features are essentially various types of discontinuity in the data.

Comparison to other systems: the system I've been describing thus far is notably very similar to a system called "upWrite", which I learned of after developing these ideas myself. This is described in the paper The UpWrite Predictor: A General Grammatical Inference Engine for Symbolic Time Series with Applications in Natural Language Acquisition and Data Compression by Jason L. Hutchens (1999).

Continuing: We now have an idea of a system that would take in data and build up larger and larger variable and static patterns. These patterns can then be used to generate random data that will be somewhat realistic, based on the data we've fed into it. But (of course) there's something missing. The system assumes that the choice between the various possibilities is totally random. If it were trying to construct a text, say, it might capitalize letters randomly, because it decided that it was useful to see "R" and "r" as the same thing. Also, variable patterns should be able to share elements; one variable pattern might be (cat, dog, hamster, fish) while another might be (dog, canine, mutt, hound). Both variable patterns are useful for different purposes.

We might make individual solutions for these two particular problems (and it certainly might be useful for certain purposes), but there is a larger problem looming. The idea of defining sensory objects can only take us so far. Given data, our ability to predict is only based on object-completion; if we see part of an object, we predict that the rest can be found. This can take us a certain distance-- given part of a jaguar, we predict the rest of a jaguar, and run. Given the first event in a sequence of events we've seen before, we predict the future. But it's not really a logically complete way of looking at the world. Don't get me wrong-- humans aren't logical. The idea that we prefer object-completion to rule-completion can explain a good number of "illogical" things we do. (In particular, "confirmation bias" comes to mind. When asked to check the truth of a logical rule "if A then B" (where A and B each stand for something that could be true or false in particular situations), a person will tend to look only at situations in which A or B is true, and check to see if the other is true. For example, if a person is presented with a set of face-up playing cards, and told to test the statement "If a card has a mark on the back, then it is a diamond", that person would tend to check the diamond cards to see if they have marks on their backs. Logically, it's just as important to check non-diamonds-- if one of them has a mark on the back, the rule is false. But people check the diamonds first, and will often forget to check the others at all. This is illogical, but it makes sense if people are thinking in terms of objects instead of rules; the people are testing whether their object-completions based on the hypothetical object are correct.) Nonetheless, it's also useful to "conditionalize" the world-- break it up into particular situations.
This is very similar to what we need to do to look at the "context" of a pattern when looking for variable patterns; we say "if there's this pattern here, what can I expect to be around it?". Furthermore, it addresses the problem of needing different variable patterns in different contexts-- replacing it with the idea of learning different sets of possibilities for those different contexts.

So: How do we learn logical rules? Well, what we have to deal with is objects with variables. Even if we just use the raw unprocessed data, that's what we've got; it could be pixels with variable color, it could be touch sensors with variable pressure. A logical rule is a relation between variables.

The "logic" I use is actually inspired more by a system of logic developed by Lewis Carroll before his death in 1898, based on term logic, than on modern-day propositional logic or predicate calculus. He first published this in his book "Game of Logic" in 1887, then in a more complete version as "Symbolic Logic" in 1897. The basic system is to represent a Universe of Things using a grid with one dimension for each quality which may be true or false of a particular Thing. (He had a rather clever way of drawing out up to 10 true/false dimensions on a flat sheet of paper, which he gave as an alternative to Venn Diagrams.) A blank grid communicates to us no information about the Universe under discussion. (He capitalized Universe, Thing, and several other important terms in his book.) Each spot on the grid represents all Things under discussion that have the given combination of properties. Statements are made using markers on the grid: a red marker meant "There exists at least one Thing with these properties", and a grey marker meant "There does not exist any Thing with these properties". He gives an example of the universe of books, examining the properties "new" and "English". A grid with four locations results-- new and English, new and not English, not new and not English, and not new, but English. If we give the letter N to "new" and the letter E to "English", using the symbol ~ for "not", we get much shorter names: N E, ~N E, ~N ~E, and N ~E. Now we can store some logical information. For example, if we learn no new books are English, then we would place a grey marker in the N E square. If we learn that there are some new non-English books, we would place a red marker in the N ~E square.
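
A sketch of such a grid, with each cell holding a red marker, a grey marker, or nothing (the names and the dictionary representation are my own choices, just one obvious way to do it):

```python
from itertools import product

# One cell per combination of true/false properties. A red marker means
# "some Thing with these properties exists"; a grey marker means "no such
# Thing exists"; an empty cell means we don't know.
RED, GREY, UNKNOWN = "red", "grey", None

def blank_grid(n_properties):
    return {cell: UNKNOWN for cell in product([True, False], repeat=n_properties)}

# Universe of books with the properties (New, English):
grid = blank_grid(2)
grid[(True, True)] = GREY    # "no new books are English"
grid[(True, False)] = RED    # "some new non-English books exist"

print(grid[(True, True)], grid[(False, True)])  # grey None
```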

What I like about this scheme, partly, is that it is oriented towards making mental models of the situation, rather than oriented towards making statements in some formal language (the model for all modern logic). This same issue is addressed also by my datastructure, which is meant to reflect structures in a non-lingual (or at least less lingual) way.

The only rule of deduction within Lewis Carroll's system is elimination-- for example, if I know that some old books exist, but know that no old English books exist, then I can conclude that some old non-English books exist. (In order to store "some old books exist", Lewis Carroll would put a red marker between the two squares ~N E and ~N ~E. For my purposes, it seems more convenient to store it in a separate grid with only the squares N and ~N. Every grid of a certain number of variables would have up to two to that power (minus one) such smaller grids for storing more general pieces of information; but only if needed, of course.) Of course, since (unlike the system developed by Lewis Carroll) we will actually need to deal with instances of the Things, we have another sort of deduction: we take anything we know about a particular object, and use the information stored in the grids to try to deduce more. There are two strategies we can use when doing this: either "everything is possible unless we know otherwise", or "everything is impossible until proven otherwise". The first pays attention only to possibilities that have been ruled out, using them to eliminate possibilities for the particular object. The second pays attention only to possibilities known to have examples, admitting those possibilities for the example case. Thus, we have three truth-values for deduced facts: known impossible, known possible, and unknown.
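
Both sorts of deduction can be sketched in a few lines (the helper names are mine; `eliminate` follows Carroll's rule on the old-books example, and `classify` gives the three truth-values):

```python
RED, GREY = "red", "grey"

def eliminate(some_thing_exists, cells):
    """Carroll's elimination: if some Thing exists somewhere in this set of
    cells, and all but one cell is known empty, the remaining cell must be
    occupied. Returns that cell, or None if nothing can be concluded."""
    if some_thing_exists:
        open_cells = [c for c, mark in cells.items() if mark != GREY]
        if len(open_cells) == 1:
            return open_cells[0]
    return None

def classify(marker):
    """Three truth-values for a particular combination of properties."""
    return {GREY: "known impossible", RED: "known possible"}.get(marker, "unknown")

# Some old (~N) books exist, but no old English books exist:
cells = {("~N", "E"): GREY, ("~N", "~E"): None}
print(eliminate(True, cells))  # ('~N', '~E'): some old non-English books exist
print(classify(GREY))          # known impossible
```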

With many variables that we'd like to examine, we can do better than true/false representations. For example, representing a pixel's color in true/false format would mean a huge number of dimensions: one yes/no question in the form "Is this pixel ___?" for every single color. In such cases, a "yes" for one of the question means a "no" for all of the others. This means the many variables can be combined to one (or, in the case of color, the customary three may be more convenient).

A logical rule, then, is represented as a conditional grid: limitations on existence and nonexistence that occur in given situations. Another way to think of this is as a partial grid; we store the information about just a particular slice of the grid, defined by limiting our interest to certain values of certain variables, and storing information about what happens to the other variables.

Now to get back to our central topic, the issue of learning. How does the system decide on its own where to place those red and grey markers?

When deciding what's possible (trying to place red markers), we already have direct evidence of what exists and what doesn't exist. Given any number of variables, we can fill in combinations that have occurred from memory. This can be thought of as a sort of long-term memory; the short-term is the actual data, in which we can do as much looking around for patterns as we like, while the long-term only holds specific combinations of variables that have occurred. The more variables we record the memory under, the more details we will have. Once the short-term version has been deleted, we will only have access to what's been recorded. So the size of the grid that the mind generally stores memories on determines the detail of the memories. (Certain memories deemed more important might be stored on larger grids, of course.)

In concluding that certain combinations of variables are impossible (placing grey markers), we look at combinations that haven't appeared yet, and ask if they should have appeared statistically. The more variables we're looking at, the more evidence we need; we can conclude more quickly, for example, that a particular flower doesn't open when it's cloudy than that the flower never opens when it's cloudy, the ground is dry, and the temperature is below 60 degrees Fahrenheit. Of course, if we can conclude the first, then the second is simply obvious. But otherwise, the second takes longer for us to conclude because the total combined event (flower opens + temperature is below 60 + ground is dry + it's cloudy) is less likely in the first place; we wouldn't expect to see it for some time even given random data.

Again, I'd like to develop the math behind this more fully. At the most basic level, we should expect to wait a number of instances equal to the number of possibilities to see one particular possibility given random data. However, this again fails to take into account the fact that we're searching for multiple such patterns; about half of the patterns we look for will be validated randomly with such an approach.
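
At that most basic level, the waiting time is just the number of joint possibilities (a sketch; the function name is mine, and it ignores the multiple-comparisons caveat just mentioned):

```python
def evidence_needed(possibilities_per_variable):
    """Instances we'd expect to wait, given random data, before one
    particular combination of variable values shows up once: the product
    of the number of possibilities for each variable involved."""
    total = 1
    for n in possibilities_per_variable:
        total *= n
    return total

# Two two-valued variables (flower opens? / cloudy?) versus four of them:
print(evidence_needed([2, 2]))        # 4
print(evidence_needed([2, 2, 2, 2]))  # 16: the fuller conjunction needs more data
```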

The type of complete grids we look at are generally the relation between adjacent pieces of data. For example, if our data is organized by time, we might want to look at the relationship between two adjacent moments, or possibly three or four. This means that we construct an abstract space representing the "universe" of all neighboring-moment pairs, and make statements about them using Lewis Carroll's method. For spatially organized data, we can look at more complex arrangements of neighbors; any shape that fits in the given number of dimensions. For example, we might look at circles (or spheres), squares (cubes), or nebulous and disconnected shapes, as we see fit. We could even look at situation-dependent shapes; for example, on a map with attached data, we might check for a relationship between towns that are "on the opposite side of a mountain". Some towns will have such a neighbor, and some will not; and which town is which town's neighbor will depend on the geographical context, not on a preset grid. Also, we don't need to look at all the stats for the neighbors we're examining; we might be looking at the relationship between one point's elevation and the moisture levels in surrounding points, for example.
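
For the simplest case, adjacent moments in a time series, building such a grid might look like this (a sketch under my own naming; only "has occurred" is tracked, with untouched cells left as never-yet-seen):

```python
from itertools import product

def pair_grid(series, possibilities):
    """The 'universe' of neighboring-moment pairs: mark which
    (previous moment, next moment) combinations have actually occurred."""
    grid = {cell: False for cell in product(possibilities, repeat=2)}
    for prev, nxt in zip(series, series[1:]):
        grid[(prev, nxt)] = True  # a red marker: this pair exists
    return grid

weather = ["sun", "sun", "rain", "rain"]
grid = pair_grid(weather, ["sun", "rain"])
print(grid[("sun", "rain")])  # True: observed
print(grid[("rain", "sun")])  # False: never yet seen
```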

The idea is that the computer should look through these different possible relations, starting with the simplest, and moving to the more complex as time allows. In practice, though, the number of neighbors that we look at at once is very limited, due to the expensiveness of tabulating the exploding number of points on the record-keeping grids. Each new variable we add in multiplies the grid size by its number of possibilities, so we can only really look at a few.

But the idea of conditional grids, and our previously-formed object system, come to our aid. Rather than look at all possible situations, limiting ourselves to only a relatively small number of variables, we can use our previously created set of static patterns as a set of "important situations"-- situations that pop up again and again. We then look at any important relational effects that spring up in these situations. We can look for these effects at multiple levels; we might look for effects in the raw data surrounding the pattern, or we could look instead for effects in more abstract interpretations of the data as objects. If a static pattern has any variable patterns in its makeup, we can look at these as if they were actual variables of basic data items, looking for relations among them or between them and the surrounding context.

Also, this additional framework for logical relations allows for the concept of variable patterns without them being pre-existing in the system. As stated, variable patterns are defined by having similar contexts. The ability to conditionalize contexts lets us examine the context of an individual pattern. What we want to do is call two patterns "the same" if they have contexts that are similar in some way. If we view the relational grids as just more data, then we can find static patterns (and variable patterns and rules) in just the way we would in the basic data. Two patterns are "the same" if their grids pattern similarly. This also allows for overlapping classifications (like the "dog" earlier that could be classified as "word for pet" or "word for canine"); the various patternings in the grids can be complicated, and may have different sorts of similarities and differences.
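
One crude way to sketch "similar contexts" is to compare the neighbor-pairs each pattern occurs in (the function names and the overlap measure are my own choices, not a claim about the right math):

```python
from collections import Counter

def context_profile(data, symbol):
    """Count the (left neighbor, right neighbor) pairs around a symbol."""
    profile = Counter()
    for i, s in enumerate(data):
        if s == symbol and 0 < i < len(data) - 1:
            profile[(data[i - 1], data[i + 1])] += 1
    return profile

def context_similarity(data, a, b):
    """Fraction of shared contexts; a high value suggests merging a and b
    into one variable pattern."""
    pa, pb = context_profile(data, a), context_profile(data, b)
    shared = sum((pa & pb).values())
    total = sum((pa | pb).values())
    return shared / total if total else 0.0

data = list("xayxbyxayxby")
print(context_similarity(data, "a", "b"))  # 1.0: both always appear between x and y
```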

This idea of treating the grids as data, I think, is quite powerful. If the grids formed from the basic data can be treated as data, then they have grids of their own, which are data and have data, and so on, as far as we please. This reaching upwards into greater abstraction should be not only a result of the search for pattern, but a driving force; for example, if such a higher pattern suggests that some particular combination of variables is impossible, despite it not being proven impossible in the conventional data-driven way, the system should still decide that that combination is probably not possible. Patterns developed at this level could perhaps initiate a search for lower-level proof.

Now we have two (heavily interacting) layers: objects (static and variable patterns) and rules (relational patterns). This allows us to take in data and build up predictive models of the data. However, it doesn't account for acting on these predictions. Some AI applications may not need this particular aspect to be developed, but it is needed in order to build a full, autonomous mind with animal-like behavior.

This third layer of the system, I admit, is the least developed. Rather than describing a complete working architecture (even in the general way I've described the previous layers), I'll only be sketching a vague outline of some issues I'd like to address.

First, the third layer should be just as strongly intertwined with the previous two as they are with each other. Mental life needs just as much directing as physical life. If left to themselves, the systems I've described have the essential behavior of a random search for patterns, moving from smaller to larger, and developing from more concrete to ever-higher levels of abstraction as the system builds on itself. The third layer should direct this pattern-finding based on experience of what search strategies are most effective, and also based on what questions actually need answering in practice (what variables need predicting). Also, I've mainly been describing the learning side of the two lower layers. The application side (using the learned models to predict) is potentially just as complicated, but most of its complication comes from interaction with the planning aspect of the third layer. The logic behind the predictions is fairly simple: predict the completion of partial objects, predict the fulfillment of learned relations. But the fulfillment of this can develop itself into complicated mental operations as more complicated objects and relations are learned. It should be able to develop the full complexity of, say, a theorem-proving program, but with a more flexible foundation.

Second, this layer has several basic tasks. It must (1) direct the two lower layers to try to come up with patterns for the consequences of actions, and experiment (act solely to find out what happens) to aid this search, (2) direct the lower layers to try to predict the best possible action in a given situation, and (3) direct the lower layers to find patterns concerning what actions achieve particular states (that we desire to achieve or avoid). These are action-based, start-state-based, and end-state-based searches. It must then (4) try to decide what actions to take at the moment based on minimizing pain/maximizing reward. In doing this, it must balance between time spent planning and time spent acting, based on how urgently action is needed.
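
As a very rough sketch of tasks (1) and (4) only (all names here are mine, and the reward model simply stands in for what the lower layers would learn):

```python
import random

def choose_action(actions, predicted_reward, explore_rate=0.1):
    """Sometimes act solely to find out what happens (task 1); otherwise
    take the action with the best predicted reward (task 4)."""
    if random.random() < explore_rate:
        return random.choice(actions)          # experiment
    return max(actions, key=predicted_reward)  # exploit learned patterns

# A stand-in for predictions the lower layers might supply:
reward_model = {"flee": 5.0, "freeze": 1.0, "approach": -3.0}
print(choose_action(list(reward_model), reward_model.get, explore_rate=0.0))
# flee
```

The balance between planning and acting would correspond to tuning how much computation goes into `predicted_reward` before an action is forced.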

Third, the top-level intelligent actor should not bear the burden of guiding every little action. Using the same essential mechanism that creates objects out of the data, it should create habits out of the actions. These habits should fill the role of lower-level actors in a chain of command at the top of which sits the central actor of the mind. They are semi-autonomous; they can act on their own, but only at the bidding of a higher actor. They may or may not be able to learn on their own. The top level first tackles each task itself, then finds regularities and creates habits to handle lower-level tasks for it. In this way, the highest level is always trying to move up in the level of abstraction it thinks in, boosted by more and more habits underneath.

Fourth, and finally, if the goal really is a "full, autonomous mind", then this chain of command should allow for mutiny. With a fixed top level, the system is chained to whatever definition of pain and pleasure it's first given. But if subgoals are allowed to overthrow the goals that they were intended to accomplish, then the system can define its own concept of good and bad, independent of pain and pleasure; it likes things originally because those things help it accomplish its goals, but then it can sometimes give up those goals and decide that the things that were only helpers are actually more important. This mechanism is very useful in humans because it allows us to go beyond our biological desires and causes us to be more likely to benefit the group; for example, we form attachments to people who help us, and are likely to help them later even when there is no "logical reason" to do so (based on self-motivated goals). (This is really a personal theory of ethics, but it's conceivable that it could be implemented in an AI.) This final feature would probably be excluded from many applications of the system, because it's preferable to have the intelligence stick to the orders it's given.

As I said in the beginning, these three layers may be presented in the opposite order. There are several ways that it could make sense in that order.

As a behavioral view: the first thing that becomes obvious about an intelligent system upon observing it is that it avoids pain and seeks pleasure. This is the goal-oriented level, and leads to ideas such as operant conditioning. Second, we realize that it is predicting the world around it (in order to avoid pain and seek pleasure). This is the rule-based layer, leading to research based in logic. Third, we realize that it has an internal representation of the world (that it uses to predict (in order to avoid pain and seek pleasure)). This is the object-level, associated with research into pattern recognition.

As an evolutionary view: the first level to emerge is direct behavior of an organism. Generally, the organism moves towards pleasurable things and away from painful things. After much evolution, and the invention of multicellular organisms and nerves, we can do some more interesting learning. At first, we don't have enough senses to deal with sensory objects, so all we can learn are rules of behavior for the world. But, eventually, we get senses complicated enough to benefit from objects. (The object level essentially relies on the fact that one bit of sensory input "maps onto" another with little change; for example, we can recognize a cat fairly well regardless of whether its image is in the center of the eye, to the left, to the right, above, or below. Similarly, we can tell a texture regardless of where on our skin we feel it.)

There are many modifications that may be made to this model. Because each part is quasi-independent, modifications can be made without compromising the whole. The object level could be extended in various ways. For example, the upWrite model admits a wider definition of "sub-objects" (the parallel of static patterns) in which any mapping of a larger number of units into one unit can qualify; this includes (in particular) using Gaussian spreads as objects that represent a number of points. The rules level could be altered quite extensively, perhaps admitting probabilistic methods such as Bayesian networks or Markov models. (As non-discrete objects, these would be harder to find patterns in, weakening the power of admitting grids as data; however, predictive performance would be improved by storing more exact probabilistic information about the data.) The behavior level needs to be further fleshed out and made exact, and could be pushed in many different directions. Overall, the framework could admit many different techniques of AI into one overall pattern.

Friday, October 13, 2006

More Metamodels

The last post discussed why we can't build an ultimate solution to AI-- an ultimate model of how to model the world. I've got to admit that I don't particularly like the fact. I want every element of my AI to follow logically from the basic purpose-- when I describe it, I want it to sound like the obvious solution, the right way of doing things. So, although we can't say what model for modelmaking is best, let's discuss some basic elements that model-models should have.

Obviously, the more types of models we can generate with our modelmaking model, the better. That's why AIXI and OOPS use Turing machines; technically, they can do anything. So a good model-model should be universal in the sense that it can come up with any particular model.

Second, we need some rating for how good each model is. Since we can come up with any model, there will be a large number of possibilities for each set of data. There are several ways to rate models. For genetic algorithms and neural nets, a common concern is how closely we match the data-- we give each model a score for percent correct on a number of problems. As I explained in the previous post, AIXI and OOPS both use the criterion of 100% matching; no errors are to be tolerated. But this criterion alone did not limit them to only one model. Similarly, with genetic algorithms and neural nets we run into trouble if we only use percent correct. The model with the highest score may be overfit: it may have memorized the answers for our particular test, but have no idea what to do for any other. The equivalent for AIXI and OOPS would be a program that generates the desired output by quoting it in its program-- if the data is "1001011010010101", then the program is "say 1001011010010101". No structure has actually been found in the data, or perhaps only an extremely small amount of structure ("say 100101101001 and then stick 01 on the end twice"). In some sense, these models are too complicated. AIXI and OOPS differ in how they define this notion of "complicated", of course. Neural nets and genetic algorithms have a different often-used solution: instead of giving the entire dataset to the program for it to train itself on, we give it only some of the data. Then once it's trained, we test it using the data that we didn't give it-- if it's just memorized the first dataset, it won't do too well on the new data.
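The withholding trick is easy to demonstrate. In this sketch (Python; the toy dataset and the two models are my own invention for illustration), a memorizing model and a structure-finding model both score 100% on the training set, but only the second survives the withheld data:

```python
def percent_correct(model, examples):
    """Utility-style score: fraction of (input, answer) pairs the model gets right."""
    return sum(model(x) == y for x, y in examples) / len(examples)

# Toy dataset whose hidden structure is parity: y = x % 2.
data = [(x, x % 2) for x in range(20)]
train, test = data[:10], data[10:]

# Model A memorizes the training answers outright.
lookup = dict(train)
memorizer = lambda x: lookup.get(x, 0)

# Model B actually found the structure in the data.
rule = lambda x: x % 2

train_scores = (percent_correct(memorizer, train), percent_correct(rule, train))
test_scores = (percent_correct(memorizer, test), percent_correct(rule, test))
```

Both models tie at 1.0 on the training data; on the withheld half, the memorizer falls to chance while the rule stays perfect.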

I call these two different measures, the percent correct and the measure of memorization, "utility" and "elegance". "Utility" measures how well the model fits the data, and "elegance" measures how likely the model is to work well for other data in addition to the original training set. The term elegance makes more sense when applied to the AIXI length prior and the OOPS speed prior than for the method of withholding some data until the end, and it's true that I'm biased towards that sort of method-- a measure of elegance based on what the model looks like, how 'nice' a model it is, not based on withholding some data.

Thursday, October 12, 2006


The basic problem in AI is to create a model for how the mind works.

The basic problem for the mind is to create models of the world.

Therefore, the basic problem in AI is to create basic models of how models work. In other words, we've got to figure out how to predict the world based on our limited knowledge of it. The basic problem is: "How do we make models?"

If there is a universal grand solution to AI, then it will be a universal way of making models and applying them to form predictions and to act on those predictions. There have been claims to this accomplishment already. Two such claims are AIXI (see "A Gentle Introduction to the Universal Algorithmic Agent AIXI", Marcus Hutter, 17 January 2003) and OOPS (see "The New AI: General and Sound and Relevant for Physics", Jürgen Schmidhuber, November 2003). Both make the following assumptions: a model is essentially a Turing machine program, and a model should reproduce the data perfectly. Apart from that, the two are very different.

AIXI uses the "length prior", which assumes that the shortest Turing machine program that can reproduce the data exactly is the best model. Because of this, AIXI requires an infinite amount of time to find this program; it cannot know whether there exists some very short program that merely takes a very long time to calculate the output unless it waits for every possible program (every possible combination of ones and zeros, each executed as if it were a program) to run its course.

OOPS uses the "speed prior" instead, which means that it assumes that the shortest program based on calculation time, not code length, is the best model. It is easy to see that there is a sort of loose correspondence between the two; shorter code will generally take less time to execute. But they are vastly different. The speed prior makes the optimal model much easier to find: we run all possible programs in parallel, starting with "1" and "0", then branching "1" to "10" and "11" and branching "0" to "01" and "00", et cetera. As we grow models in this way, some branches fall behind in creating data and some speed ahead; this is because, as we execute the branches, the programs jump around their instruction sets (the ones and zeros) and create data sort of at random, whenever the instruction set says to. Whenever one of these branches makes an output that conflicts with the data, the branch gets pruned. (Remember, both AIXI and OOPS require models to conform exactly to the data.) The branch that gets to the finish line first, having reconstructed the entire dataset from its program, wins. It is selected out of the bunch as the proper model. This may take a while if the dataset is large, but we are still much better off than with AIXI. (Note: there is more to OOPS than the speed prior. It is also a framework for creating smaller programs that get integrated into larger programs that get integrated into larger programs... et cetera. "OOPS" stands for "Optimal Ordered Problem Solver", which is meant to reflect the idea of using the speed prior first on small problems, then on larger and larger problems, integrating the smaller solution-programs as possible substructures of the larger programs, which (to abbreviate the actual methods extremely) are considered as additional options other than "1" and "0". So the more OOPS has learned, the larger and more sophisticated its programs. But we will ignore this additional structure, because it is irrelevant to the discussion.)
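The parallel search is concrete enough to sketch. Here is a toy version (Python; the three-instruction language is my own invention, far simpler than real Turing-machine programs): every candidate program runs in lockstep, one instruction per round; branches that contradict the data are pruned; and the first program to reproduce the whole dataset wins.

```python
from itertools import product

def run_step(prog, state):
    """Execute one instruction. In this toy language, '0'/'1' emit a
    bit, and 'D' appends a copy of the output so far (doubling)."""
    pc, out = state
    op = prog[pc]
    out = out + op if op in "01" else out + out
    return (pc + 1, out)

def speed_search(data, max_len=8):
    """Toy parallel search in the spirit of the speed prior: run all
    programs in lockstep, prune any whose output conflicts with the
    data, and return the first to reproduce the data exactly."""
    progs = ["".join(p) for n in range(1, max_len + 1)
             for p in product("01D", repeat=n)]
    states = {p: (0, "") for p in progs}
    while states:
        nxt = {}
        for prog, st in states.items():
            pc, out = run_step(prog, st)
            if not data.startswith(out):
                continue                  # conflicts with the data: prune
            if out == data:
                return prog               # first to the finish line wins
            if pc < len(prog):
                nxt[prog] = (pc, out)     # still running, still consistent
        states = nxt
    return None
```

For the data "10101010", the winner is "10DD": emit "1", emit "0", then double twice. It finishes in four steps, faster than any program that spells the data out bit by bit.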

Before I try to point out problems in these methods in particular, let's examine some general ideas. Both of these methods are general models of the model-making process-- not models of the world, but rather models of models of the world. Which one is right? Does it make any sense to ask the question?

Basically, what the question asks is "Which one produces better models?" The one that produces better models is obviously the better one, and if it actually produces better models than any other possible model of models, it is the best, and can be called the True Model of Models-- the universal solution to AI.

Also, we've got to ignore the fact that AIXI takes an extremely long time to calculate. This only proves that if we don't have sufficient resources, the best we can do is OOPS. AIXI might still be the Ultimate Model of Models; OOPS would merely be the easy-to-calculate alternative.

So, can we figure out which is better? Which matches the way the world works most closely? Which will make better models of our world?

This is actually a very strange question. Which one approaches the "real" model of the world most quickly? Which one will more easily happen upon the Ultimate Truth Behind Everything? Both AIXI and OOPS look for the program that will generate the universe-- the first using the length prior, the second using the speed prior.

The length prior assumes that the Ultimate Truth is a short program. The speed prior assumes that it's a fast program. Or, to put it another way, the length prior assumes that the universe was generated randomly, but favoring the short; the speed prior assumes it was generated randomly, but favoring the quick.

Now, we don't know the ultimate program for the universe. But we know a few things about it-- essentially, we know that it generally seems to follow some laws of physics that we've accumulated. So from that, we can start to look at which prior is better. What we've got to do is think of these laws as generated randomly, and try to figure out how they were generated-- we've got to try to come up with a pattern behind the physical laws.

Now, to do this in an objective way, we've got to have a scientific way of distinguishing between good and bad patterns. We can't go around suggesting models for the physical laws at random; we've got to have some criterion... a model that tells us which patterns are right and wrong.... a model to help us make models...

But wait! This means that to judge between different models of how to make models, we've got to already have a model for making models!! We can't decide which one is better unless we decide which one is better!

And that ends my narrative. For now.

Wednesday, September 27, 2006

Much Later, Much Better

AI Formalization

1. A Thing is a symbol containing a list of relations. Relations can be affirmed or denied in this list; those that are not either affirmed or denied are considered unknown. The list of relations is thus in the form of pairs and denials of such pairs; the first element of these pairs is the relation type, and the second is the value of the relation. It may also be allowable to list only the relation type, leaving the value unknown. The value of a relation is generally also a Thing, the symbol for which is given in the ordered pair. The relation points to the value, so that the whole group of Things might be thought of as a network of dots and arrows of different types.
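A minimal Python sketch of this structure (class and attribute names are mine, for illustration only). The essential point is the three-valued logic: a relation is affirmed, denied, or simply unknown.

```python
class Thing:
    """A symbol with a list of affirmed and denied relations.

    Each relation is a (relation_type, value) pair; both the type and
    the value are themselves Things, so the whole collection forms a
    network of dots and typed arrows."""
    def __init__(self, name):
        self.name = name
        self.affirmed = set()   # (relation_type, value) pairs that hold
        self.denied = set()     # pairs explicitly known not to hold

    def relation(self, rel, value):
        """Query a relation: True, False, or None for unknown."""
        if (rel, value) in self.affirmed:
            return True
        if (rel, value) in self.denied:
            return False
        return None

# A tiny network: joe --parent-of--> sue, and joe is known NOT to be
# a parent of ann; anything unlisted stays unknown.
parent_of = Thing("parent-of")
joe, sue, ann = Thing("joe"), Thing("sue"), Thing("ann")
joe.affirmed.add((parent_of, sue))
joe.denied.add((parent_of, ann))
```

Note that the relation-type ("parent-of") is an ordinary Thing here, so it could itself carry relations, as the next point requires.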

2. Relation-types are represented as Things. Thus, different relation-types may be related. For example, one relation may be the opposite of another, meaning it points in the other direction. "Opposite" would be a relation between relations.

3. The following count as variables of a Thing:
a. The relations possessed by that Thing
b. The variables of those relations
c. The values of those relations
d. The variables of those values
The possible values that these variables may take on can be called "states" (to differentiate from "values", which are defined more specifically as the Things pointed to in a relation). This recursive definition allows us to construct arbitrarily long variables that may be in different states. All the variables of any Things that a Thing is connected to are a variable of that Thing.

4. A Space of Things is defined as a Thing having one particular relation connecting it to all members of itself. (This can be any relation we choose.) Often, these members will have a special class of relations called "spatial relations" interconnecting them. If not, they are an unordered space. If they do have spatial relations, however, they may be of any form: the space could be 1-dimensional, 2-dimensional, or any number of dimensions; it could form a tree with any branching-number or no particular branching-number; it could be a tangled hierarchy; it could be a strange loop; it may be anything. The spatial relations can be stored as another special relation of the space, since relation-types are Things, and can be pointed to. (You can call this second special relation spatial-relation-ship.) The values of the spatial properties of a Thing can be called "neighbors" of that Thing; they are other Things within the space.

5. We define the properties of the Things in a space as those variables we are concerned with modeling in that space. The properties which are relations or relation values of a particular Thing can be called the "personal properties" of that Thing, while the properties that are inherited from neighbors might be called "contextual properties". Like the variables of a Thing, the properties of a Thing are defined recursively: any property of a neighbor can also be called a property of the Thing itself. The relations which point to the personal properties of a Thing can be called "Property-relations". The property-relations of a space can be stored in a third special relation of a space. (You could call it spatial-property-ship.)

6. We define a "data space" as an open collection of Things (meaning a space in which we know certain Things dwell, but in which other Things may also dwell, that we don't know about at the moment, but may know about later, as more data comes in) having defined spatial relations and property relations.

7. We define a class of Things as a Thing that organizes other Things into a set using an "instance" relation. A closed class is one determined completely by what members it has; no members can be added to or taken away. (Also known as an extensional class.) An open class is defined instead by the properties of its members, meaning it can be added to when new Things are found that satisfy those properties. (Open classes are also known as intensional classes.) In order to define the properties of an open class, we refer to a template object. A template object is a Thing (not in the data) possessing all desired properties for class-membership. We can define a property positively or negatively, by affirming it or denying it for the template-object; if something has all the positive properties and none of the negative ones, it may be added to the list of instances. (Because we list specifically that something doesn't have a particular relation-value pair, and leave the matter undecided if it is not listed explicitly, we must both see that all positive requirements are met and see that all negative requirements are specifically negated. We cannot assume that a Thing meets all requirements just because it doesn't specifically list a negative value as something it has; if the property is unknown, it is unknown.)
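Membership testing against a template can be sketched directly (Python; the data and names are illustrative, not part of the formalization). The key subtlety from the parenthetical above is that negative requirements must be explicitly denied, not merely absent:

```python
def matches_template(affirmed, denied, positive, negative):
    """Open-class membership against a template.

    positive: (relation, value) pairs the candidate must affirm.
    negative: pairs the candidate must explicitly deny -- an unknown
    property does NOT count as a denial."""
    return (all(p in affirmed for p in positive) and
            all(n in denied for n in negative))

# Template for "winged things known not to live in water".
positive = {("has", "wings")}
negative = {("habitat", "water")}

# The sparrow explicitly denies the water habitat; the penguin's
# habitat is simply unknown, so it cannot be admitted to the class.
sparrow = {"affirmed": {("has", "wings")}, "denied": {("habitat", "water")}}
penguin = {"affirmed": {("has", "wings")}, "denied": set()}
```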

8. We act as if all closed classes for a space already exist, meaning we can make open classes of closed classes with particular properties and search for such closed classes. For example, we can define the class "all pairs of Things which point to each other with relation X". This is actually an open class of all closed classes that contain two things, each of which points to the other with whatever relation is specified as relation X. In this use, it is convenient to call closed classes "aggregates"; we are searching over all possible aggregates for the particular aggregates which meet our demands. To this end, we need to define templates that contain more than one template-object; this allows us to talk in the abstract about different Things being related to one another. Two Things that point to one another would need to be represented as a template with two objects and two relations; without the ability to form such multi-object templates, it would be impossible to represent this. A class of one object can be thought of as a special case of the multi-object class. Each instance of a multi-object class is represented by a Thing that points to the Things in the data that fit the template-objects. These Things can be called the "parts" of the class instance. Each part is connected to the class-instance by a special relation with the same name as that given to the Thing standing in that place in the class template; thus, we give a unique name to every part-relation. A class formed by multiple parts may be called an aggregate class.

9. A possibility-space on some space is defined as a space of mutually exclusive classes-- classes that do not overlap by admitting the same members as each other. A well-defined possibility space is one that is formed by always specifying things about the same variables or combinations of variables. For example, if we have 2 property-variables in a space that we want to form possibility-spaces for, we can form a well-defined possibility-space by forming classes that only select a value for the first variable, or only for the second, or that always select a value for both variables. I could form possibility-spaces that are not well-defined by throwing in, say, "value A for var1", "value D for var2", and "value B for var1, value C for var2". These do not overlap, but the space is not well-defined because we do not know in general how to add more classes to the space in a systematic way. To notate which variables a well-defined space uses, we add them to its list of property-variables.

10. In a well-defined possibility-space, we call negated classes (classes known not to be in the possibility-space) "impossible", and the meaning of the negation is to signify that this value or combination of values cannot occur in the data. We say that affirmed classes (classes stated to be in the possibility-space) "exist", and the meaning of the affirmation is to say that an actual example of the class has been found in the data. The classes not talked about are "possible but unknown", meaning that as we get more data we do not know if an example will turn up or not.
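The three-valued reading of a possibility-space is simple to sketch (Python; the representation and names are illustrative):

```python
def status(space, cls):
    """Three-valued membership in a possibility-space: classes not
    mentioned either way are possible but unknown."""
    if cls in space["affirmed"]:
        return "exists"
    if cls in space["negated"]:
        return "impossible"
    return "possible but unknown"

# One variable, 'color', with classes named by the value they select.
color_space = {"affirmed": {"red"}, "negated": {"green"}}
```

An example of "red" has been found in the data; "green" has been ruled out; "blue" simply hasn't been talked about yet.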

11. A single-state variable is one that we only want to take on one value at a time; for example, a pixel is only one color at a time. (An example of a multi-state variable might be "child"; one person may have several children, but if we wanted to represent that here we wouldn't use a "children" relation to link to a list of children, rather we would use a "child" relation to link to each child. Thus the "child" relation for a person with several children would take on multiple values.) We can restrict a particular relation to one value by adding to the possibility space a negated class that has two of the relation. Since that class will match to any Thing with more than one of the relation, it has the meaning "there is no Thing with more than one of this relation".

12. A possibility-space is closed if we have a limited number of possible states for the variables involved; in other words, we have negated all but a finite set of states. We can make a possibility-space closed for a single-state variable by stating the non-existence of a class that, as its property-list, negates every value that we consider possible for that variable.

13. A well-defined possibility-space that restricts each of its variables to one value in some finite range of values may be called a regular possibility-space. The rest of the discussion will be (perhaps unfortunately) restricted to these.

14. A model of a data-space is defined as a set of regular possibility-spaces over all of the property-variables for that data-space, with dependence-information for each multivariable possibility-space. A default model for a data-space is one which includes an individual well-defined possibility-space for every single variable, with no multivariable possibility-spaces, and has no affirmed or negated members of any of the possibility-spaces (they are all empty, meaning everything is possible but unknown). More complex models are built up by affirming and negating possibilities, and also by creating multivariable possibility-spaces. When we create multivariable spaces, we can bring in contextual properties of a Thing rather than just the personal properties; a Thing's personal properties can be limited by its surroundings. When a multivariable possibility-space is created, we need to add dependence-information; this means we need to decide which variables are chosen first, and which are chosen based on those first choices. If there are any limitations on the variables, then choosing a state for one variable will tend to narrow the options for the other. This is important in the things to come. Either each variable for a space is chosen in a particular order, so that we have a first, second, et cetera, or we can group certain variables together and choose them simultaneously, meaning that we choose from all possible combinations of the two instead of choosing them one at a time. The default for dependency is choosing all variables at once in this way. Contextual properties are not included in this order; they are treated as pre-existing, already chosen. (Note: it would work fine to allow contextual properties to be dependent on personal properties, but it would be more work to calculate, and we should be able to represent all possible relationships without the added representational power. In other words, all situations in which an outside variable depends on a personal variable can be broken down into the more standard type of relationship by looking at it from the other Thing's point of view.)

15. The probability of a space given a model of that space is calculated as the product of the probabilities of each Thing in that space, which is given by the model; we calculate from the model by looking at the probability of each variable given the actual selections of the variables they are dependent on, including contextual variables (if any dependence on context has been noted). If variables are supposed to be chosen simultaneously, then we record the probability of their particular joint value (out of all available values). This is relatively simple. But in actuality, there are two possible ways to perform this calculation: we can define "all possible states" as all states that are not negated in the possibility space, or as all states that are affirmed. The second will often be much more strict than the first (and therefore give higher probabilities). The first I call the narrow interpretation of a possibility-space, the second the broad interpretation. The two methods are ultimately impossible to reconcile, because they represent different ways of thinking, and different assumptions about the world: the first represents the assumption that anything is possible unless proven impossible, and the second represents the assumption that everything not proven possible must be impossible. These two ways of thinking will be developed as semi-symmetrical opposites in the following development; no attempt will be made in the current document to suggest which type of reasoning is better in what case, and the only attempt at unification I can currently put forward is the hope of reducing the unknown by affirming/negating more of the possibilities, so that the two interpretations of a model approach each other.
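Assuming a uniform choice among whatever states count as possible, the two interpretations can be sketched side by side (Python; the names and the toy space are mine):

```python
def prob_narrow(space, all_states, observed):
    """Narrow interpretation: any state not negated counts as possible."""
    allowed = [s for s in all_states if s not in space["negated"]]
    return (1 / len(allowed)) if observed in allowed else 0.0

def prob_broad(space, all_states, observed):
    """Broad interpretation: only affirmed states count as possible."""
    allowed = [s for s in all_states if s in space["affirmed"]]
    return (1 / len(allowed)) if observed in allowed else 0.0

# Four states; "a" has been affirmed, "d" has been negated,
# "b" and "c" are possible but unknown.
states = ["a", "b", "c", "d"]
space = {"affirmed": {"a"}, "negated": {"d"}}
```

On observing "a", the narrow interpretation spreads probability over the three un-negated states (1/3), while the stricter broad interpretation concentrates it on the single affirmed state (1.0) -- higher, just as stated above.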

16. An existence-model is a model made with no negations, only affirmations, in its possibility-spaces; it is interpreted narrowly, by seeing only affirmed classes as possible. I also use the term to refer to just the affirmative part of mixed models, particularly when applying the narrow interpretation to them. An existence-model reasons about what we know exists, using the assumption that anything else does not. It has the flaw of ruling out too many possibilities.

17. An impossibility-model is a model constructed only by negating certain classes, interpreted broadly (by assuming that only negated classes are impossible). I also use the term to refer to just the negative part of mixed models, particularly when applying the broad interpretation to that model. An impossibility-model reasons about what we know to be impossible, using the assumption that everything we can't prove impossible is a possibility. It has the flaw of assuming too many possibilities.

18. The utility of a model is defined as the probability of the data given that model. The more likely a model makes the data, the more useful that model is. If a model makes the data 100% probable, then it has the highest possible utility: 1. If it makes the data impossible, then it has the lowest possible utility: 0.

19. Utility is not a complete measure of how good a model is, because it is easy to find the model that gives the data 100% probability: we simply define the current data to be the only data possible. (This can be either an existence model with one large aggregate class, or an impossibility model that rules out everything else.) This model is obviously very bad. So we need a second measure: elegance. The simpler a model, the more elegant it is. The 100% model is inevitably exactly as complicated as the data itself; this is horribly bad elegance. Elegance is actually quite easy to define: it is the probability of the possibility-space itself, under the default model. If there are 10 states total in the possibility-space, each new negation or assertion we add gets 1/10 probability, because it's calculated as if its traits were chosen randomly. (If there are multiple variables, it comes out to the same probability: 5 states times 4 states = 20 possible states, and 1/5 probability for one variable's choice times 1/4 probability for the other = 1/20 total probability for each class negated or asserted). Thus the total elegance of a possibility-space equals one over the number of states, raised to the power of the number of classes mentioned in the possibility-space, and the elegance of a model is the product of the elegance of each possibility-space involved. (A possibility-space with no negations or assertions is totally elegant.)
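The arithmetic in this definition is easy to make concrete (a Python sketch; function names are mine):

```python
def space_elegance(num_states, classes_mentioned):
    """Elegance of one possibility-space: each affirmed or negated
    class costs a factor of 1/num_states, as if its traits were
    chosen at random under the default model."""
    return (1.0 / num_states) ** classes_mentioned

def model_elegance(spaces):
    """Elegance of a whole model: the product over its spaces.
    Each entry is a (num_states, classes_mentioned) pair."""
    total = 1.0
    for num_states, mentioned in spaces:
        total *= space_elegance(num_states, mentioned)
    return total
```

A 10-state space with two classes mentioned scores (1/10)^2 = 0.01; an untouched space scores 1.0, totally elegant; and the 5-times-4 example above gives 1/20 per class, the same as computing the two variables' choices separately.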

20. Since a possibility-space is treated just as a normal space is in many ways, including how the probability is calculated, we can make models of it in the same way as well. If a space is patternistic, we can increase its probability by making a model of it that has a high utility-- a model that increases the probability of the space. Similarly, we can increase the elegance rating of a space by recognizing patterns in it. Models we make of models increase the probability (elegance) in exactly the same way models increase the probability of the data. In other words, in this system, models ARE data-- they can be reasoned about in the same way. A model of a model is called a metamodel.

21. The probability that any particular model is true is the elegance of that model multiplied by its utility. (If we've made a second model modeling the first, we multiply in its elegance to this measure. If we've made a third, it goes in as well. Et cetera.) This is a fundamentally unjustifiable step, in the sense that we cannot ever really hope to know if any particular model is actually true or false-- even assigning it a probability requires a big assumption about how the world works, and about the typicality of the data you're getting. But this step is necessary for the generation of new knowledge. The remainder of this document discusses how to use this fundamental assumption to search for new and better models. The essential goal is to maximize the probability of the model.

22. For limitation-models, the default is a blank possibility-space. The goal is to increase the probability of the data by adding limitations to that space, while avoiding inelegance by eliminating too many too specifically. Eliminating possibilities reduces possibilities for the data, making the current data more likely; but if we ever eliminate any currently existing possibilities, the data drops to 0% probability and we must abandon that model. This can happen when new data comes in that contradicts our model of the old data. The two basic courses of action are to check each possibility in a particular space as a candidate for elimination, and to search for other possibility-spaces that may be able to contain better models. Better spaces are arrived at by finding a relationship between two personal properties or between a personal property and a contextual property, in which some value of one property limits the possible values the other can take on. We can then represent this limitation by way of a possibility-space. (Note: variable dependence also should be a part of a model. The default is everything being independent. Other possibilities simply have the effect of changing the probabilities of the various classes involved in certain situations. If the distributions match better, they should be adopted.)

23. Existence models are somewhat less intuitive. The default is still an empty space, but this time that means no assertions, which means nothing is yet considered possible. So if we have any data at all, then the default model has a 0% probability because it rules out things that happen in the data, giving the model absolutely no utility. To rectify this, every existence model must at the bare minimum accept all situations that actually occur as possibilities. A second strange property of existence models is that, whereas impossibility-models tend to try to get higher utility by compromising some elegance, existence-models tend to do the opposite: compromise some utility for greater gains in the area of elegance. New models are made by making generalizations about which classes exist; in other words, by making a model of the model (a "metamodel", as I said earlier). This increases the elegance by coming up with a more abstract classification of what exists and what doesn't, but often compromises some utility by considering more situations possible. The only other type of progress is finding better spaces for which the relationship between the variables is more definite, just as with impossibility-models. (Again, variable-dependence can be included.)

And that's it. Once models have been created, they can be used for deduction and guidance of goal-oriented behavior. (This isn't trivial, since I want the system to guide its own thoughts, but a more naive non-self-guided version does follow quite simply.) The theory still has a few holes in it; some things are described above in a fairly sketchy manner, and other things are described in a more limited domain than that of all possible problems (the most obvious restriction being the restriction to regular possibility spaces). But this is the general idea. Better descriptions of what's here, along with improvements to the overall system, will of course be forthcoming.

Monday, July 17, 2006

Current State of Ideas

I have now finished resolving some issues in my system. Here is a short presentation of the current state. In terms of the older system (which had the four parts "static patterns", "variable patterns", "correlations", and "functions of data") this is essentially a completed version of stage 4, constructed in such a way as to allow the other stages to come naturally from it (rather than using it as a catch-all last stage that could do "anything else").

1. Things with Relations:

Each Thing has a unique identity, and can be further distinguished by possessing Relations of various types. These relation-types are arbitrary, created based on what's being modeled. A relation points to another Thing, and that Thing that it points to is called the Value of the relation.

2. Classes:

A Class is defined by a set of constraints on a Thing or set of Things. An Instance of a Class is a thing or set of things which satisfies these Constraints. A set of Constraints is defined by a list of Properties of a thing, or negations of such properties, connected by 'and' or 'or'. Properties may be:

-the Identity of a Thing (so that we could define "the class of all things that are this, this, or this"; in other words, we can define classes by explicitly listing elements.)

-the Identity of relations possessed by a thing (so that we could define "the class of all things which have children", for example.)

-Any properties of the value of a relation of a given type (including the identity of the value, so we could define "all children of Joe", other relations possessed by the value, so we could define "all children of people with jobs", or any other properties)

3. Searches for These Classes

We can search for these classes by testing for Things or sets of Things that satisfy the class definitions. We would normally search in a set space, such as the stream of input. When a class that is a constituent part of some larger class is found, it could trigger a search for that larger part; this would allow us to search only for the smallest parts, and the rest would take care of itself if needed.

4. Searches for New Classes

In addition to looking for already-learned classes, we must look for new ones, based largely on recognizing classes which repeat often. This process will generally be done after searching for old classes, so that new classes build on old ones, using them as constituent parts.

Wednesday, July 12, 2006

New Forum

I've started a Forum for people making their own AIs at:

I have high hopes; I believe this forum fills a need not previously filled on the web, to some extent. Real coders, not just botmasters for some particular website, getting together to talk about theory-- and not necessarily the standard academics. Could be cool.

Sunday, June 04, 2006

AI Form

Here is how my artificial intelligence is supposed to work.

1. Static Patterns

The first thing which the AI is supposed to do is recognize concrete repetitions in the data. This means data that looks as if it has been copied precisely from one place to another. This, I think, is the most obvious type of pattern when the mind is untrained in a particular field. In many cases, we are not new to a particular form of patternfinding; in fact, for all of our senses, I think it's safe to say that we have built-in things to look for.

These 'built-in things' can be classified somewhat under the heading of looking for concrete repetitions that we have seen before. This is a second thing that we do (or, if we aren't new to an arena of sense, the first): look for the old familiar patterns that have guided us in the past. The AI must have a way of learning what to look for, based on what repeated patterns have been most useful. However, this really shouldn't be thought of as fundamentally separate from looking for repeats in the first place.

Looking for repeats is very mechanical, very simple. But it is the foundation of this model of intelligence. That is why I personally like this idea: that intelligence is founded on repetition, one of the simplest, most easily programmed ideas. If it is true, then there is no barrier between the human mind and the computer. No magical component to human thought that cannot be created in a non-human.
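Just how mechanical the foundation is can be shown in a few lines. This sketch (my own, and deliberately naive) finds every length-n chunk that occurs more than once:

```python
# Sketch: the mechanical core of static patterns -- every length-n chunk
# of the data that occurs more than once.
def repeated_chunks(data, length):
    seen, repeats = set(), set()
    for i in range(len(data) - length + 1):
        chunk = data[i:i + length]
        if chunk in seen:
            repeats.add(chunk)
        seen.add(chunk)
    return repeats

repeats = repeated_chunks("the cat and the hat", 3)
```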

In fact, I offer a warning. The idea of simply taking in concrete repeated patterns can be enticing to the would-be AI programmer. Technically, given enough data, such a system should be able to learn any pattern to an arbitrary degree of proficiency. But this is like assuming that we can only repeat the names of numbers that we've heard before. Many numbers have been said before; but if I were to make up an arbitrary 12-digit number, and were able to divide and multiply it by equally new numbers (say, sixty-eight point five two four nine), would you still say I was repeating only things I'd seen before? I've seen numbers and division, sure, but I'm able to make up the specifics. Something more complicated is going on. We don't just memorize; we also abbreviate, and in doing so, generalize.

2. Variable Patterns.

The first stage of this abbreviation is to define things that can "slip" in these static patterns. Just as a static pattern is defined by its constant insides, I like to say that a variable pattern is defined by its constant outsides. Any set of objects that occurs in a similar context may be called a variable pattern. ("Similar", here, means that at least one identifiable element is precisely the same.) The search for variable patterns is similarly simple in concept. One can either tabulate the surroundings in which different perceptual objects have occurred, merging objects when some similarity is noticed, or one can look at some defining characteristic of a context, taking all things that occur in that defined context and naming them a set (for example, all things that occur after a particular thing).
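The second tactic, defining a set by a context, is the easier one to sketch. Here the "constant outside" is simply a fixed anchor word (the example data is my own):

```python
# Sketch: one variable pattern as "everything that occurs directly after
# a given thing" -- the anchor is the constant outside of the pattern.
def occurs_after(stream, anchor):
    return {stream[i + 1] for i in range(len(stream) - 1) if stream[i] == anchor}

stream = "the cat sat . the dog ran .".split()
slot = occurs_after(stream, "the")
```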

Note: defining either a variable or a static pattern should not limit the finding of others. Including something in a static repeated pattern should not limit it, and nor should inclusion in a variable pattern. Putting something in a set should not brand it forever; it is meant to facilitate further patternfinding, not to hinder other interpretations.

In practice, putting something in a static pattern or a variable pattern may well limit its use elsewhere. Patternfinding focuses on that interpretation of it. But flexibility should be included; some means of breaking down bad interpretations, and of looking at others.

The first use of variable patterns is to increase the scope of static patterns. Once something is put in a variable pattern with a few other elements, it and those others may be seen as "the same". If they are interpreted thus, the AI may be able to see an increased number of direct repetitions in the data. Once this ability is utilized, one can arrive at a model for data that can be used in the following way: the model can be unravelled from top to bottom, following directly the static patterns, and choosing randomly from the variable patterns. One then arrives at a random reconstruction of the data.
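The unravelling itself can be sketched directly. Here a model is static structure with variable slots (sets of options); reading it top to bottom and choosing randomly from each slot yields a random reconstruction of the data. All names and data are invented for illustration:

```python
import random

# Sketch: a model mixing static structure with variable slots (sets).
# Unravelling it top to bottom, choosing randomly from each variable
# pattern, produces a random reconstruction of the data.
model = ("the", {"cat", "dog"}, {"sat", "ran"}, ".")

def reconstruct(model, rng=random):
    parts = [rng.choice(sorted(p)) if isinstance(p, set) else p for p in model]
    return " ".join(parts)

sentence = reconstruct(model)
```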

But we can do even better.

3. Correlation

We now have variables in data. And variables in data can be correlated.

If a static pattern is composed of at least two variable patterns, then we can look at the data to see if information about one gives us any information about the other.

In an uncorrelated state, if we have two variables, each with a certain number of states, the total possible number of states is the product of the two variables' state-counts. But if, looking through the data, we see that fewer states actually occur, we can limit the states that the variables can take on to make our model more accurate.

Each variable can take on all of its values. In an uncorrelated state, limiting one variable to a single value should not limit the other. But if we look through the data and discover that when we hold one variable still the other becomes more predictable, we have found a correlation.
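This test is easy to state in code: count the joint states that actually occur and compare against the product of the individual state-counts. A sketch with invented weather data:

```python
# Sketch: two variables are correlated when fewer joint states occur in
# the data than the product of their individual state-counts allows.
def is_correlated(pairs):
    xs = {x for x, _ in pairs}
    ys = {y for _, y in pairs}
    return len(set(pairs)) < len(xs) * len(ys)

# "cloudy" always pairs with "rain" here: 2 joint states observed,
# 4 possible -- holding one variable still predicts the other.
sky = [("cloudy", "rain"), ("clear", "dry"), ("cloudy", "rain"), ("clear", "dry")]
```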

4. Functions of Data

There is one more way of examining the data. Actually, this method is pretty general. It's like saying "the rest". In fact, the first three methods could be explained in terms of this one; but they seem so basic that I wouldn't want to.

This method takes any function which transforms the data, and then looks for patterns in that resulting data.

This method is intended to be used with the functions learned in the previous stage. A correlation is, after all, a function. It takes in a value of one variable and returns the possible value(s) of the other. But in practice, not all functions absolutely have to be learned. Many can be (and should be) default. (Including default functions does not violate the goal of creating true intelligence if we already know that the function could be learned if necessary, given enough information.)
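A classic default function is successive differences. Applying it, and then looking for patterns in the transformed result, can expose structure that was invisible in the raw data:

```python
# Sketch: a default function (successive differences) applied to data,
# with the result examined as data in its own right.
def differences(seq):
    return [b - a for a, b in zip(seq, seq[1:])]

raw = [1, 4, 9, 16, 25]      # no element repeats in the raw data
first = differences(raw)     # [3, 5, 7, 9] -- still no repeats
second = differences(first)  # [2, 2, 2] -- a constant: a very simple pattern
```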

5. Integration

Each step creates another "interpretation" of data to be fed back into the whole process. But each also creates more, less visible data to be interpreted:

1: Each static pattern is not only part of a larger picture, but a pattern in itself. Static patterns should be examined for internal structure, not only external. Also, they should be compared to each other, to look for commonalities of form.

2: Variable patterns, too, must be examined internally and in comparison to each other. This involves a bit more: it involves first comparing several static structures to each other, and then comparing this list to other lists like it.

3: Correlations, also, may follow patterns; rather than seeking out more data, one should sometimes look at the distribution of the data at hand to find more patterns. If we can determine in what way one state of one variable is tied to a state of another variable in the ties that we have already found, we may be able to predict what is on the other side of the ties we have not yet determined.

4: Well, if I put something down for this one, it would seem a bit contrived. Only the first three need to wrap around. Not everything follows a pattern, you know.

Saturday, May 27, 2006


Always what matters is how higher concepts are tacked onto data.

Always, there must be "mappings". Time is always mapped. Space is generally mapped. To detect all patterns, all things must be mappable; a lack of mapping merely indicates a lack of common pattern. A mapping is an ability to look for commonality. In a visual grid, we would map each gridpoint to every other, so that we could search for a picture in one place or another, and recognize it as a repeat. On the other hand, we would not map them randomly; we would map the points according to the grid that they fall into, so that, say, (3,4)-(8,12) (a line) would map onto (4,4)-(9,12) or (3,0)-(8,8) or (7,14)-(12,22). We would not want to map it onto, say, a particular scattering of random points; not without evidence. Perhaps a pattern exists between that set of random points and that line; but it is not a property of the default space.

The point is that visual "pixels" (or any kind of basic sensory grain) usually do not stand on their own, to be compared en masse to every other grain, but come with some relations attached, generally forming some sort of space. A machine could eventually learn some of these mappings by such brute-force comparison; if the activity of a pixel can to some extent predict that of nearby pixels (and it should), then a set of correlations can be derived; the more similar in activity, the closer two pixels ought to be. But what then? The machine must be able to extend this "closeness" into the type of mapping that I mentioned. When a net of close pixels behaves in a particular way (either behaving similarly to each other, or following some other pattern), then the same activity in other pixel groupings must be seen as the same.

But I digress. The intelligence should have this grid already set out for it, and I only want it to be able to form it for itself to make sure that it has "complete intelligence". My original statement:

Always what matters is how higher concepts are tacked onto data.

The simplest way to tack on higher concepts is to notice direct repetition: exact likeness in two different places (two places that are "mapped", that is; related either in time or in space). The minimum requirement to find all static patterns is the ability to group any two patterns together if they look alike, whether they occur in data or as a grouped pattern. Two could then turn to three, three to four, et cetera, until all (or most) occurrences of the pattern had been found. It may be simpler, of course, to perform the full search at once, marking all instances of a pattern.

When searching for such static patterns, there may be (there are likely to be) options and compromises.

One problem: a larger pattern will likely exist that incorporates a smaller pattern, but has fewer instances. The solution here is to split the larger pattern up into various additions on the smaller pattern. The smaller pattern acts as the "core" of the larger pattern (or at least as a piece of it). Another problem: two patterns will often overlap, so that parts of the data could belong to either. In a machine mind, one could set things up so that they did not interfere with each other in doing so. But in a human mind, they will. The mind flips back and forth between the two interpretations of the data. The solution here, I think, is that both interpretations should be "kept in mind", allowing the intellect to choose between them based on their merits.

But sometimes a pattern actually requires something like overlap; for example, in finding simple patterns with numbers, one must recognize that one number is greater than (or less than) the last. Together they form a pair with a particular slope. But an elongated run with slope forms many such pairs, and each number must belong to two pairs: a pair with the number previous to it, and a pair with the number after it. One possible solution would be to allow for detection of patterns of overlap. Another might be to use a different sort of object to detect slopes and similar patterns. This object would be a "relation". The relations in question would be "higher-than" and "lower-than", possibly getting more specific (one-higher-than, two-lower-than...). Data can be transformed into sequences and then groups of such relations. Sequences of such relations should be treated just as raw data, meaning that they can be patterned according to relations as well.
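The transformation into relation sequences can be sketched directly; note how each number participates in two overlapping pairs, and how the output can itself be fed back in as data:

```python
# Sketch: re-express a run of numbers as overlapping higher-than/
# lower-than relations; the relation sequence is then data in its own
# right, and can be patterned just like the raw numbers.
def to_relations(seq):
    return ["higher" if b > a else "lower" if b < a else "same"
            for a, b in zip(seq, seq[1:])]

run = [3, 5, 8, 9, 7, 4]
rels = to_relations(run)   # a rising run followed by a falling one
```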

Relations can be learned, based on patterns that occur often in data, but they should also be default items when it comes to such things as numbers and colors. With textual data, each letter can be treated individually; and the same is true for visual data that uses a small number of possible colors, such as 16 or less. But with data that delivers quantities instead of qualities, default relations must exist. If these are numerical (and most will be), then these relations consist mainly of less-than/more-than. Specifics may be given (2-less-than, 18-more-than), but not necessarily; with colors, probably not. Relations may be only addition-based, or may be multiplication-based, or may even be power-based. In addition to numerical default relations, one could use any other imaginable scheme; for example, a variable might be three-dimensional, but not based in a grid; rather based on an alternative 3D tiling. Variable-states could be related in an arbitrary network; a hierarchy; a number-line that branches and loops; whatever.

Problems to be addressed:
(1) In a way, I'm still only talking about static patterns. Variable patterns are also necessary.
(2) I didn't actually provide any way for such relations to be learned; I only said that it would be possible.

These two problems should be related; "variable patterns" create "variables", and "relations" connect "variable states".

It all starts when two variables of the same type predict each other.

This creates a mapping of one value onto another; a "function".

A function does not necessarily mean an order; it only does so if each value is uniquely mapped onto another. Even if this is true, the order may not encompass all values; the function may only partially order the values, or may make multiple, separate orders (so that groups of values are related to eachother, but not all values can be related). Additionally, there may be multiple such mappings, due to multiple different learned sets of relations. One could imagine some confusing schemes indeed.

And, yes, "mapping" is the proper word. Relations are what orders space and time, just as we are using them now to order variables.

And now a digression. I still have not written exactly how variable patterns are learned, which is part of my original point (see first sentence). But I will now clarify the nature of mappings (or attempt it) with examples.

Chiefly, I must clarify that mapping is somewhat different from ordering. If something is mapped, but not ordered, over time, then we have many "moments", but do not know in what order they occurred. Each moment is separate, but moments may be compared. We may still look for repeated patterns in moments; moments may resemble each other either in whole or in part. Moments may be classified by their similarities. If such classifications reduce the number of options for other variables, a predictive pattern has been found.

If time is ordered, the difference is simply that a particular moment may predict other nearby (or, in principle, even farther away) moments. To look for such patterns, we "collapse" everything into an abstract idea of "the current moment" and "nearby moments". All moments fall under each category, but there is order within the broad groups; each "present" has an associated "past" and "future". We then collapse the present category by various classification schemes. When we do so, the past and future collapse likewise. If, for a particular scheme, the future or past becomes more ordered upon collapsing, we have found a pattern.

If we only want to look for instantaneous temporal influences, in which one moment affects the next, we can collapse everything into two categories: past and future. Each moment counts as both a "past" moment with an associated immediate "future" (the next moment), and as a "future" moment with an associated immediate "past" moment. Using more moments than this in the mapping simply looks for longer-term influences; for example, three units can detect influences that skip one moment before showing. For a computer scientist, this is an Nth-order Markov model, but with more complicated states than a single multivalued random variable.
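The two-category collapse is exactly a first-order Markov model, and can be sketched by counting (past, future) pairs; the weather data is invented for illustration:

```python
from collections import defaultdict

# Sketch: collapse a stream of moments into (past, future) pairs --
# a first-order Markov model with simple states.
def transition_counts(moments):
    counts = defaultdict(lambda: defaultdict(int))
    for past, future in zip(moments, moments[1:]):
        counts[past][future] += 1
    return counts

weather = ["sun", "sun", "rain", "sun", "rain", "rain"]
model = transition_counts(weather)
```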

For "completeness", we would want an AI to be able to expand the field of mapping arbitrarily far. This means to me that any moment points to the next moment and the last moment, and that any such ordered and mapped variable (be it space, time, or otherwise) is able to associate in this manner. Alternatively, one could simulate this pointing by re-accessing the larger datastructure every time. This simply means that one must remember the location of everything one is thinking about. But the choice here only affects computation speed and memory efficiency, so I'll dwell on it no longer.

So: we can have mapping without ordering. Can we have ordering without mapping?

I was wrong to say "And, yes, "mapping" is the proper word. Relations are what orders space and time, just as we are using them now to order variables.". Here we have ordering without mapping. Maybe.

Sunday, May 21, 2006

Classification of AI

This classification system is somewhat based on cybernetics, somewhat based on terminology of the AI field, and somewhat based on my own metaphysics. The paper is essentially in order of importance.

The idea is this: although the details of an AI differ, all artificial intelligence must accomplish one, several, or all of a few basic tasks. We can classify artificial intelligences by what requirements they fulfill; this tells us how "complete" they are.

I will detail the main elements by listing and then subdividing. Each level could be presumably subdivided further.

The Two Elements:

1. Knowledge-seeking
2. Goal-seeking

A simple organism starts out with only direct reaction: it metabolizes food, and performs other acts of homeostasis. Eventually, simple knowledge with direct response is incorporated; various things start to count as pain and pleasure, and a simple awareness develops. Bacteria will often know to go towards food, for example. The knowledge and the goal are not learned by the individual, but by evolution, and kept in "genetic memory".

Eventually, more complexity arises. We can characterize a mind as consisting of intertwined knowledge-base and goal-base.

Knowledge is applied to goal-oriented behavior. Without knowledge, a goal is useless; without a goal, knowledge is useless. Both need the other. These are the two complementary elements of mind.

The Four Elements:

1. Induction
2. Deduction

3. Emotion
4. Planning

Both knowledge and goals must be first created and then expanded.

Knowledge creation is called induction. It is pattern finding; it deals with identifying known objects and learning new ones.

Knowledge expansion is called deduction. It is logic; it deals with extrapolating further from patterns found in data.

Goal creation is called emotion. We create goals based on different sorts of pain and pleasure.

Goal expansion is called planning. It essentially harnesses logic to control the environment.

Induction is my chief aim. Deduction closely resembles the cold, hard logic that computers already do so well. A computer is able to use a model that we input. But to make a new model? That's something that still needs work. Induction is what interfaces directly with the outside world in the form of senses. It deals with data. An A.I. that dealt only with induction would take data in and spit out an analysis. It would perform mathematical regressions, statistical analysis, or the like. Or perhaps it would be more advanced, finding more general types of patterns as well. But, essentially, it would only provide the user with a model of the data. Neural nets are also under this category, generally, although they might be used for pain/pleasure learning.

Deduction, however, is still something of an issue. There is not one version of "logic", despite the impression people may get. There are many versions of formal logic. Some have hidden contradictions, and there is no way to know which until the contradiction is found. On the other hand, a computer can do most acts of deduction rather well. A program that only performs deduction is rather common. A calculator is one. However, much more advanced types of deduction can also be carried out. Besides advanced mathematics, we can input models of the world and get various results. These models may be formulated in various logical languages. Bayesian networks are popular now. Formal logic is arguably more general.

Goal creation is not strictly necessary for an A.I.; usually, the goal for the A.I. will be set. Even biological organisms do not really need goal creation; the simple, single goal of maximizing pleasure would do well enough. However, it is expedient to create new goals. We "like" or "dislike" things originally because they give us pleasure in some way, but we eventually disconnect the liking/disliking from the original reason. This gives us ethics, in a way. We decide that certain things are "good" and "bad", rather than simply to our advantage or against it. Our ultimate ideas of good and bad replace our sense of pain and pleasure. An AI based on emotion would probably be based on behaviorism; it would act according to the rewards and punishments received in the past. It might also go one level higher, adding fight-or-flight response and other emotions.

Goal expansion carries out the whims of the emotions. It is heavily based in deduction, but may use more particular search strategies or logical languages. It is difficult to make an AI with only goal-elements, but this type of deduction can be the focus of an AI, and has been the focus of many.

These could possibly be split up further; splitting up "induction" further, in fact, is the subject of the first post to this blog. Also, one might wish to add "data" and "actions" as elements, because the types of data and the types of action vary from AI to AI. One AI may be in a fully self-animating robot, while another may only have text output. One may read medical data, while another may read books. We get a chart something like this:

data -> { -> (induction -> deduction) -> (emotion -> planning) -> } -> actions

But there is one more element that we humans have: honesty. We have a natural tendency to say what we think. Talking is not generally a goal-oriented behavior; we do not plan everything that we say according to what it will get us, but first think to the truth. We CAN say things because we decide it will accomplish something. But our automatic action is not to do this, but to say what's on our mind.

We view language not like just any data, but like data that talks about the world. Most inputs are merely part of the world. But language is a part of the world that represents another part. For an AI to truly use language, it must have this view.

It could possibly develop naturally, without hard-wiring; an intelligence could learn that what people say can often be used to predict the world. Similarly, an AI could learn to tell the truth by default as a part of goal-creation; if it has to communicate its needs regularly, truth-telling could develop as a separate goal. However, it could also be programmed specifically, in addition to or in place of other goal-elements. A calculator is a perfect example of a truth-speaker. It is an excellent conversationalist. It always knows just what to say; it always has a response (even if it's ERROR: DIVISION BY ZERO). Any deducer could have a similar conversational model of deducing and reporting facts from things that the other person says.

"It's sunny out today."

"There must not be many clouds."

"No. It rained last night. I guess that used them up."

"If the rainwater evaporates, there could be cloud formation."

"I saw three porcupines dead on the roadside."

"At that rate, the local population will be depleted by 2006."

Such an AI could have no goals whatever. All it needs to do is reason, and decide what facts are most relevant. On the other hand, one might want both goals and a truth-saying default. This would be possible as well, though much more difficult.

Oh, and one more: copycat learning. This also may be an important element in an intelligence. Whereas truth-saying is mainly deduction, copycat learning relies mainly on induction. Many chatbot AIs use the copycat strategy to mimic human speech.

Most importantly, we have interplay. These separate elements are not really so separate. Each element is constantly active. Each depends on the others. The ultimate behavior of the AI emerges as a synthesis of many elemental activities.

Friday, May 19, 2006

AI formalization

This is the current formulation of my AI. Several things must be realized:

First, this is only the pattern-finding portion of my AI. The point here is to be able to potentially recognize any pattern in data (though in practice any implementation is limited). A "full mind" would also need deduction based on these patterns to be intelligent, and would need goals to act on in order to use this intelligence.

1. Data is made up of atoms in datastructures.

2. An atom is a sensory object that may have one or more variables.

3. These variables can take a certain number of states (be it a finite number, or a workably infinite number in which a new state can be created at whim, based on a relation to existing states).

4. A particular type of atom is defined by a particular set of variable states.

5. Datastructures are relations between many such instances of atoms. Atoms may be strung in a line, placed on a grid, or put in any order generally, depending on the nature of the data. It does not matter if these relations are temporal, spatial, or more abstract: the structure is simply adapted as necessary. There may also be a total lack of structure; atoms or larger structures can be an unordered set, rather than a spatially or temporally ordered one.

6. The variables of a particular set of data include the variables of the elements (the "atomic variables", so to speak), and any patterns that the data fits into. (Since all patterns that may describe the data are not known, this second type of variable must be added as it is spotted.)

7. A static pattern type is defined by a set of variables in data; an occurrence of such a type is the realization of those values in the data. (Most commonly, these variables are atomic, so that the static pattern is a string of atoms that can be found in the data in many places; a word that can be found repeatedly in text, for example. New static patterns can be found by searching the datastructure for repeated strings of values, be they in the atoms themselves or in more abstract regions.)

8. A variable pattern type is a set of such static patterns; any one of the possible value combinations will do to identify the pattern as belonging to the variable type. (A good example of a useful variable pattern is a set of static patterns that behave in nearly the same way, appearing in the same contexts; they are put together as a single variable pattern, allowing these several static patterns to be recognized as functionally the same thing. The variable pattern can be thought of as an uncertain value; a set of possibilities. Predictions can then be made as to what set a thing will be drawn from, when exact predictions could not be made, et cetera. Even if exact predictions can be made, though, the variable pattern can still be used; the domain and range of a function, for example, can also be thought of as variable patterns.) Any set of patterns may be used as a variable pattern; this means that a constant pattern can always be looked at in terms of where else it has appeared.

9. The set of instances of a pattern is an unordered set. (The only relations between instances are any that existed in the data, and equality of the defining variables of the pattern type.) This set, being a datastructure, is searched for patterns as well.

10. Any values not set by the pattern definition are variable patterns, containing the set of all values that have occurred in that slot as their set of values. (This includes both concrete variables and abstract variables in the data.)

11. Sub-types are constructed by setting more variables. Once this is done, more variable types may be created by listing all the values that still occur in the yet-open variable slots, despite the field being narrowed down. (The variable pattern consisting of all subtypes of a certain static pattern is equivalent to that static pattern.)

12. A function is constructed as a list of possible fixed values for one variable in a type, and the related values for another variable of the type. If fixing one variable decreases the options in another, then a correlation has been found; one variable is a function of another, carrying great deductive advantage. (This method of patternfinding in uncertainty, defined in 11 and 12 for finding functions in types, could actually be used for any comparable data situation.)

13. Patterns in functions may be found, as well. The method here is the same as in data, but focusing on describing functions in terms of simpler functions. (Simple functions are remembered as static patterns that can appear in data of certain types.) A pattern may be looked for between a particular input and its output, or, alternatively, between the outputs themselves.

14. The variables of an item include all places where that item has appeared. This includes all functions in which it has appeared. This allows expressions to be created which stand for certain values, but as functions of other values; 144, for example, could be remembered as the square of twelve, rather than as itself. (Expression-creation is an important way to represent a function in terms of simpler functions; if a pattern cannot be found for the unmodified items, perhaps a pattern can be found in their expressions (for example, each might be the square of the next-higher prime number than the last; in such a case, no simple direct relationship between the squares exists, but the pattern is much more obvious if they are all expressed as squares of some number, rather than as just bare numbers). This is essentially the same activity as finding some pattern in the variables of a list of items.)

15. Such expressions, if used as general rules, may describe an infinity. Because a pattern can be described by an expression rather than memorized, and a description is a pattern which can hold true for an infinite number of cases, a pattern may (in effect) be infinite, and still held in the simulated mind, if its (infinite) behavior can be described (finitely). This means that spaces need not be defined only as discrete data structures (as they are in #5); continuous spaces may be imagined, which contain an infinite number of points in a finite space (as is described in geometry). Such spaces are defined by the simple idea that a new point may be between any two points. This is a finite definition which holds true for an infinite number of cases. The same method may be used to define infinity ("bigger than any number" holds true for an infinite number of cases), infinitesimal ("smaller than any positive number" holds), repeating decimals ("three in all decimal places," for example), and other structures. This method is also related to "mathematical induction" and "recursive definition". (In truth, I am not entirely sure how a computer could do this.)

16. In addition to looking for patterns in raw data and in subtypes and in functions, patterns between types may be found, as well. First off, since the definition of a static type is essentially arbitrary (that is, simply memorized), we may look for patterns in it in terms of smaller elements, to find its pattern. With language, this is equivalent to splitting a word into clearly defined syllables in one's mind only after learning the word. The syllables are there implicitly, but not recognized. Also, patterns behind other variables than internal structure should be looked at; to continue the same example, this would be like trying to figure out why the syllables create a word of that meaning, based on their meaning in similar words.

17. Patterns behind patterns, as such, may lead to another very important abstraction: numbers. If several patterns exist that could be described by numbers (that is, types for doubles and triples of a few different patterns), then these patterns will be related in the fact that larger patterns of the same type contain eachother. If this relation could be seen as a pattern (perhaps that's a stretch), then the fact of it could be first established separately for a few cases (one apple/two apple/three apple -> any number apple, plus one lemon/two/three -> any number lemon), then these separate types of numbers could be consolidated into one invariant type: the number (assuming that this pattern could be seen as the same in the multiple cases). In truth, I suppose I don't think humans invent numbers in this way (as babies or whathaveyou), but that numbers are inborn; but even so, they are still constructed from more basic ideas.