Let's take another look at the two statements I wanted to make about probability.

1. A probability asserted by a human is really a believed frequency.

2. A probability is a statement of uncertainty that could always be turned into certainty given more information.

If the second statement is true, then any probabilistic model is inherently incomplete. This means that there are no single-event probabilities; any "believed frequency" I assert is wrong unless it's 0 or 1, because a single event can only happen or not, and it's meaningless to give a probability.

What is meaningful however, is giving a probability based on the limited information at hand. In doing so, we give our belief about the frequency that would result from looking at all situations that have the same givens; situations that would look the same from our perspective, but might turn out differently. I'd argue that this is basically what people intend when they give such probabilities.

However, this also has some snags: I can't seriously assert that people believe such parallel situations always literally exist. People could give probability estimates in unique situations that never occurred before and may never occur again.

To get past this, I reformulate my statement:

People give probability estimates (1) based on the limited information at hand, (2) only using relevant information, and (3) ignoring potentially relevant information if it doesn't match any previous experience and so doesn't help predict.

The first addition, that only information thought to be relevant is used, helps somewhat by reducing the previously crippling number of unique situations. Now, situations can be considered the same if they vary only in ways irrelevant to the event being predicted. The other addition, however, that potentially relevant information be ignored if it turns the situation into a unique situation, is the real clincher. It guarantees that the probability estimate is meaningful.

But there are still problems.

Clause 3 above may fix everything, but it's pretty problematic from the point of view of machine learning. A prediction made by ignoring some evidence should be given lower certainty. The math there is straightforward; we have some probability of the variable being relevant, and we have no idea how it effects things if it is. We therefore weight the two possibilities, adding our prediction of what happens if the variable isn't relevant to an even wash if it is. (This is an inexact statement, but whatever.) So the result is weaker for each ignored item.

The probabilities-of-relevance here are necessary, but to fit them in to the interpretation, must be given the same sort of interpretation; in other words, they've got to be estimates based on the limited amount of relevant information and so on. The "so on" includes a potential infinite regress because we need to again weaken our statements based on any potentially relevant but new variables, and again this involves a probability of relevance, which again must be estimated in the same way, and so on. However, I'm not sure this is a problem. The reason I say this is that I see it as a series of progressively better estimates; in practice, we can cut it off at some point if we need to, and just use the coarsest possible estimates of the next-down level of probabilities. This could be reflected in a further modification:

People give probability estimates based on as much of the relevant information at hand as they can quickly decide the consequences of.

In other words, we don't compute instantaneously, so we may not use all the relevant information at our disposal, or may use some only partially (using it in the most important estimates, but making less important estimates more quickly by ignoring more information).

This basically seems to reconcile the statements I began with, (1) and (2) at the top of the page. However, I'm still not completely sure about (2). I still may want to assert that there are actual alternatives in the world.

## Thursday, December 13, 2007

## Saturday, December 08, 2007

Anyway, I've realized that in the past months, I drifted further and further into bayesianism with my thoughts on AI. That meant not only assuming that the bayesian math was the ideal that an implementation should approximate, but also putting more and more work onto Bayes Nets at the base level of the AI (and therefore spending most of my time figuring out how to extend bayes nets to get them to do what I wanted, rather than thinking about the entire system *around* the bayes nets).

So, realizing this, I put the base-level statistic on the back burner for the moment and went back to theorizing about the system as a whole. (Coming up with a great base-level statistic is still a good idea, but not of ultimate importance, to some extent.)

Here's where I was before I got distracted:

0. (base level) The system uses some set of base-level patterns; these can be almost anything, or a combination of several things, but we must require that there aren't any "cheaters": patterns self-report how well they fit particular data, so some pattern could just claim to be perfect. In other words, the base-level statistic needs to be somewhat smart to start out. We apply the base-level statistic to the data, recording where each possible pattern succeeds and fails. (For example, if we're using neural nets, what we're doing is going through every possible neural net and recording how well it fits each spot in the data. In practice, we can't do that, so we focus on neural nets that we get via standard training algorithms on local data subsets.)

1. (first wraparound) The base level statistic is applied back onto itself: it is used to predict when particular patterns work well and when they work badly. It then is applied back on itself again, to see where these second-order patterns work well and badly. And so on. This forms a hierarchy of patterns, which is good for two things: (1) it can increase the expressive ability of the base-level statistic; for example, if the base-level only looks for correlations between two variables, the hierarchy represents larger correlation sets; (2) it can increase the compressive ability of the models; for example, if the base-level simply records common sequences that can get arbitrarily long, there's no increase in expressive ability, but there is a great potential increase in compression of models; essentially, we're adding the ability to represent repetition in these sequences. A "hierarchical pattern", as opposed to a "base-level pattern", is a pattern that can be expressed using this method.

2. (second wraparound) The system also records a global average success for each pattern. (To be more bayesian, we *could* instead keep a density function representing current belief about each pattern's probability.) These records are examined for patterns, particularly patterns in the hierarchies formed by the 1st wraparound. What are the common features of successful hierarchical patterns? For example, do several hierarchical patterns look similar to each other except for isolated difference? Can we make a fill-in-the-blank type form that summarizes them? What typically goes in the blanks, with what probabilities? I typically call this new sort of pattern a "metapattern".

3. (third wraparound) When the system learns a fill-in-the-blank style metapattern, then the possible blank-fillers should be regarded as now being "the same" in some respect. The third wraparound specifies that whenever this happens, a new property should be invented for this purpose and attached to the elements in the data that satisfy it. This new property can then be used to form new base-level patterns. This allows those things to be grouped together for future purposes, so that the system can learn new patterns that reuse the same type of variable elements without having to go through the whole process of learning a metapattern again. Think of madlibs that reuse blank types such as "noun" and "verb" over and over again. These new types of pattern are what I call "variable patterns".

This system has a few problems, but the one I'm most concerned about is the difference in style of the 3rd wraparound. To me, it doesn't seem to fit. It's not really a "wraparound", it's just a statement of availability of a particular fact to the base level. In fact, the entire series of wraparounds is a statement of where we can look for patterns. Instead, why not go from the opposite direction: let the system look everywhere, except where we explicitly state that it can't?

"Anywhere" means: when we record how well a pattern fits on a particular location, we add that as part of the data. The global record of the strength of different patterns counts as part of the data. The definition of each pattern counts as part of the data.

From this, the third wraparound emerges by itself: so long as we can draw a connection between a pattern-instance in the data and the pattern's abstract record of success, metapatterns will be seen as directly related to the base-level data, and so participation in a metapattern can be seen as a base-level property.

The first problem that occurs is that the system could find may useless correlations here, for example the obvious correlation that will exist between the definition of a pattern and what sort of data it has a high score at. So we add our first restriction:

1. Anything that we can already logically deduce, we are not interested in learning probabilistically.

This restriction cannot be enforced completely, because we'd have to completely determine everything that we can or can't logically deduce; but that's fine. Instead, the system should ignore relationships that are logically obvious: things that it can quickly deduce. (this will be dependent on many things, but the sensitivity is a good thing, not a bad thing. The fact that we must draw an arbitrary line for how long a thing takes to deduce is also forgivable in my opinion.)

So far, I haven't seen the need for any additional restrictions.

The problem now is that I'm nto sure of the generality of this method. What class of patterns is it capable of learning? Since I abandoned the Turing machine as my most general class of patterns, I don't even know how to define the class of patterns I'm concerned with. I know "statements in predicate calculus" basically covers it, but there are several concerns:

1. Is 1st-order predicate calculus good enough? Or do I need higher-order predicate calculus, in which case I'd (personally) just go to using axiomatic set theory?

2. What atomic statements do I need? Do I need equality? Probably. Do I need arbitrary predicate operators that can be defined by partially (by making statements concerning their behavior), or can all needed predicates be composed of existing properties in the data (embedded in quantified boolean-logic statements)?

Of course, I'm not considering literally using predicate logic; what I mean is not "do I need X in predicate logic?" but "do I need something in my system that would translate to X if I translated everything to predicate logic?".

Before I can even address those concerns, however, another problem jumps out: how can I represent quantifiers to begin with? First, the quantifying I'm doing is probabilistic where predicate logic is not. That seems fine, though. The real problem is that the system sort of implicitly quantifies. Patterns state what variables they quantify over, but are silent about what variables they might be holding still. The system then takes the set of variables being quantified over and sort of roughly quantifies over them all at once; but in predicate logic, the order in which we quantify is important, so the two approaches are at odds.

## Saturday, November 17, 2007

You know that something is seriously wrong when, in a science, there is as serious of talk about correctly interpreting a theory as there is work attempting to extend the theory or propose new theories. I personally think it is ridiculous that so much effort is put into "interpreting" quantum mechanics. There's the many-worlds interpretation, the ideas about what counts as an "observer", questions of whether or not a particle actually has a definite speed and location if they can't even in principle be observed simultaneously (what can't be observed doesn't exist), arguments about what it means for an event to be probabilistic, and so on. I'm frustrated both because these don't seem to be real questions (or where they are real, we should experimentally test them rather than argue), and because I want to claim obvious answers myself.

But enough of criticizing a science I know little about. Another field, which I know somewhat more about, is experiencing a similar problem: probability theory. The argument between Bayesianism and Frequentism, two separate schools of thought that have their own very different mathematics for statistics and machine learning, is essentially an argument about the meaning of probability.

Frequentists interpret probability based on the idea of random variables and physical random processes. Probability is defined as a statistical frequency; a probability for an event is the ratio at which that event will occur if we sample for a sufficiently long amount of time. For example, if we flip a coin enough times, we can determine to any desired degree of accuracy the actual ratio of heads to tails for that coin. This seems intuitive, right? This notion turns probability into a solid, objective quantity. In fact, frequentist probabilities are often called "objective probabilities".

Bayesians, however, disagree. For a bayesian, probability can also represent a degree of belief. This distinction is particularly important when an event only occurs once, or when we're talking about the probability of a statement, which can only be actually true or actually false. Frequentist probability cannot be used here, because there is no frequency to measure. For example, suppose we're back before the time of Magellan. You ask me my opinion on the shape of the Earth. I, being well-versed in the popular philosophy of the time, suppose that the earth is round; but I'm not sure. So I give it a 95% chance. From a frequentist view, this is nonsense. The earth is either flat or round; it isn't a physical random process. The frequency of a flat earth is either 1 or 0.

At this point, you're probably siding with the frequentists. The earth does not have a 5% chance of being flat-- sticking an exact number on it sounds silly, even if I want to say that there's a possibility. But let's consider an example you might sympathize with a little better. Suppose you have a friend who is purchasing a vehicle for the first time. You know that your friend (for some reason) refuses to own any car that is not either pink or purple. However, you don't know of any bias between the two colors, so before you see the car, you can do no better than assign a 50% probability to each color. Upon showing you the car, which is pink, your friend explains that the decision was made based on alphabetical order; pink comes before purple.

Now notice-- you had no way of knowing which of the two colors your friend would choose. It seems very reasonable to assign equal probabilities to each. However, such a probability does not seem to be a frequency-- if we rewound history to "try again", the friend would always reason the same way, and the new car would always be pink. Probability can only be interpreted as degree of belief here; and that's exactly what bayesians want to allow. In contrast to the frequentist objective probabilities, bayesian probabilities are called "subjective probabilities". (It is fairly common practice to admit both kinds of probability, so that one might say "The coin has an objective probability of landing on heads [since it could land on either heads or tails], but a subjective probability of being counterfeit [it either is or it isn't]".)

The battle between bayes and frequency has been a long one, but that's not exactly what I'm going to talk about. I'd like to talk about my own grapplings with the interpretation of probability. While I am (at the moment) firmly bayesian in terms of the mathematics, I actually agree with both interpretations of probability; I think that, all probabilities represent a belief about a frequency. But I also want to be able to say a few other things about probability, which I'm not sure are entirely consistent with that view.

Here's what I want to say about the meaning of probability:

(1) All probabilities are a belief about a frequency.

-Even the most obviously frequentist example of probability, coin flipping, can be thought of in this way. We never know the exact distribution of heads to tails; but we use the 50/50 estimate regularly, and it causes no problems.

-Many probabilities that seem to be fundamentally bayesian can be explained with the concept of possible universes. The frequency of a flat earth in our universe is 0. But the frequency of a flat earth in all possible universes may be higher. Even if it's not, because we can never know probabilities for certain, but only estimate them, it's entirely reasonable for someone who does not have the benefit of modern science to estimate something like 5% of all alternatives universes have a flat earth. The estimate can be improved later. (Because we can only ever visit one universe, our estimate of the distribution of possible universes will always be quite crude; so probabilities based on such estimates will have a "subjective" feel to them. But I claim that they are not really a different kind of probability.)

-When counting the frequencies, we only consider universes that match ours sufficiently; in the flat earth example, we consider alternative universes that match everything we *know* about earth, and estimate the frequency of something we *don't* know: whether the earth is flat or round. Similarly, in the example of the pink car, we consider universes that match things we know, but (being only human) are unable to use facts we don't know (such as the fact that our friend loves using alphabetical order to resolve difficult decisions). (This is called a "conditional probability"; if you think it doesn't sound very well-defined, I assure you that it is a very mathematical notion, which has been rigorously founded in logic.) This explains another reason that bayesian probabilities seem "subjective": different people are often considering different evidence (pinning down different facts) when giving probabilities.

(2) All probabilities are really just statements about uncertainty.

-I'm claiming here that the concept of a "physical random process" is philosophically unnecessary. When an event like a coin flip seems random, it is actually quite deterministic; we just don't know all the relevant factors involved. Even if we do know all the relevant factors, we aren't necessarily able to calculate their influence on the final result (at least not quickly).

-Whenever we assign something a probability, then, it's simply because we don't know all the relevant facts (or haven't calculated their influence). Quantum mechanics, for example, gives us probabilistic laws for the behavior of particles; I'm claiming that the probabilistic nature of these laws shows that they aren't the fundamental laws, and that our predictions of quantum events could in principle be improved with more information (or perhaps with a better method of calculating).

To be perfectly honest, I'm less certain about the second claim. I think both claims seem to be what's intuitively correct; but the two seem to contradict eachother.

If probabilities are statements of uncertainty, can they also be frequency estimates?

It seems possible in at least some situations. For example, when flipping a coin, we can create a frequency estimate for both heads and tails, but still claim that if we knew more, we would be able to calculate ahead of time which it would be. In this case, the frequency is based on common known variables from toss to toss (mainly the coin involved), whereas the uncertainty is caused by the unknown variables (the physics of the toss). But this doesn't necessarily solve everything.

The first conflict that I see is that the idea that there are other possible worlds, necessary to my argument for (1), seems to be ruled out by (2). If anything can in principle be narrowed down to one possibility by knowing the relevant facts, then there can be no actual alternatives! Alternatives are illusions, which can be eliminated by the diligent scientist.

Never mind the conflict with (1)! Is this view even self-consistant? (2) asserts that any question can be answered by looking at the relevant information. But this implicitly assumes that every event has a cause. The result is an infinite regress of causes. (While this isn't a direct inconsistency, it is reason to worry.)

So, I'm willing to back down on (2), instead stating two possibilities:

(A) The universe has some set of first causes. Therefore, there exist real alternatives; the universe actually could have been different. The probability of these different universes is unknown and unknowable, because we can only observe on universe, but it isn't an unintelligible concept. (1) holds but (2) does not hold if the event we're examining happens to be a first cause.

(B) The universe has no uncaused events. There is an infinite chain of causes leading to each event. (2) holds, but (1) does not always hold: the universe is deterministic, so probability cannot always be interpreted as a belief about frequency. Specifically, (1) doesn't work when we would resort to alternative universes because we don't have multiple instances of the given event in our universe.

I'm a bit torn between the two, but I prefer (A). I don't have anything against the infinite past implicit in (B), except that the entire timeline has an uncaused feel to it. Why this timeline rather than another? If (B) is correct, then there is some reason. OK. Sure. But (B) also states that this reason has a cause, and it's cause has a cause, and so on. Another infinite causal chain, over and above time. But what caused that infinite chain? And so on. The idea of infinite causes is a bit unsettling.

So is the idea of an uncaused event, to be sure. But I like being able to say that there are real alternatives, and to say that, it seems I've got to admit uncaused events.

Notice that these problems melt away if we restrict our talk to manageable islands of reality rather than entire universes. Both (1) and (2) hold just fine if we don't run into an uncaused event or an event so singular that none like it have ever occurred before or will occur again.

Can we ever know what's really true? Can we experimentally differentiate between (A) and (B)? Probably not. Then why does it matter?

Maybe it doesn't.

## Wednesday, November 14, 2007

I've started an AI club at my university, which is great. I mean, we actually didn't have one! Shouldn't every self-respecting computer science department have some crazy AI people stashed in a corner trying to change the world?

Well, we're not there yet-- our short-term goal is to start a small AI competition in a game called Diplomacy. Turns out there's a pre-existing framework for it:

www.daide.org.uk

Also, I've been looking at something that goes by various names, including "competent optimization". I'd call it "intelligent search". Based on genetic algorithms, the idea is to think while searching; more specifically, based on what's been seen so far, attempt to learn the characteristics of good solutions, thus guiding the search.

http://www.cs.umsl.edu/~pelikan/boa.html

http://metacog.org/doc.html

The idea of intelligent search is something I've thought about before, so I was both pleased and disappointed to see it already being researched. This means I can't say I invented it! To to meaningful research in the field, I've got to up the quality of my ideas :).

Of course, really, I haven't done any "meaningful research" at all yet. So it goes. Part of my problem is that I don't focus on one thing well. Also, I seem to like getting started far more than finishing. Coming up with ideas is more exciting than implementing them.

To implement competent search, or at least to implement the Bayesian Optimization Algorithm (which I guess is the current best), I'll need a Bayes Net framework. There are many to choose from, but here's the one I picked:

http://www.openbayes.org/

Probably the Bayes framework in Matlab is the best (except for speed issues), but Matlab costs money (although this Bayes framework for it is free and open source).

http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html

So I've been having a lot of thoughts about possible improvements to these intelligent search methods. Mainly, I'm trying to figure out what the theoretically perfect way of doing it would be-- that is, assuming that the only thing that takes computational resources is the actual testing of a point, and so that we can do any complicated analysis we like to decide what point to pick next.

## Tuesday, October 09, 2007

I realized recently that, while it seems reasonable to use "any computer program" as the most general type of model for the world (particularly for computer-based artificial intelligence), it certainly isn't the most general model type. Computer programs represent computable models. A more general class of models is definable models.

An example of something that's definable but not computable: all numbers that can be printed out by a program that's less than 100 kb in size. The reason this can't be computed? You might think at first that you could in theory generate all programs less than 1 kb in size, run them, and record the outputs. But you can't! This is because of the "halting problem": there is no general way of knowing how long a program will take to run, and if it will ever stop. Many of the programs you generate will contain infinite loops, and so will never produce any output (or will continue to produce output forever-- let's say we want to discard the output in such a case, because we don't know what to do with infinitely long numbers). So, basically, while you might expect the process to merely take a long time, it will actually take forever: you will never finish running the programs, so you'll never get the completed list of numbers. And if you stop early, there's always the chance that you're missing a number or two (since there is no general way to distinguish a program that's on an infinite loop from one that is just taking its time).

So, why is this important? Why do we want an AI to be able to make models of the world that are definable but not computable? Because physicists and mathematicians do it all the time, so it is obviously a basic human faculty, necessary to the understanding of the universe we live in (unless you think physicists and mathematicians have been somehow warped by their disciplines into creating these actually meaningless models).

One point of irony: because of the halting problem, most AI theories that restrict the AI's domain of models to the computable are themselves uncomputable (they must be approximated, in practice). This means that although a human could understand the general guiding principle behind the design, any AI based on such a theory would be incapable of comprehending the concepts behind its construction!

Does this mean that computers are fundamentally unable to think at a human level? Actually, no. An AI could make any model representable by formal logic, rather than any model representable by Turing machine, would be able to reason about any definable model. It would be able to understand both the principles behind a Turing-machine-based AI, and those behind itself. (Serious anti-AI folks would invoke Godel's Theorem here, though, and claim that any formal system is incapable of fully comprehending itself. This is true, but also appears to apply to humans, so isn't as concerning.)

## Wednesday, September 05, 2007

In principle, the problem is already solved-- we already know the search space, and could (again, in principle) simply look through all possible hidden variable configurations. All we need is a reasonable prior. (This is not entirely non-trivial, but we'll save it for later.) This search space, however, is huge. It seems likely that there is some redundancy here; and even if there aren't, there should be some general principles that tend to yield correct results-- hidden variables that are typically useful. This would allow the continuation of the paradigm I've followed thus far: search for the simplest patterns quickly, but search slowly for the general case.

The fundamental question here is when to check for a hidden variable. An interesting way to rephrase this: when do we accept something as a general pattern, and when do we instead think something requires an explanation? For example, if all squirrels had a striped fur pattern, most people would accept that as part of what a squirrel is. (Evolutionary biologists would be the minority asking for an explanation.) However, if a particular box made a strange tinkling noise when shook, we will probably wonder why.

I think some of what's going on is that we accept patterns when we have little hope of discovering the hidden mechanisms, but question them when we have some idea of the structures involved and so can hope to reach a solution. In the case of the box, we know that there is something hidden inside it, and that what's inside determines the noise. This is not the phenomenon I'm interested in. In this case, we already know that there is a hidden variable, and are merely learning about it. How do we come up with the hidden variables in the first place?

One strategy might be to assume that each basic variable is "masking" a hidden variable. This means that, essentially, we copy over the visible network to make a hidden network that we assume at first is fairly similar in structure. We then make reasonable modifications on the hidden network to improve its predictive power. The first type of modification would be to reduce static. Aberrations in the visible variables that have no causal effect (they do not seem to modify their surroundings) do not need to be a part of the shadow network. This may reduce the amount of past-dependance needed; where before the state a few steps ago might be necessary to predict the current state because static blotted out what was inbetween, now it is evident that the state depends on hidden steps that were masked by noise. Space-dependence may reduce for similar reasons. This is good, but so far everything I've mentioned can be done without a shadow net, by just recognizing the uncertainty in the initial variables. A more interesting effect would be the possibility of a different sort of variable pattern: a single state in a hidden variable might take the role of multiple states in the visible variables. (To this end one might purposefully experiment with shadow networks containing variables with fewer states than the actual network.) But the most promising results should come from changes to the structure of the hidden network. Several closely related variables might be dependent on a single hidden variable. Visible variables might not depend directly on their shadow-pair, but instead be related to the shadows of nearby relatives.

Not only do the hidden variables represent variable patterns, but they do so in a way that is explicit, rather than implicit. This satisfies the explicit-or requirement. How can the explicit self reference be constructed?

Here I am, searching for turing-completeness again. The truth is, there are multiple meanings of turing-complete. More precisely, something can be turing-complete concerning different domains; it can be complete from different angles. For example, the knowledge representation could be turing complete but the learning algorithm might not be. Turing-complete could mean a visible or hidden turing machine. Most importantly for the current situation, a hidden turing machine could manifest its pattern in different ways. The form of turing completeness that I considered before (calling the calculations "scratchwork") is turing completeness of the function from the past to the present. The present may be determined by arbitrary calculations involving the past. This may or may not be equivalent to the more standard idea of a turing-complete learner (from the works of Solomonoff and Hutter). The assumption there is instead that the world is the output of an unknown turing machine. The scratchwork idea makes more sense to me, because that sort of knowledge is more generalizable; the same rule is applying itself in many places (throughout time). (Actually, since the mechanism works just as well for space, it isn't specifically limited to time.) In the more standard definition, the only repetition is the recursive calculation of the end result; this is not necessarily observable in any dimension of the data, existing instead only "in theory" (reminiscent of the way scientists sometimes claim that quarks or strings or extra dimensions are "mathematical abstractions" rather than physical fact). Learning this sort of turing-complete model sounds harder (although both are technically incalculable thanks to the halting problem), and so I would think that their sense of turing-complete is "stronger" than mine (more general).

Stronger doesn't necessarily mean better, however. Another interesting sense in which an AI can be turing-complete is that it can learn that the world behaves as any turing machine if taught properly. This sort of AI is more or less useful depending on what "taught properly" means; if an extremely narrow definition is chosen, any programmable computer satisfies the definition. Slightly more interesting is the definition "teaching = explaining how the world works in some formal notation". This does not include all programmable computers because computers accept programs, not descriptions of the world. A system satisfying this more strict definition needs a planning/deduction system, to turn the knowledge into behavior. If teaching properly involves only exposure to a learnable world, then it's back to Solomonoff. But somewhere between Solomonoff and planning systems, there might be some useful possibilities. I've gone too long in this, so I'll just say that John Andreae's PurrPuss learns turing-complete models if taught with something called "soft teaching", which is similar to a parent directing a child's actions (telling the child what to do, but without telling it why). This might be another direction in which I can do a more exhaustive search of the easier learning domain, beyond merely "turing complete": quickly learn weaker senses of turing-complete, but still try to learn Solomonoff-style with a very slow search.

## Friday, August 17, 2007

Anyway, that explains the neglect of this blog.

But now, I've come up with a much more satisfying way of extending the theory to learn turing-complete patterns.

The system can learn the behavior of the "internals" of a turing machine, because a turing machine on the inside is a simply-implemented thing. The power of computation comes from applying simple rules over and over to get possibly complicated results. As I've said before, the problem comes when the internals of such a complex process are hidden from view, and the system must learn the resulting behavior without being able to see the causes. In this case, my system by itself may not be strong enough to learn the general pattern-- it could only memorize special cases.

The new solution to this is based on the idea of a "hidden variable". Hidden variables come into play both in what's called a "hidden markov model", and sometimes in Bayes Net learning, and in other places (I imagine). A hidden variable is one that is assumed to exist, but whose value cannot be directly observed through the senses, only inferred from them.

Hidden variables are obviously what's needed to make my system representationally powerful enough to define any pattern. The question is, when should a hidden variable be introduced to explain an observation? The system should be able to learn not just individual hidden variables, but infinite spaces of them; this corresponds to the theoretically infinite tape of a turing machine, and also to the very human tendency to reason about space as an essentially infinite thing. When thinking about the world, we don't assume only areas we've been to exist: we freely imagine that there is some area at any distance, in any direction. This corresponds to an infinitely extendable system of hidden variables.

Actually, the "infinitely extendable" part isn't what's hard (although it leads to some obvious hard computational issues). Once I figure out what conditions lead the system to consider the existence of a hidden variable, infinite systems of hidden variables can be inferred from the already-existing system whenever a pattern behind the finite number of hidden variables implies such a system. So the important question is when the system should create hidden variables.

Ideally, the system might first create hidden variables to represent phenomena such as an object still existing if a robot moves its camera away and then back, then create hidden variables to represent things staying the same when it moved around the room and back, and to a different room and then back... and when the world is what is moving, then hidden variables might represent what's literally hidden behind other objects. That last one is interesting, because it *could* be done with correlations between non-hidden structures, but it obviously *shouldn't* be. A pixel being in a particular state before the leading edge of an object covers it correlates with it returning to that state when the trailing edge of the object uncovers it; however, it seems far better to correlate it with a hidden variable representing the color of a particular object that can only sometimes be seen. This is particularly important if hidden objects are interacting; if they are represented as hidden objects, the system has a chance of figuring out what the result of the interaction will be.

## Tuesday, June 12, 2007

I've figured out a way to make it do so, although it's not particularly pleasing. Two modifications must be made. First, aggregate definitions must be able to contain alternatives. This is an explicit version of what happens when multiple items occur in a paticular context. Second, definitions must be able to be explicitly recursive, containing the object being defined; this is an explicit version of what happens when an object is its own context.

Interestingly, we might require that these two things can only occur at the same time-- a recursive call only occurs in an alternative, and an alternative only can occur if it contains a recursive call. The first of these two will almost always be true anyway (otherwise all instances of the object being defined will be infinitely long; this could only be true of concepts such as "string of text", and the system doesn't really need to come up with the concept of an infinite string of text, although it seems human). The second restriction does not bar turing-complete learning (as far as I know), so it doesn't change the representational power of the system; but it still would probably hurt the system's power.

There are a few modifications on this theme. Firts, of we're allowing alternatives, then why not allow denials, too? This seems like it would make a good deal of difference for some purposes and not much for others. It would make the system more difficult to compute-- rather than identifying an aggregate when we see all of its parts, we would not need to worry about seeing a lack of certain things. This is more troublesome than it sounds, because in many cases the system only knows what is, not what isn't. (Imaginary cases, of course; this is how I visualizew thew system, but that doesn't mean much about what it will be like in practice.) A denial of an abstract entity requires us not just to fail to find an instance of it currently instantiated in the location in question, but to find no instance after instantiating all possible entities in that location. Since the system is now supposed to be turing-complete, this runs into the halting problem. Ideally the system should be able to guess after looking for a good amount of time: "It doesn't seem as if there is an instance there, although I can't be sure."

Another variation: Most aggregates with explicit alternatives or self-reference should come from a recognized implicit alternative or self-reference. Can I change this "most" to "all"? I think the following version works: "All aggregate classes with explicit alternatives or self-reference either reflect aggregate classes with implicit alternatives or self-reference, or are formed by reducing the number of alternatives in such a class." The problem with the implicit alternatives is that they are too all-encompassing, embracing anything that occurs in the defining context. Making such alternatives explicit and then reducing the number of alternatives fixes this. Is that good enough? More thought is needed.

## Wednesday, May 02, 2007

In a way, it seems silly to try to learn unrestricted grammars. What does it mean? The simplest characterization is: sort through the space of all possible programs in some manner, looking for the ones that output the data so far (but which don't stop there). Use the further output of these programs to predict the future. (Either choose one that's somehow best, or somehow assign different weights to each, or treat them all equally.)

This sort of patternfinding is on a level we wouldn't even expext of a human. The output of an arbitrary computer program can be pretty hideous-- completely tangled, with every value dependent on every other. Yet, at the same time, scientists can go further than the class of "unrestricted" patterns, into the incalculable. The class of unrestricted patterns is actually defined as any calculable pattern. However, not all equations in physics are calculable. We suspect that the equations hold in all cases simply because they seem to hold in the cases we can calculate. So perhaps the "unrestricted" class of patterns isn't exactly the goal I should be after-- a general learning algorithm shouldn't be expected to learn all patterns of the unrestricted class, but it should be able to learn some that don't fall within it; patterns that are calculable only in some cases. (Here's an image of what it means for a pattern to be "incalculable": there is some sort of exact definition for it, but this definition does not imply any sort of algorithm for calculating the pattern; the only way to work out the pattern is with brute-force theorem-proving to get algorithms for any special cases that you need to know, and no such theorem-proving can ever cover all cases. Neat, huh?)

Actually, I suppose it's possible (in principle) to construct computer programs that will compute uncomputable functions. They'll simply run forever if you ask for an incomputable case. This suggests, then, that a learner of computable unrestricted patterns may very well learn some partially uncomputable patterns by accident, because it can't go through and check to see if a function is computable for every case. Neat.

But back to the more important issue: what would learning unrestricted patterns look like? Searching through the space of all programs that produce the data doesn't provide a very good image, because it's not very dissectable. For example, it doesn't allow us to ask questions like "if this past history had occured instead, what future would you predict using the same patterns?" I think a slight improvement, without loss of generality, would be to search for possible programs that make fairly good predictions of the next moment when fed the last. This can be thought of as a "computed markov model"-- rather than using a table of probabilities, use an algorithm. Obviously, stopping at 1st-order computed markov models would be a bad idea (because in real life all sensory input from one second ago will not tell you everything you need to know about the next moment by a longshot)-- the higher the order, the better. In fact, it might be desireable to search for programs that make fair preductions when fed arbitrarily-sized chunks of the past leading up to various sample moments.

A penalty for complexity, however, is also to be desired. Otherwize the program can just memorize all of the answers.

Anyway, this is an easier question to ask about the system I've been discussing: given enough data, can the system learn arbitrary algorithms for calculating the next moment from some portion of the past?

It is interesting to note that, simply because the system can learn regular grammars, it can actually learn to predict the next step in a turing machine, possible next steps in a proof, et cetera. That is, it can predict arbitrary algorithms if it is allowed to see all of the "work" inbetween input and output. However, obviously the world hides its work from the observer. So can the system learn to calculate "hidden work"? This would mean something like treating the past as an algebraic statement, and performing manipulations on that statement "off to the side" until some reduction to a particular form is acheived. Then, that form must recognized as a sign to stop, and translated (via regular means) into the answer.

Since the system could in theory learn any such "algebraic manipulations" as a new type of relation (not found in the dataset), it could perform the off-to-the-side calculations. The question is: could it learn to do so? Also, could it recognize the halt-state properly?If the answer to these two questions is yes, then it could learn the arbitrary transforms.

To recognize the halting state, and utilize the result of the performed calculation, I think it would only need context-free means. All that's necessary is to recognize the arbitrarily long calculation as a single object with an answer at the end, and store the learned fact that this answer corresponds in some simple way to the future. To recognise all the scratchwork as a single entity requires only a simple recursively defined entity: scratchwork = valid-step -> scratchwork. So can particular sorts of scratchwork be learned?

For the current system to learn some sort of scratchwork without me realizing it, it would need to support some sort of "classification cascade": a few seperate classifications can be combined to form a new classification, which can combine with others to form yet more classifications, resulting in an arbitrarily long series of additional classifications resembling a proof. Each new fact may aid in deriving yet more facts, and the cascade can go on for indeterminate amounts of time.

For this to be the case, there needs to be an infinite number of possible classifications. (Otherwize, a cascade couldn't go on for very long.) This is workable, because I allow for classifications of classifications-- and such a meta-classification can potentially be a catch-all for an infinite number of classifications. However, in it's present state, the system can't fully utilize all such classifications; the meta-classifier can be used to identify new classifiers as instances of the infinite class, but still each such new classifier must be created individually; they cannot be created on the fly. An interesting problem. To solve it, we need "anonymous classifiers" (classifiers that we don't bother to give names) defined on-the-spot for entities; all possible 1st-level classifiers that would say "Yes" for the given object are searched through, but without actually bothering to instantiate them, and we test the meta-classifiers on all of them. If one satisfies a meta-classifier, then we know that the object falls into a particular infinite class. (This could be recorded in various ways, I suppose.) A similar procedure must be applied for meta-classes once we've got metametaclasses, and so on.

This sounds a bit like gobbledygook, because I didn't fill in any details at all, but I think the general idea is sound.

Anyway, all of that still fails to guarantee that arbitrary scratchwork could be learned. The fact that I can have some infinite classes of classifiers doesn't even quite guarantee the sort of cascades I want. I've basically got to prove that the classifiers act like statements in predicate calculus. Each can be arbitrarily long, being (in a proper sense) composed of smaller statements, and interesting rules of deriving one from another can occur (meaning arbitrary production rules). After that, I've got to prove that all of this can occur through learning, and that the rules learned are in some sence optimal! Sounds difficult.

So how complicated can classifications get? In classifying classifyers, the system can use any methods it would use for classifying ordinary data. So assuming my previous arguments are correct, that means a class of classifiers can be at least context-dependent in complexity. But this is concerning the sequence associated with the data it represents; it shouldn't be confused with the complexity of the "logical statement" it represents. Can a class represent the intersection of two other classes? (yes.) Can it reppresent the union of two other classes? (Yes, provided there is some reason to form this union, such as a common context or property of any kind. That restriction may be fatal, I don't know. It seems reasonable given that this stuff has got to be learned.) Can it represent the negation of another class? (Hmm. Sounds like a good question. It depends on how I generalize markov models to arbitrary graphs. Is a subgraph defined soly by a set of relations it does contain, or can it be defined also by a lack of particular relations? I don't think allowing definition by a lack of relations would cause any trouble...) What about quantifiers? (Very good question. But what would it mean? "For all classes X, this class suchnsuch"? No. It wouldn't be quantification over classes, it would be quantification over... properties? Quantification in predicate statements isn't over smaller statements, it's over variabbles that make up statements... but it can be interpreted equivalently either way... so yes?? Joint denial over a class of statements (classes) formed by a variable-substitution scheme is all that's necessary? So if I can say such things as "for all classes A, if A satisfies (stuff including the variable A) then this class satisfies (other stuff including the variable A)" I'm good? Hmm...)

Of course, predicate calculus is only one example. If the classes can arrange themselves to imitate any turing-complete system, and learn such arrangements based on data, then I've got what I'm looking for.

## Thursday, April 26, 2007

(A regular grammar is quite minimal, only a step better than rote memorization, then a context-free grammar is less restricted, and powerful enough to represent most formal languages, but the least-restricted grammar is the class of turing-machines. A regular grammar allows repetition, rote memorization of sequences, and picking from alternatives. A context-free grammar additionally allows recursively defined concepts-- for example, the class "(), (()), ((())), (((())))..." can be recursively defined as either () or (Q), where the Q stands for any member of the class. The ability to recursively define in this manner still leaves some constraints. I won't go into that here, mainly because I'm no expert. Look up regular grammar and context-free grammar on wikipedia.)

Here's the argument. Suppose that a recursively defineable entity occurs in some context, which we will call context A. Trying to learn what happens in the context, the system will first memorize the base-cases as well as some small constructs, without realizing the larger pattern. Then, the system will recognize the base-cases within the small construct's structure. As it keeps learning, it will memorize some larger constructs, and notice the smaller constructs within the larger structure. The smaller structure's locations will be exactly analogous to the locations of the base-cases within the small cases. With enough experience, the system will relize this, and represent the fact with an aggregate reflecting the structure of the construct, having some blanks filled in not with specific sequences or with alternatives, but with the simple requirement "anything that can occur in context A can occur here, with the same probabilities." Because all properties of a class become properties of instances, the system can notice this repetition and take advantage.

Not only does this allow recursive concepts, but it does so without using recursively defined classes. This is good for simplicity of calculation. If the final aggregate class I mentioned filled in its slots with the requirement "another instance of this class" (as recursive definitions usually do) then it would be difficult to assign a probability to the aggregate. But since it instead refers to itself indirectly with "anything from context A", there's no problem; the definitions of each class (and therefore the calculations for their probabilities) is non-recursive, even though it all adds up to a recursive concept.

So (if you've read the last few posts) it seems that my system can learn regular grammars, markov models, hidden markov models, and context-free grammars. (Formal proof of each thing is admittantly wanting, especially for context-free grammars, but I'm fairly sure.) But can it learn unrestricted grammars?

## Wednesday, April 25, 2007

(In a way, this is silly. Every time I get ahold of some real information about the field, I realize how much I've merely reinvented in primitive, approximate form. The current realization is from a book Dr. Albee lent me.)

A Markov model is a table of transition probabilities between states. This is a somewhat simplistic method of patternfinding, because it is making entries in a static table. Essentially, it's a probabilistic way of recording rote repetition. In fact, if we record the probability of each sequence of some set size, we can easily use these numbers to calculate the table entries in the markov model. These sequences are what I called "aggregates" before- so finding aggregates is precisely equivalent to finding markov models, except that certain things are easier to calculate if you've been collecting aggregates while other things are easier if you've been calculating markov models.

But, somewhat more exciting:

Whereas in a markov model, one assumes that the next state of the system depends completely on the previous states, in a hidden markov model, one instead assumes that the current state depends soley on the state of a "hidden" variable. The hidden variable follows the markov assumption: it's next state is determined by it's previous states in some probabilistic manner. But the observable variables are assumed to be dependant on this hidden variable, rather than on their own past states. It's like watching a markov-system through a frosted window; we can only determine the state of the hidden variable by observing what we can see. We must learn both the markov-model that governs the state of the hidden variable, and the probabilities of different visible-variable states given each hidden-variable state. We can't actually learn one without already having some guess about the other, so learning hidden markov models is pretty difficult.

What's cool is that "variable patterns" are equivalent to hidden markov models. A variable pattern is characterized by an unknown (but guessed-at) distribution of observable state probabilities. The system must attempt to learn transitions between variable patterns at the same time as trying to learn the variable pattern's definitions. (This is reflected by the fact that aggregates can be formed completely of variable patterns.) My system tries to learn hidden markov models automatically.

My system, therefore, has different elements that are equivalent to two related patternfinding methods. It serves as an interesting go-between; it learns regular markov models and then attepts to learn hidden markov models on top of 'em. I should research further into the issue to see if there are similar things out there. However, this doesn't completely characterize the system... at least I think it doesn't. In relating the system to markov and hidden markov models, I didn't mention all of the five points from the conclusion in the last post. I'll have to think about it.

## Thursday, March 29, 2007

Basically, "2a" from the previous post was unnecessary. It will hapen anyway, as a result of carrying out the other operations. It might help things along to include it seperately, but then again it might make things needlessly complicated.

Here's how it emerges on its own:

Remember, 2a is: view the context of an object as a property of the object. This means that in addition concrete properties, such as color, shape, or whatever it may be, an object also has a "normal context" that it appears in, and this is just as important a property. When finding patterns, the system must look for predictors of redness, curviness, or whatever it may be, but also should watch out for predictors of these context-variables. This allows the system to notice if several objects that have appeared before in a particular context suddenly start appearing together in some new context, too; it can then predict that any object that might have appeared in that old context is prone to appear in the new. (This is perhaps a clearer explanation then I made in the previous post.)

The reason this isn't necessary is because we're already recording contexts of objects and finding patterns in them (#2). What we're really doing when we decide that objects appearing in the new context have a common old context is deciding that the new context and the old context appear to share properties; in particular, that they appear to share at least some items. Our probabilistic model of the behavior of contexts in general will then start making conclusions based on this similarity, which is what we want.

That makes the reduced model as follows:

-record probabilities of objects (objects being aggregates of the basic sense-pieces). These records are called "models", and are used to predict unknown data, and possibly to remove static from known data.

-interpret data by recognizing learned objects, and treat these interpretations as data (finding patterns on the objects in the same way one finds patterns in raw data).

-look at the models as data as well, attempting to predict them and remove errors in the usual way. (Even if it isn't considered important to remove static from raw data, it's unquestionably wise to attempt to remove errors from models; models definitely have potential errors, because data is almost never entirely unbiased.)

-Record the probabilities of various contexts for each object.

-Treat these context-models as data in the same way.

This can probably be simplified even further in various ways. For example, the second point might be simulated by looking for larger and larger objects with the first point, and then looking for patterns in these large objects to simplify them, and compensating in various other ways. But this would probably not be an improvement in the clarity of the system, and other possible simplifications that occur to me have the same problem; they make the abilities of the system less directly related to the immediate behavior of the system. But, of course, of a simplification is sufficiently simplifying, it's worth it.

Hey, never mind, 2a is totally necessary: a new object might be found that is constructed entirely of context-properties. For example, we might define a class of pictures that consisted of a drawing of a fruit with a drawing of some object from an office sitting next to it. Neither "fruit" nor "object from an office" are defined by visual characteristics. This may not seem like a particularly critical class of pictures for a person to be able to learn, but the fact is that a person could learn it. It is possible to learn objects composed of items that are defined only by their normal context.

Now that that reduction is invalidated, let's see if any others are possible. Can 2a replace the basic search for patterns in context-spaces, then? No. 2a relies on this search. Without it, context-spaces would be a disarray of unrelated stuff; there could barely be a 2a.

One idea comes to mind. Recording the probabilities of different aggregates and recording the probabilities of different contexts for singles or aggregates is essentially the same thing. If I had all context-probabilities for, say, a yellow pixel, I could calculate the probability of all aggregates containing yellow pixels. Likewise, if I started with all those aggregate probabilities, I could calculate the probability of each possible yellow-pixel context. This suggests, of course, dropping one or the other process. What makes me hesitant is that, although it is clear that the two are equivalent, it is not so clear that the same sorts of patterns can be found in both. The basic benefits of the two processes might be similar, but the benefits resulting from modeling each may be different. Recording contexts, for example, provides the material that is used in the step 2a we've been discussing. Can recording objects do the same?

The added variables of an object would not be context, then, but rather object-involvement; which larger objects the smaller object could fit into. (As with context, this added variable doesn't represent what the object is currently involved in, but what it can be (has been) involved in. It's a class property, not an instance property.)

This is obviously perfectly equivalent to using contexts as variables. However, there is one more consideration: what about the patterns found in contexts? Are they the same patterns that can be found in objects?

Some of them are. For example, it's possible to find smaller, repeated sub-objects within larger objects and across larger objects. A good example-- children actually learn word meanings before syllable meanings in some cases, only later realizing that particular syllables seem to have consistent meanings across words. But there is another type of pattern that doesn't seem to be covered: consistencies in probabilities of various contexts. When two different patterns occur repeatedly in similar situations, they should be deemed "the same" (so-called variable patterns or quotient-objects). This can be taken care of, however, simply by realizing that (1) I've already declared that the larger patterns an object occurs in are class variables, and (2) I've been working under the assumption that the system looks for patterns between classes.

Everything seems to add up.

Reduced system:

(1) We've got data made of basic sensory percepts linked up in some way. Each basic percept is represented by a class.

(2) We record the probabilities of various aggregations of percepts in some way. Each possible aggregate (or at least each aggregate that the system decides is worth noting) is a class.

(3) Interpretations of the data are created using the aggregates, replacing lower percepts with aggregate labels (corresponding to the aggregate classes). These interpretations can then be treated as data, and all properties of the classes are available as variables of the labeled aggregates.

(4) If a class occurs as part of a larger aggregate that is recorded, then that fact is recorded within the class; in other words, the likely contexts of a class's instances are a variable of the class.

(5) The entire range of classes is also treated as data, which the system tries to find patterns in.

Well, the discussion of the reduction turned out longer than I'd planned, so the other part will have to wait until later.

## Tuesday, February 20, 2007

Well, I keep progressively coming up with simpler ways to describe the whole system, which is good. But I've also figured out that the system cannot learn any arbitrary model in the turing-complete sense, which is bad. So I'll be revising the whole thing to be more complicated soon (if I think of a solution!). But, for now, here's the shortest-yet description, which simplifies matters by using only objects rather than objects and logical rules, the two being equivalent anyway.

The data is represented as a graph with labelled directed edges-- for those who don't know graph theory, that's a network of points connected by arrows that come in different types. The different types of arrows (a.k.a. the different labelings of the edges) represent different logical relations between the nodes, which represent single things (of a general sort).

These relations are divided into spatial relations and property relations; spatial relations point to other data-objects (the archtypical spatial relation being "next-to"), while property-relations point to nodes representing values of properties. If the data is a digital photo, for example, the spatial relations would be "above", "below", "right-of", and "left-of". The property relations would be "red" "green" and "blue", and would point to non-data nodes representing every number from 0 to 255. (These number nodes would have their own spatial relations: one-bigger-than and one-smaller-than.)

-----------

1:

The first operation on this space is to make models of it. These models are constructed by learning objects in the space. "Learning objects" means recording probabilities for subgraphs (sub-networks). These probabilities can be recorded in various ways. The roughest way would be to use some threshhold, above which an object is considered "salient"-- thus, a model is a list of salient objects. (I've developed some math for deciding this threshold, but I won't go into it.) We can keep more information by actually recording probabilities for objects. We can keep even more by recording probability density functions (which give a probability that each possible probability value is the real probability value). The system works about the same regardless of choice (it just takes longer to process things and gives more accurate results).

1a:

Once we construct objects on a data space, we can create a new space in which these objects are the data, and objects on this space can be learned. In other words: once we've used the basic data as peices to put together larger objects, we can use these larger objects as pieces to put together even larger objects, et cetera, iterating on itself continually. So the "first operation" loops back on it's own output.

1b:

Models are also considered new data-spaces, the probabilities being the new properties to be predicted. This means that if the system has learned models for multiple comparable data-spaces, it can compare the different resulting models to try to find similarities. Based on these similarities, it can do such things as revise possible inaccuracies in the models (based on discrepencies with the general model patterns) and predict likely models for new spaces. Thus, there is a second way in which the modeling process wraps back on it's own output.

2:

The second operation we can do is to record information about the context in which each of our learned objects appear. Like recording the probabilities of the objects, this can be done in different ways. The most convenient, given (1a), is to record information about what higher objects our object finds itself appearing in, and with what frequency. These records of context serve as new data-spaces in which to find patterns. The patterns found there can be used to revise the likely contexts of objects, as well as predict likely contexts for objects where little evidence has been gathered (the likely contexts of objects that occur rarely). (1b) applies here, meaning we compare the different models for the different context-spaces to find patterns.

2a:

It may be helpful to view the context of an object as a property of the object, which is acessible to the modeling process in data. This means that proccess 1 can build objects using context as a property in addition to the regular properties. Rather than predicting the direct properties given a particular situation, the system predicts an object with particular contextual properties, which in turn allows a prediction for the actual properties of the object. This is useful if there is some pattern between the contexts in which a pattern occured before and the contexts it's occuring in now, other than the obvious "it should still be the same" pattern; in other words, if some transformation has occured mapping the contexts to a new distribution.

-----------

And that's it. Well, not quite. I skipped over some little details.

Subscribe to:
Posts (Atom)