Friday, June 27, 2008
I ran into some interesting parallels...
This paper gives a method for finding programs that could have generated a given piece of data, by simulating a universal Turing machine running backwards from the data to the possible programs that generated it. Since a Turing machine can destroy information while running forward (for example, if it writes a 0 to a square of the tape, we no longer know whether there was already a 0 there or a 1), running it backwards involves making arbitrary decisions (such as what to put on a square that is reverse-written-to). These different decisions lead us to different possible programs that might have created the data. We should look for elegant programs, because these explain the pattern behind the data. But of course we do not know in advance which paths lead to the most elegant programs, so we need to search.
The search space is rather large, but it is a huge improvement over trying to search the entire space of programs. Contrast this with Jurgen Schmidhuber's earlier approach, which throws out a program as soon as it produces a wrong output. The new approach avoids the incorrect programs altogether, so that the only thing left to search for is an elegant program among the correct ones.
Backtracking is also easy. All we need to do is run the simulation forwards. Since the Turing machine is deterministic, we don't need to make any decisions. We can keep going forward and backward as many times as we like and still remain in the space of programs that produce the desired output.
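To make the forward/backward picture concrete, here is a rough sketch in Python (my own toy construction, not the paper's actual algorithm) of reverse simulation for a trivial machine. The forward step is deterministic; the reverse step has to branch, because it must guess what symbol was overwritten.

```python
# A rough sketch (my own toy construction, not the paper's algorithm) of reverse Turing
# machine simulation. Forward steps are deterministic; reverse steps branch, because a
# forward write destroys whatever symbol used to be under the head.

def forward_step(delta, config):
    state, tape, head = config
    symbol = tape.get(head, 0)
    if (state, symbol) not in delta:
        return None                                   # halted
    write, move, new_state = delta[(state, symbol)]
    new_tape = dict(tape)
    new_tape[head] = write
    return (new_state, new_tape, head + move)

def reverse_step(delta, config):
    """Yield every configuration whose forward step gives `config`."""
    state, tape, head = config
    for (prev_state, prev_symbol), (write, move, next_state) in delta.items():
        if next_state != state:
            continue
        prev_head = head - move
        if tape.get(prev_head, 0) != write:
            continue                                  # that transition couldn't have written this
        prev_tape = dict(tape)
        prev_tape[prev_head] = prev_symbol            # the "arbitrary decision": what was overwritten
        yield (prev_state, prev_tape, prev_head)

# Toy machine: in state "a" it writes 1 and moves right, whatever it reads.
delta = {("a", 0): (1, +1, "a"), ("a", 1): (1, +1, "a")}
end = ("a", {0: 1, 1: 1}, 2)                          # output "11", head past the end
for prev in reverse_step(delta, end):
    print(prev, "->", forward_step(delta, prev))      # every predecessor steps forward to `end`
```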
My interest in all of this is adapting it to learning grammars. Since grammars are Turing-complete representations, this is not difficult. What is interesting, though, is that when translated, there is a strong analogy between this new method and old methods for grammar learning. The old methods are restricted to a special case called context-free grammars, so the new method is basically a generalization. To use a grammar, we apply its rules successively until we get the final product. The old methods for finding context-free grammars are interesting because what they do is essentially make reverse rule applications to construct a grammar from the end result. The analogy between this and running a Turing machine backwards is quite strong. Those methods for learning context-free grammars are also quite analogous to the crazy ideas in my earlier posts on this blog.
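As a concrete (and heavily simplified) illustration of reverse rule application, here is a sketch in the spirit of digram replacement, as in SEQUITUR; it is my own toy code, not the method from any of the papers mentioned. It builds grammar rules by reverse-applying "nonterminal -> pair of symbols" rules to the finished string.

```python
# A very rough sketch (my own illustration, loosely in the spirit of SEQUITUR-style digram
# replacement) of "reverse rule application": repeatedly replace a repeated pair of symbols
# with a fresh nonterminal, building up grammar rules from the finished string.

def induce_grammar(s):
    rules = {}
    seq = list(s)
    next_symbol = 0
    while True:
        # Count occurrences of each adjacent pair.
        counts = {}
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
        repeated = [p for p, c in counts.items() if c > 1]
        if not repeated:
            break
        pair = repeated[0]
        name = f"N{next_symbol}"
        next_symbol += 1
        rules[name] = list(pair)
        # Reverse application of the rule name -> pair: rewrite the sequence.
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                new_seq.append(name)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    rules["S"] = seq
    return rules

print(induce_grammar("abcabc"))
# {'N0': ['a', 'b'], 'N1': ['N0', 'c'], 'S': ['N1', 'N1']}
```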
I am interested in examining how these methods can be adapted in various ways, including taking some ideas from SEQUITUR to attempt an incremental version of the algorithm, and ideas from this paper to better guide the search (at least for context-free cases). That is, if I find the time...
Monday, June 23, 2008
In the previous hyperlogic post I loosely described the arithmetical hierarchy, and a "hyperlogic" to deal with it. Although I didn't describe this in detail, the definition of this hyperlogic was inspired by a particular set of hypercomputers that can calculate functions in the arithmetical hierarchy.
The particular hypercomputers that I was thinking of are infinite-time Turing machines, but restricted to change each output square only once during each infinite subcomputation. An infinite subcomputation is the size of the first ordinal infinity, whereas the computation as a whole may be longer (and thus have many infinite subcomputations); the number of infinite subcomputations is determined by the level on the hierarchy. (Additionally, I need to stipulate that the output squares can start out either all black or all white, depending on whether we are calculating an existential (assumed false) or a universal (assumed true).)
So, to extend the hyperlogic further, it seems as if I need a more powerful hypercomputer to use as a basis.
After the arithmetical hierarchy comes the analytical hierarchy. The shift is from 1st-order arithmetic to 2nd-order arithmetic. (It is important to distinguish this from 1st-order logic and 2nd-order logic. 1st-order logic is complete and doesn't need hypercomputers to decide its quantifiers, while 1st-order arithmetic is incomplete and does need hypercomputation.) In the arithmetical hierarchy, we are only quantifying over integers. In the analytical hierarchy, we quantify over sets of integers. The space being quantified over is infinitely larger, in a way that makes the previous methods fail.
The class of infinite-time Turing machines I was considering uses only countable infinities of time. The analytical hierarchy requires quantification over an uncountable set, meaning it needs uncountable infinities of time.
An uncountable infinity of turing-machine steps may seem strange (or worse, incoherent), but if we ignore the details it works. It doesn't even matter what order everything happens in; since all I need is existential and universal quantification, I can just say that the machine stops and returns true (/false) if an example (/counterexample) is found, and otherwise keeps going until it has checked everything (at which point it returns the opposite truth value).
I do need to specify further how "checking an individual case" works. But this is not too difficult. Quantifying over sets of numbers gives us in each case some predicate that returns either true or false for each number. If there is more than one such quantifier, we get more than one predicate. Ignoring the inner workings of the predicates, we just use them as given in what is otherwise a normal case of arithmetical hypercomputation.
Now, can we convert this to a finite process that in some sense captures the concept?
My tentative proposal is: look at a generic case of the problem, converting all set quantifiers to undefined predicates. Run the decision procedure for this arithmetical statement. Every time we need a particular value of one of the predicates, split the computation; run one version with the value of the predicate being "true" for the number it was given, and a second version with the value being "false". The result (after all versions of the arithmetical decision process terminate) will be a table with the number of dimensions determined by the number of set quantifiers, and the number of entries determined by how many versions of each quantifier ended up being needed. In any case we can attribute an approximate truth value to the statement by deciding each quantifier in order based on the examples examined.
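Here is a drastically simplified sketch of that splitting idea, bounded so that it actually terminates (my own toy version; the bound and the representation of the set quantifier are assumptions, not part of the proposal itself):

```python
# A drastically simplified sketch (my own toy version, bounded so it terminates) of the
# splitting idea above: run the arithmetical check with the set quantifier left as an
# undefined predicate, and whenever a particular value of the predicate is needed, fork
# the computation into a "true" branch and a "false" branch.

class Undefined(Exception):
    def __init__(self, n):
        self.n = n

def check(phi, bound, assignment):
    """Check 'for all n < bound, phi(n, S)' under a partial assignment for S;
    raise Undefined when an unassigned value of S is queried."""
    def S(n):
        if n not in assignment:
            raise Undefined(n)
        return assignment[n]
    return all(phi(n, S) for n in range(bound))

def exists_set(phi, bound, assignment=None):
    """Approximate 'there exists a set S such that for all n < bound, phi(n, S)'."""
    assignment = assignment or {}
    try:
        return check(phi, bound, assignment)
    except Undefined as u:
        # Split: one version where u.n is in S, one where it is not.
        return (exists_set(phi, bound, {**assignment, u.n: True}) or
                exists_set(phi, bound, {**assignment, u.n: False}))

# Example: "there is a set S such that, for every n < 6, n is in S iff n is even."
print(exists_set(lambda n, S: S(n) == (n % 2 == 0), 6))   # True
```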
Now, I am less sure about this method than I was about the arithmetical method. It seemed that improvements to the arithmetical method could never cause a serious increase in power, so that the method was within the realm of "the best we can do". However, for this method it is not clear that the machine is doing the best that can be done. In particular, I am not sure that adding direct 2nd-order theorem proving would not seriously increase the power.
Thursday, June 19, 2008
http://research.microsoft.com/~minka/statlearn/glossary/glossary.html
A long list of AI-related terminology with definitions, and lots of links to related websites.
http://tldp.org/HOWTO/AI-Alife-HOWTO.html#toc1
A really long list of AI software for Linux.
http://www.math.hawaii.edu/~dale/godel/godel.html
A nice description of Godel's theorem and other results in terms of the diagonal argument.
http://www.cs.berkeley.edu/~bhaskara/alisp/
This is Alisp. Alisp is, in a sense, a very basic AI system (since it employs simple learning methods, with little potential for recognizing complex structure). However, it is very interesting from an interface point of view. It allows the maximum possible use of the simple AI it contains. Therefore, I think the idea of using Alisp as a starting point to build up a more complicated AI is interesting (since I am into the idea of recursively building up complex capabilities from simple ones).
Wednesday, June 18, 2008
In a previous post, I mentioned Sergei Artemov. I've now read a couple of his papers there. His work is based on extending one particular modal logic, developed by Godel. Godel's logic introduces a "provability" operator. (More specifically, it takes the meaning of the modal "necessary" to be that a statement is provable.) Part of the purpose of this logic is to pin down the meaning of intuitionistic logic. To translate an intuitionistic statement into a classical statement, we add the provability operator in front of the entire formula and in front of every subformula. Take [] to be the provability operator. "P and Q" stated in intuitionistic logic translates to []([]P and []Q), which means "it is provable that both P and Q are provable." "P or Q" would translate to []([]P or []Q), "it is provable that either P is provable or Q is." And so on for larger statements. The strangely altered behavior of intuitionistic logic arises naturally from this. (It is also possible to add slightly fewer [] operators, interestingly.)
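As a toy illustration of the translation (my own ad-hoc formula representation, not Artemov's or Godel's notation), a formula can be boxed recursively like this:

```python
# A small sketch (my own ad-hoc formula representation) of the translation described above:
# put the provability operator [] in front of the whole formula and every subformula.

def box(f):
    return ('[]', f)

def godel_translate(f):
    """f is a string (atomic proposition) or a tuple (connective, subformula, ...)."""
    if isinstance(f, str):
        return box(f)                                   # an atom P becomes []P
    op, *subs = f
    return box((op, *[godel_translate(s) for s in subs]))

def show(f):
    if isinstance(f, str):
        return f
    op, *subs = f
    if op == '[]':
        return '[]' + show(subs[0])
    return '(' + (' ' + op + ' ').join(show(s) for s in subs) + ')'

print(show(godel_translate(('and', 'P', 'Q'))))         # []([]P and []Q)
print(show(godel_translate(('or', 'P', 'Q'))))          # []([]P or []Q)
```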
Artemov's main contribution is to provide an explicit provability logic: instead of just containing an operator "x is provable", we can say "y is a proof of x". This can be represented by [y]x (we just fill in the box that was left empty in Godel's original version). Godel was interested in finding such a logic, but had not provided a complete set of axioms for it. Explicit provability has some nice formal properties that implicit provability lacks, and those properties help to prove some tighter relationships between various logics.
Monday, June 16, 2008
A Basic Hyperlogic
My search for the proper logic is guided by the requirement that the desired meaning of the logic must be reflected totally in the rules of inference. This is a form of something called grounding. The idea behind grounding is, roughly, that in order to have a meaning, a symbol needs to have a proper connection to what it actually refers to. This is sometimes used in the AI community to argue that robotics is absolutely required for AI, because an AI shunting symbols around on its own inside a computer cannot really "know" anything (its symbols have no meaning). This idea is also known as embodied intelligence, or situated cognition. The form of grounding that I'm talking about is somewhat different: I want a way to ground logical and mathematical concepts. Mathematical grounding must exist, because humans are able to refer to mathematical entities, and understand mathematical meaning. My aim is to figure out how.
I've discussed the inadequacy of current mathematical logics in the past. In that post, I also made some wild speculations about creating a logic based on "limit computers" and even more powerful machines. Basically, this post is a revision of those speculations based on some solid facts.
Limit-computers, as well as more powerful imaginary machines, are called "hypercomputers". Similarly, I'll call the logics that match them "hyperlogics".
To measure the power of a hyperlogic, I'll use something called the arithmetical hierarchy. Very approximately: the first level of the hierarchy contains all computable facts. (So for example, it contains the fact that 3 + 4 = 7.) The second level contains facts either of the form "for all x, P(x)" or "there exists x such that P(x)", for any computable predicate P. Notice that we may need to check an infinite number of cases to decide the truth of these facts, but we might not need to. If the fact is a "for all" fact, then we may run into a case where P isn't true; we have a counterexample, and are done. But if we never find a counterexample, we would never finish checking the infinite number of possible values for x to conclude that P was true everywhere. Similarly for "exists": we can stop if we find a true case of P, but we would need an infinite amount of time to conclude that none exists.
The third level of the hierarchy is the same, except that it takes P to be second-level, rather than first-level. This means we need to check a possibly infinite number of second-level statements to decide a third-level statement. (But remember, we may never finish even the first of these, since checking a second-level statement can take forever.) Similarly, each level contains those statements that can be verified by looking at a (possibly) infinite number of statements on the previous level. (Nice fact: each higher level can be decided by an infinite-time Turing machine given one more infinity of time.)
Mathematical truths beyond the second level cannot be logically determined, because no logic can check them. Yet humans can talk about them. So, the goal is to specify a logic that tells us how to reason about facts on these levels. The logic cannot possibly tell us whether these facts are true or false for sure, beyond facts of the 2nd level. I'm just looking for a logic that does the best it can. If we do the best we can to try to ascertain the truth of a statement, that should ground us, right? Intuitively, the idea is that some external observer looking into our thoughts (or the inner workings of our AI) should say "Oh, this symbol here must represent mathematical entity X, because it is manipulated the way entity X should be manipulated." Ascribing that meaning to the symbol is the best way of explaining the manipulation rules attached to it.
So: for 1st-level statements, use the method of computation. For 2nd-level statements, keep checking 1st-level cases until a deciding case is found, or until we're satisfied that there is none. (This reflects the scientific method: if a hypothesis escapes our best attempts to disprove it, we conclude that it is true.) For a third-level statement, test the 2nd-level cases until a decisive case is found, or until we're satisfied there is none. (Notice: the meaning of "decisive case" is weaker here, because our conclusions concerning 2nd-level statements are not entirely perfect.) And so on. This gives us an approximate decision of the truth of any statement on any level of the arithmetical hierarchy.
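A bare-bones rendering of this procedure, with a finite checking budget standing in for "until we're satisfied," might look like this (my own sketch, not a formal definition):

```python
# A bare-bones sketch (my own rendering of the procedure described above) of approximate
# decision by bounded case-checking: search for a decisive case, and if none turns up
# within the budget, settle for the tentative answer.

def approx_forall(pred, budget):
    """Approximate 'for all x, pred(x)' by hunting for a counterexample."""
    for x in range(budget):
        if not pred(x):
            return False          # decisive: a counterexample
    return True                   # tentative: no counterexample found within the budget

def approx_exists(pred, budget):
    """Approximate 'there exists x with pred(x)' by hunting for a witness."""
    for x in range(budget):
        if pred(x):
            return True           # decisive: a witness
    return False                  # tentative: no witness found within the budget

# A 3rd-level statement, decided by nesting the approximations: "for all x, there is a y
# with y*y > x". Note that the inner, 2nd-level verdicts are themselves only tentative.
print(approx_forall(lambda x: approx_exists(lambda y: y * y > x, budget=100), budget=100))
```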
It may seem like I've left out some important details. To fully specify the decision method, I should specify how time is to be distributed between cases when checking. For example, on the third level, I could check many 2nd-level cases shallowly, or I could check a few deeply. Also, for higher-level statements, since the so-called decisive cases are not entirely decisive, it might be worthwhile to seek out more than one. If several were found, how should we represent the increase in certainty? These are important questions, but not too important. The statements in question are fundamentally undecidable, so there can't be a really right answer. Any particular technique will fail in some cases. The only thing that can definitely be said is: the more time we spend checking, the better. So I feel I am somewhat justified in leaving out these details, at least for now.
This hyperlogic is nice, but there is a reason this post is titled "a basic hyperlogic". The arithmetical hierarchy does not cover all mathematical truths. So the quest isn't over. But I think this is a good start. I've got a concrete example with a good mathematical foundation, showing that it is possible to create systems more powerful than 1st-order logic for AI.
Wednesday, June 11, 2008
Bayesian Convergence: What it Will and Won't Do
In my continuing quest for the proper prior, I happened upon a nice result: under very general conditions, the effect that the prior has on the current belief will vanish as evidence accumulates. Nice! This means that a Bayesian learner will be good regardless of the choice of prior-- the learned beliefs will fit the evidence.
Does this mean the search for the correct prior is needless?
To answer that question, I should first give the convergence result in a bit more detail. (To get it in full detail, see this paper.)
The key assumption that is made is that no model of the world has zero probability. Bayesian learning will never increase the probability of such a model, so this makes sense. The convergence result is most easily understood in the language of likelihood ratios. In this version of Bayesian learning (which gives the same end results as other versions, just by different intermediate steps), we start out with the "prior odds," rather than the prior probability, of a model. The prior odds of a model is just the probability for that model divided by the probability against it. For each new bit of evidence that comes in, we multiply the current odds by the "likelihood ratio". The likelihood ratio is the probability of the evidence given the model, divided by the probability of that evidence given its negation. (The probability of the evidence given the negation is actually a weighted average of its probability under each of the other possible models.) Now, as we observe more and more evidence, we multiply again and again to update the odds for each model we're considering. Yet the prior odds remain a constant at the beginning of that long line of multiplications. The odds of a model, then, can become arbitrarily large regardless of the prior, and likewise arbitrarily small. The evidence is what matters, not the prior.
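A tiny numerical sketch (with made-up numbers) shows why the prior washes out: it enters the product only once, as the first factor.

```python
# A tiny numerical sketch (my own made-up numbers) of odds-form updating: prior odds times
# a running product of likelihood ratios. The prior enters only once, as the first factor,
# so as evidence piles up it is swamped by the product.

def update_odds(prior_odds, likelihood_ratios):
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr                       # one multiplication per piece of evidence
    return odds

# Two observers with very different prior odds for the same model...
skeptic, believer = 0.01, 100.0
# ...see the same thirty pieces of evidence, each favoring the model 2:1.
evidence = [2.0] * 30

print(update_odds(skeptic, evidence))    # ~1e7  : overwhelmingly in favor
print(update_odds(believer, evidence))   # ~1e11 : also overwhelmingly in favor
```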
In any case, I am not about to drop my concern about which prior a rational entity should choose. The main reason for this is that the convergence result leaves open the question of the class of models to be considered, which is my primary concern. Even if this were settled, however, the convergence theorem would not convince me that ensuring nonzero probability for each model is sufficient. The reason has to do with predictions.
To make the case as extreme as possible, I'm going to ignore probabilistic models, and only consider deterministic ones. This is actually no restriction at all; any prior over probabilistic models could be seen as a fancy way of specifying a prior over totally deterministic ones. A probabilistic model gives a probability to each possible dataset, and a prior over many probabilistic models can be seen as just a weighted sum of these, giving us a new (possibly more complicated) distribution over the possible datasets. This can be used as a prior. In fact, since it gives the same overall probability for each dataset, it is for practical purposes the same prior; yet the models it considers are deterministic.
Now that we're working with completely deterministic models, the data will either fit or not fit with each model. When it doesn't fit, we throw that model out. The convergence theorem still holds, because the set of models we're considering will keep shrinking as we throw more out; whenever this happens, the probability that belonged to the discredited model will be redistributed among the still-valid ones. Thus the probability of the correct model will continue to increase (since it's never thrown out).
However, this is not much comfort! The relative probabilities of the models still in consideration will not be based on the evidence at all; they will still be based purely on the prior. (The probability from a model that gets thrown out is redistributed, but not evenly.) This means that when we make predictions, the prior is (in a loose sense) the only thing that determines our prediction.
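Here is a small toy example of that point: with deterministic models, updating just filters and renormalizes, so the survivors keep exactly the relative weights the prior gave them, and those weights alone drive the next prediction. (The models and numbers are invented for illustration.)

```python
# A small sketch (my own toy setup) of the point above: with deterministic models, updating
# just throws out the models that contradict the data and renormalizes. The survivors keep
# exactly the relative probabilities the prior gave them.

# Four deterministic "models" of a 4-bit sequence, with made-up prior probabilities.
models = {
    (0, 0, 0, 0): 0.4,
    (0, 0, 1, 1): 0.3,
    (0, 0, 1, 0): 0.2,
    (1, 1, 1, 1): 0.1,
}

def posterior(models, observed):
    """Keep only models consistent with the observed prefix, then renormalize."""
    survivors = {m: p for m, p in models.items() if m[:len(observed)] == observed}
    total = sum(survivors.values())
    return {m: p / total for m, p in survivors.items()}

print(posterior(models, (0, 0)))
# {(0,0,0,0): 0.44..., (0,0,1,1): 0.33..., (0,0,1,0): 0.22...} -- the same ratios as the
# prior, so the prediction for the next bit is driven entirely by the prior weights.
```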
In fact, if the prior assigns nonzero probability to every possible dataset, then the set of models not yet ruled out will contain all possible futures. The only thing that can narrow this down to make a useful prediction is the prior, which may or may not do so in a way dependent on the evidence so far.
Perhaps someone objects: "But then, can't we just require that a prior's predictions do depend on the evidence? Isn't it an obviously silly mistake to construct a prior that violates this?" Unfortunately, simply ruling out these cases doesn't tell us what prior to use. What kind of dependence do we want? I want a prior that can in theory "notice any sort of regularity"; but this includes noticing that the data is just completely random (predictably unpredictable).
In a way, allowing probabilistic models is a very strange move. It's very similar to allowing models that are infinitely large; in a way, a probabilistic model includes information about an infinite number of coin flips, which are used in a well-specified (deterministic) way to decide on predictions. Of course, when we specify a probabilistic model, we don't specify this infinite table of heads and tails; in fact, that's where probability theory gets its power. This is reminiscent of the idea of a "random sequence" being a more fundamental notion than "probability", as discussed in the previous post... but that's enough speculation for today.
Tuesday, June 10, 2008
In my previous post, I mentioned that my personalized "bayesian frequentist" interpretation of probability was struck down by the news that frequentism conflicts with the standard axioms of probability. The issue needs more consideration, so this post will discuss it exclusively.
To that end, I've found a very interesting article on the subject.
The information I had found previously went something like this. The simplistic frequentist position is to say that the probability of an event is the fraction of times it occurs in some set. This is problematic mainly because if we flip a coin five times and get the fractions heads=1/5 and tails=4/5, we can attribute these fractions to chance. We don't automatically believe that the "real" probability of heads is 1/5.
Revision 1 of the frequentist view changes the definition to "limiting frequency": the probability is the fraction we would get if we kept at it. The limiting frequency does not always exist. For example, consider a sequence containing As and Bs, starting with ABBAAAABBBBBBBB... Each time, the number of same letters in a row doubles. The ratio of A to B will wave back and forth forever, never settling. So by definition, probabilities only apply to sequences for which the limiting frequency exists.
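For a quick numerical check (my own code, just to watch the oscillation), the running frequency of A in this doubling-block sequence keeps swinging between roughly 2/3 and 1/3:

```python
# A small sketch (my own, just to see the oscillation numerically) of the doubling-block
# sequence described above: A, BB, AAAA, BBBBBBBB, ... The running frequency of A keeps
# swinging between roughly 2/3 (after each A block) and roughly 1/3 (after each B block).

def running_frequencies(num_blocks):
    a_count = total = 0
    freqs = []
    for k in range(num_blocks):
        block_letter_is_a = (k % 2 == 0)   # blocks alternate A, B, A, B, ...
        block_size = 2 ** k                # each block doubles in length
        if block_letter_is_a:
            a_count += block_size
        total += block_size
        freqs.append(a_count / total)
    return freqs

print(running_frequencies(12))  # alternates near 2/3, 1/3, 2/3, 1/3, ... never settling
```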
This is better, but it still isn't quite right. There are two problems. First, a coin could land on heads every even flip and tails every odd flip. The limiting fraction for both sides would be 1/2, but this is obviously not a random sequence. So the requirement that there is a limiting frequency is not enough to guarantee that the sequence is probabilistic. Second, it is possible to re-order an infinite sequence to make the limiting frequencies different. With the same alternating heads/tails sequence, we could reorder as follows: group heads together in pairs by moving them backwards, but keep tails isolated. This makes the limiting frequency of heads 2/3. It seems odd that the probability would change when we're just counting in a different order.
To fix this, von Mises defined something called a "collective". Before reading the paper, I knew that a collective had the additional property that any subsequence chosen without knowledge of where the heads were and where the tails were would have the same limiting frequencies. I had also read that the resulting theory was inconsistent with the standard axioms of probability. I wondered: if this definition is inconsistent with the standard axiomatization, what sort of alternative probability theory does it yield?
What the paper immediately revealed was that the collective-based definition of probability was a competitor to the now-standard axiomatization, Kolmogorov's axiomatization. It is not surprising, then, that the two are inconsistent with each other. Where the two differ, von Mises preferred the collective-based account. It is not hard to see why; his account is grounded in a mathematical concept of a random sequence, while Kolmogorov's axioms simply tell us how probabilities must be calculated, without any explanation of what a probability means.
Mainly, the notion of a collective is weaker; it does not allow us to prove as much. For example, it is quite possible for a coin being flipped to approach the correct frequency "from above": with only a finite number of exceptions, the ratio of heads at any finite time can be greater than 1/2, even though it gets to 1/2 as we keep going. Perhaps a stronger notion of random sequence is needed, then. But I do think the von Mises approach of defining random sequences first, and then probabilities, seems like a better foundation. I wonder: is there any definition of randomness from which the Kolmogorov axioms automatically follow?
Saturday, June 07, 2008
My personalized "bayesian frequentist" interpretation of probability was struck down by the news that frequentism conflicts with the standard axioms of probability. In the wake of this disaster, I am forced to develop some new ideas about how to interpret probabilities.
I've been thinking of using modal logic to support probability theory. I looked around the internet somewhat, and it seems like there is some action in the opposite direction (using probability theory as a foundation for modal logic), but not so much in the direction I want to go. (I assume of course that my direction isn't totally unique... it seems natural enough, so someone's probably tried it. But anyway it is in the minority.)
Modal logic is essentially the logic of possibility and necessity. Because these words are ambiguous, there are many possible modal logics. We have a number of worlds, some of which can "see" each other. Different facts may be true or false in each world. In a given world, a fact is "necessary" if it is true in all worlds that world can see. It is "possible" if it is true in at least one of those worlds. Modal logic lets us assign definite meaning to larger stacks such as "necessarily necessary" (it is true in all worlds that the worlds we can see are able to see), "possibly necessary" (in some world we can see, it is true in all worlds that world can see), and "possibly possibly possible" (in some world we can see, in some world it can see, in some world it can see, the fact holds).
Perhaps the idea of "worlds we can see" seems ambiguous. My choice of wording is essentially arbitrary; a common choice is "accessibility". As with necessity and possibility, the exact meaning depends on the modal logic we're using. For example, we might want to talk about immediate-future necessity and possibility. The worlds we can see are the possible next moments. "Possibly possible" refers to possibility two moments ahead, "possibly possibly possible" is three moments ahead, and so on. Another modal logic corresponds to possibility and necessity in the entire future. We can "see" any possible world we can eventually reach. Additionally, we might also say that we can see the present moment. (We could choose not to do this, but it is convenient.) In this modal logic, "possibly possible" amounts to the same thing as "possible"; "necessarily necessary" amounts to "necessary". However, "necessarily possible" does not collapse (since it means that a fact remains possible no matter which path we go down in the future), and neither does "possibly necessary" (which means we can take some path to make a fact necessary). Which strings of modal operators collapse in a given modal logic is one of the basic questions to ask.
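A tiny toy Kripke model (my own, purely for illustration) shows how "necessary" and "possible" get evaluated against the seeability relation, and how stacked operators just nest the checks:

```python
# A tiny toy Kripke model (my own, not from the post). "Necessary" means true in every
# world we can see; "possible" means true in at least one.

sees = {"w0": {"w1", "w2"}, "w1": {"w2"}, "w2": {"w2"}}   # the "seeability" relation
facts = {"w0": set(), "w1": {"p"}, "w2": {"p"}}           # which atomic facts hold where

def holds(fact, world):
    return fact in facts[world]

def necessary(fact, world):
    return all(holds(fact, w) for w in sees[world])

def possible(fact, world):
    return any(holds(fact, w) for w in sees[world])

print(necessary("p", "w0"))                          # True: p holds in both w1 and w2
print(possible("p", "w0"))                           # True
# Stacked operators just nest the checks, e.g. "necessarily necessary p" at w0:
print(all(necessary("p", w) for w in sees["w0"]))    # True
```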
Other modal logics might represent possibility given current knowledge (so we can only access worlds not ruled out by what we already know), moral necessity and possibility ("must" vs "may"), and so on. Each has a different "seeability" relationship between worlds, which dictates a different set of logical rules governing necessity and possibility.
Just to give an example of using probability theory as a foundation for modal logic, we might equate "possible" with "has probability greater than 0," and "necessary" with "has probability 1". I doubt this simple approach is very interesting, but I know very little about this. I'm more interested in the opposite possibility.
The idea would be something like this: we attach a weight to each world, and at any particular world, we talk about the probability of an event in the worlds we can see. The probability of a fact is the sum of the weights of each seeable world in which it is true, divided by the sum of the weight of all seeable worlds.
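A minimal sketch of that proposal, with invented worlds and weights, would be:

```python
# A minimal sketch of the proposal above, with invented worlds and weights: the probability
# of a fact at a world is the weight of the seeable worlds where it holds, divided by the
# total weight of all seeable worlds.

weight = {"w1": 1.0, "w2": 2.0, "w3": 1.0}
sees = {"w0": {"w1", "w2", "w3"}}
facts = {"w1": {"heads"}, "w2": {"heads"}, "w3": set()}

def probability(fact, world):
    seeable = sees[world]
    total = sum(weight[w] for w in seeable)
    favorable = sum(weight[w] for w in seeable if fact in facts[w])
    return favorable / total

print(probability("heads", "w0"))   # (1.0 + 2.0) / 4.0 = 0.75
```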
The idea here is that probability does not make sense without a context of possible worlds. Probability within a single world doesn't make sense; each thing is either true or false (or perhaps undefined or meaningless). Furthermore, there is no privileged scope of possibilities; we might at one moment talk about probabilities assuming the context of physical possibility (in which case a coin flip is nearly deterministic, and we just don't have enough information to calculate the result), or epistemic possibility (in which case it is just fine to assign a probability to the coin flip without sufficient knowledge of the physical conditions).
One advantage of this approach is that we get an automatic meaning for the troublesome notion of a "probability of a probability". We need this concept, for example, if we want to use Bayesian learning to learn the bias of a possibly-biased coin. We assign some probability to each possible ratio of heads to tails, and then start flipping the coin and updating our belief based on the outcomes. This particular example is a bit complicated to treat with the framework I've outlined (which is perhaps a point against it): we need to mix modal logics, taking one probability to be based in worlds in which the coin has different properties, and the other to only range over changes in the surrounding conditions of the coin flip. The probability-of-a-probability, then, is the sum over seeable worlds in which, in their seeable worlds, the sum of an event divided by the total sum equals a particular ratio. In the case of the coin, the two instances of "seeable" have different definitions: we jump from the current world to worlds in which the coin has different properties, but from them we jump to worlds in which the flip has different starting conditions. (Note: we know neither the coin's actual properties nor the actual exact starting conditions, so for us it is not much of a "jump". But sometimes we reason with possible worlds which are quite distinctly not our own.)
Perhaps this is not especially intuitive. One problem I can think of is the lack of justification for picking particular ranges of worlds to reason in. But it is an interesting idea.
Friday, June 06, 2008
The paper also argues against abandoning what it calls the "principle of addition". The principle is simple: given non-overlapping states, A and B, the probability of A or B should be the probability of A plus the probability of B. (Formally: if P(A and B) = 0, then P(A or B) = P(A) + P(B). The paper gives a slightly different version.) This may not sound very worrisome, but it actually causes severe problems for the frequency interpretation of probabilities. (I was very surprised to learn this. The idea that probabilities are frequencies is logically incompatible with the standard rules of probability theory! I have not yet found a detailed explanation of the incompatibility; it is often mentioned, but rarely explained...) So although a large class of alternative probability theories are covered by the definition (as many as there are alternative logics), the definition is very conservative in some ways.
I do not know much about what happens when we drop the principle of addition (also commonly called "finite additivity"). But in some sense, fuzzy logics are among this field of possible notions of "probability". This idea (together with the fact that frequencies are actually incompatible with normal probability) makes me slightly less antagonistic towards the fuzzy logic approach... but not much. I was previously of the mindset "Probabilities are the mathematically correct way to go, and anything else is just silly." Now I am of the mindset "Probabilities have a fair mathematical motivation, but there is room for improvement." So another theory may be workable, but I'm asking for a very strong foundation before I'll stop clinging to standard probability (or, if I finally settle on an alternative logic, some conservative generalization of probability such as the one above).
In other news, I've found some very interesting material concerning intuitionistic logic. Sergei N. Artemov has done some amazing-looking work. In particular, on this page, he explicitly claims to have a logic that "circumvents the Incompleteness Theorem". I don't yet know exactly what he means, but the approach is described in this paper, which I will read ASAP. (Actually, I'm reading this one first, because it is also very interesting...)
I have also been running into some amazing math blogs lately. Good Math has a nice introduction to lambda calculus and intuitionistic logic (as well as plenty of other introductions to important areas). (Note: the blog moved from Blogspot to ScienceBlogs, but the old posts I'm referring to are the ones on Blogspot.) Also, Mathematics and Computation looks like a good read. There is a very interesting-looking post on seemingly impossible functional programs, which gives an algorithm for exhaustively searching through the space of infinite sequences of 1s and 0s. Unfortunately, I did not understand it very well on my first read, because the algorithms are written in Haskell, which I am unfamiliar with... I found this blog through Machine Learning (Theory), another good blog.