## Monday, February 09, 2009

Compression

As promised in the previous post, I'll describe what I'm currently working on.

Really, this is a continuation of this post. The ideas I outlined there have the same basic shape, but have evolved a good deal.

The basic idea: Any Turing-complete representation can be "run backwards" to generate the search space of all (and only) the programs that generate the desired piece of data. This approach promises to be much more effective then a genetic programming solution (for the special case to which it can be applied). A possible immediate application of this is data compression, which is formally equivalent to looking for short programs that output a desired piece of data (or, it is if we include the decompression program when measuring the size of the file, as is often required in compression contests to prevent cheating).

Realizing that compression was the application area I should be focusing on helped simplify my design a great deal. There may be a theoretical connection between compression and prediction, but converting between the two is not computationally simple. Converting compression to prediction can be done in two ways (so far as I can see):

-generate futures and then compress them to see how probable they are (shorter = more probable)
-generate (short) compressed representations of the future and decompress them to see what they actually mean

The first method can use the "running backwards" trick, trimming the search space. The second cannot, so it may generate meaningless futures (compressed representations that do not actually decompress to anything). However, the second has the advantage of starting with short (probable) representations; the first will go through many improbable possibilities. I can think of many little tricks one could try to improve and/or combine the two methods, but my point is, it is a hard problem. I was attempting to tackle this problem for most of last semester. Eventually, I may try again. But for now, compression is a good testbed for the basic algorithm I'm working on.

What about the other direction: can we convert from prediction to compression in an effective manner? It turns out that the answer is yes. There is an awesome algorithm called arithmetic coding that does just that. This is the basis of the PAQ compression algorithm, the current leader in the Hutter Prize and Calgary Challenge (both linked to earlier).

However, this strategy inherently takes the same amount of time for decompression as it takes for compression. PAQ and similar methods need to form predictions as they decompress in the same way that they form predictions as they compress. This is what takes the majority of the time. More "direct" methods, like the one I am working on, take far less time for decompression: compression is a process of searching for the shortest representation of a file, while decompression simply involves interpreting that short representation.

The Hutter prize places a time limit on decompression, but not on compression; so in that way, my method has the upper hand. In theory I could let my algorithm search for months, and the resulting file might decompress in minutes. (A rough estimate would be that the decompress time is the logarithm of the compress time.)

By the way, these compression contests are far from practical in nature. Real applications of compression prefer compression methods that work fairly fast for both compression and decompression. The Hutter prize is designed to increase interest in the theoretical connection between prediction and compression, as it relates to artificial intelligence. Thus, it has a very loose time retriction, allowing for "smart" compression methods such as PAQ that take a long hard look at the data.

But, enough about contests. Time for some more details concerning the algorithm itself.

The major design decision is the choice of Turing-complete representation. The algorithm actually depends critically on this. Representations vary greatly in their properties; specifically, how well a heuristic can guide the serch. The major heuristic to consider is simply size: if a move decreases size immediately, it is somewhat more probable that it decreases size in the long run, too.

The "classic" choice would be to just go with Turing machines. Heuristics on pure turing machines, however, happen to be terrible. There are a very small number of options for backwards-moves, and usually none of them immediately changethe size of the current representation. We are running blind!

For most of last semester, I was working on a version that would use lambda calculus as the representation. Lambda calculus is much better; in particular, it allows us to immediately take advantage of any repetitions that we recognize. Noticing repetitions is the most direct way to apply the shortness heuristic; it is the method used by sequitur, one of the algorithms inspiring my current direction.

Lambda calculus can compress repetitions using "abstraction". However, an abstraction needstoactually be put somewhere when it is made. There is no convenient heuristic for where to put it (at least, not that I've thought of). For example, if I see a string "abcabcabc", I could either abstract at the global level, which would allow me to compress any other occurrence of "abc" that occurs into the same abstraction, or at the local level, which might allow me to compress other local groupings of three repetitions. Also, I could place the abstraction anywhere inbetween, if I thought some of the context was important. "Anywhere inbetween" is a rather large search space. This and other issues led me away from lambda calculus as the representation of choice.

Generative grammars
are another option. These appeal to me, partly because they are the formalism chosen by sequitur. Like lambda calculus, they offer a similar way of abbreviating repetitions. With grammars, the choice is what to rewrite them to, rather than where to place their abstraction. Sequitur can be seen as the special case of always rewriting repetitions to a single symbol, which represents the sequence that had repeated. More sophisticated types of patterns can be represented by allowing more complicated rewrites. Heuristically, simpler rewrites can be preferred.

Unfortunately, there is a serious problem with this representation. Unlike lambda expressions, grammars are not necessarily confluent, which means that when we use the grammar to generate data we may get a different file then intended. It is possible to specify a convention telling us how to apply the grammar to get a file (for example, "always expand the first applicable rule going left-right, deciding ties by using the shorter of the rules"). However, a restriction on forward-execution must be mirrored by a corresponding restriction on backwards-execution, which means the search during the compression phase is more constrained. In many cases it is good news when we can constrain a search more; but in this case, it hurts the ability of the heuristic to guide us. (It seems there is a fairly direct connection here: restricting the branching factor leads to a worse heuristic unless we're very careful.)

I also spent some time creating a set of restrictions that ensure confluence, but although these systems salvaged much of the situation, they were not as good as the option I finally settled on: combinators.

Combinators differ from lambda expressions in that functions are given names and have external definitions, whereas in lambda expressions the function definitions are in-place via abstraction. Minimally, one can use just two combinators, S and K, for a Turing-complete formalism. (If you want to know the definitions of S and K, look them up.) However. for my purposes, it makes sense to invent new combinators as needed.

The idea behind a heuristic for combinators is to not only look for exact repetition, but also repetition of pieces with holes. Not only can we notice multiple instances of "abc", but also "a_c". Repetitions with holes suggest combinators that take hole-fillers as arguments.

Turing-completeness is assured by the fact that the system could invent S and K, although it probably won't. (For one thing, S and K are not the only set of operators that allow expressive completeness. For another, they are too "abstract" to be suggested directly by data... the heuristic will not give them a high score. Although, they are short, so their score should not be too low...)

That, then, is the basic idea. There are many directions for expansion, which might help the algorithm (and might not).

-More complicated functions
--functions that take sequences as arguments, not just single terms (this is technically possible without expanding the formalism, via a trivial combinator A (x,y) -> xy; however, some work could be done to getthe heuristic to look for this sort of thing, essentially by keeping an eye out for repeated patterns with variable-sized holes rather than single-slot holes)
--probabilistic functions (this may or may not provide a compression advantage, since we've got to keep track of the extra info to make things deterministic anyway, but even if it creates no advantage it might make things more human-readable)
--functions with more complicated definitions, such as if-then statements

-more sophisticated heuristic & more sophisticated search
--best-first vs semi-random
--learning
---memoize "reduced forms" of commonly-encountered structures?
---learn which heurustic info to tabulate?
---learn sophisticated search strategies?