Monday, January 18, 2010

Moving 



I've been considering starting over with this blog thing for a while, so I thought I'd put up a post warning the few people who follow this blog. My reason for the move is that I feel much of what I have said on this blog is, well, foolish given what I've learned. At the same time, my current beliefs are not that far off what I have said, making it hard to correct what I've said without long explanations. In addition to this, there is a great deal that I have not written, which I should have; and what I have written, I have not organized in any fashion.

So, I feel that it is best to start from the top and explain my beliefs, goals, intuitions, and so on.




I'll be starting another Google blog, The Logic of Thought. I realise the name may sound a bit pretentious-- I do not claim to have the answer. Still, it seems like a fair label for the question I am asking.


Edit-- I changed the name to "In Search of a Logic of Thought."
Edit again-- Now it's "In Search of Logic." Ok, last name change I promise. :)

Friday, August 21, 2009

Climbing the Mountain

Undefinability is a harsh reality.

Any solution one offers is still vulnerable to the same attack: full self-reference seems impossible. Regardless of how clever you think you've been, if you propose a solution, I can point to a structure that can't be defined within the proposed system: the system itself. If the system can define itself, then it is inconsistent. (And even then, even if we allow inconsistent systems, I know of no system which could be said to fully describe itself.)

What this all suggests is that human beings (along with a broad range of sentient entities) are fundamentally incapable of articulating our own inner logic. Why? For the same reason that a powerset is larger than the set it is built from: if we could fully refer to ourselves, our logic would be "larger than itself" (loosely speaking).

It comes down to a problem of wanting to be able to form sentences that mean anything we might want. If we're fully capable, then we can construct a sentence that is true precisely when it is not true... and the system falls apart.

Kripke's fixed point theory offers a nice fix: with some specific machinery, we're able to call these sentences "undefined". But now we can't refer properly to this idea, "undefined". So we've got a complete theory of truth (one might say), but we're still stuck with regard to undefinability.

So, it looks like we've got to accept it: we can't find a mind (artificial or otherwise) that fulfills the imperative "Know Thyself". The self remains shrouded in mystery. What's a self-respecting AI engineer to do? Am I forced to always design minds with less power of logical reference than my own, because I could not comprehend a design that was truly up to human capability? Are artificial intelligences doomed to be fancy calculators, lacking "understanding" because they will always have a weaker logical structure?

First, no. That doesn't really follow. It's quite possible to use an algorithm that can in principle learn anything: evolution. For example, one could build an artificial mind that held an initially simple program within it, mutated the recently run areas of code when punishment occurred, and strengthened recently run code against mutation when rewarded. Or, a little more sophisticated, one could implement Schmidhuber's Success Story algorithm, which always and only keeps apparently beneficial mutations, is capable of learning what and when to mutate, can learn to delay reward, and has other desirable features. And yet again, we could try William Pearson's design, which sets up an artificial economy of agents which can co-evolve to produce the desired behavior. With approaches in this style, there is no worry of fundamental limitation: such systems can learn the correct logic if it exists (it just might take quite a while!). The worry, rather, is that these approaches do not take full advantage of the data at hand. There is no guarantee that they will perform better given more processing power and memory, either. In short, they are not a model of rationality.
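To make the first of these concrete, here is a minimal sketch in Python. Every detail in it is invented for illustration (the lookup-table policy, the mutation and protection rates); it is not the mechanism of any of the systems named above, just the "mutate recently run code when punished, protect it when rewarded" idea in its simplest possible form.

import random

class MutatingPolicy:
    """Toy policy: a table from observations to actions.

    Recently used entries are mutated on punishment and made
    harder to mutate (protected) on reward.
    """

    def __init__(self, observations, actions):
        self.actions = actions
        self.table = {o: random.choice(actions) for o in observations}
        self.recent = []                              # entries used since the last feedback
        self.protection = {o: 0.0 for o in observations}

    def act(self, observation):
        self.recent.append(observation)
        return self.table[observation]

    def feedback(self, reward):
        for o in self.recent:
            if reward < 0 and random.random() > self.protection[o]:
                # punished: mutate the recently used rule
                self.table[o] = random.choice(self.actions)
            elif reward > 0:
                # rewarded: make the recently used rule harder to mutate
                self.protection[o] = min(0.95, self.protection[o] + 0.1)
        self.recent = []

# Example: learn to echo a binary observation.
policy = MutatingPolicy(observations=[0, 1], actions=[0, 1])
for _ in range(500):
    obs = random.choice([0, 1])
    action = policy.act(obs)
    policy.feedback(1 if action == obs else -1)
print(policy.table)   # typically converges to {0: 0, 1: 1}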

This could be taken as an indication that studying logic and rationality is not directly relevant to AI, but I would not agree with such an argument. For one thing, it is possible to derive a model of rationality from such approaches. If they work, there is a reason. The techniques each essentially provide some way of evaluating how a particular program of behavior is doing, together with a technique of searching through the possible behaviors. One could consider the space of all possible programs that might have generated the behavior so far, rather than the single program that actually did. One then takes the best program from that space, or perhaps a weighted vote. Obviously there will be some details to fill in (which is to say that such models of rationality don't just follow directly from the evolutionary algorithms employed), but the general approach is clear... such a system would take an infinite amount of processing power to compute, so one would need to use approximations; the more computing power given, the closer the approximation could be. All the data at hand is now being used, because the system now has the ability to go back and re-think details of the past, asking if particular sensory patterns might have been clues warning of a punishment, et cetera.
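As a toy illustration of that last idea (and only a toy: the "space of all programs" here is four hand-written policies, and the scoring rule is just one arbitrary way of filling in the details I said still need filling in):

# Toy version of "consider every program that might have produced the
# behavior, score it, and let the programs vote". A candidate is credited
# with the reward whenever it would have taken the action actually taken.

CANDIDATES = {
    "always_0": lambda obs: 0,
    "always_1": lambda obs: 1,
    "echo":     lambda obs: obs,
    "negate":   lambda obs: 1 - obs,
}

def score(program, history):
    """history: a list of (observation, action_taken, reward) triples."""
    return sum(r for o, a, r in history if program(o) == a)

def weighted_vote(history, observation):
    """Pick the next action by a score-weighted vote of the candidates."""
    totals = {}
    for name, prog in CANDIDATES.items():
        weight = max(score(prog, history), 0)   # drop negative-scoring programs
        action = prog(observation)
        totals[action] = totals.get(action, 0) + weight
    return max(totals, key=totals.get)

history = [(0, 0, 1), (1, 1, 1), (1, 0, -1)]
print(weighted_vote(history, observation=1))    # -> 1 (the "echo" behavior wins)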

So why not accept such models of rationality? I have two reasons... first, they are purely reinforcement-learning-based. Agents based on these models can be driven only by pleasure and pain. There is no ability to consider external, unobserved objects; everything consists of patterns of directly observed sensation. Second, even if one is OK with purely reward-based systems, it is not clear that these are optimal. The evaluation criteria for the programs are not totally clear. There needs to be some assumption that punishment and reward are associated with recently taken actions, and recently executed code, but it cannot be too strong... The Success Story approach looks at things in terms of modifying a basic policy, and a modification is held responsible for all reward and punishment after the point at which it is made. The simple mutation-based scheme I described instead would use some decaying recent-use marker to decide responsibility. William Pearson suggests dividing time up into large sections, and splitting up the total goodness of the section as the payment for the programs that were in charge for that time. Each of these will result in a different model of rationality.

So, I desire an approach which contains explicit talk of an outside world, so that one can state goals in such language, and furthermore can apply utility theory to evaluate actions toward those goals in an optimal way. But, that takes me back to the problem: which explicit logic do I use? Am I doomed to only understand logics less powerful than my own internal logic, and hence, to create AI systems limited by such logics?

One way out which I'm currently thinking about is this: a system may be initially self-ignorant, but may learn more about itself over time. This idea came from the thought that if I was shown the correct logic, I could refer to its truth predicate as an external thing, and so appear to have greater logical power than it, without really causing a problem. Furthermore, it seems I could learn about it over time, perhaps eventually gaining more referential power.

In understanding one's own logic, one becomes more powerful, and again does not understand one's own logic. The "correct logic", then, could be imagined to be the (unreachable) result of an infinite amount of self-study. But can we properly refer to such a limit? If so, it seems we've got an interesting situation on our hands, since we'd be able to talk about the truth predicate of a language more referentially powerful than any other... Does the limit in fact exist?

I need to formalize this idea to evaluate it further.

Thursday, June 18, 2009

The Importance of Uncomputable Models

Inspired by this blog.

I have mentioned most of these thoughts before, but I am writing them up in one post as a cohesive argument. I will argue that uncomputable models are important. I am not arguing that people think in ways that computers cannot, but rather that people and computers alike can benefit from using models that have "holes" which represent meaningful questions that can't be answered definitively by running a computation.

The world that we live in consists of a bunch of macro-objects. What are macro-objects? Roughly speaking, they are coherent structures of micro-objects, which recur in both time and space.

A macro-object "emerges" from the micro-dynamics. It is a structure which persists, propagating itself into the future, like a wave in water.

A spherical bubble is a macro-object, because slight warps in the sphere get evened out over time. (Until it pops.)

Similarly, a water droplet, a planet, and a star.

An atom is an emergent object (though not macro-level) because the positive and negative charges of the electrons and protons enter into a stable relationship.

A grasshopper is a stable entity because its metabolism maintains a homeostasis which allows it to live for a fair amount of time.

Similarly for all living organisms.

And, so on.

The key idea here is that all of these objects are in a convergent state: a state that (within limits) the object always returns to after wandering away from it.

Now, to logical foundations. The halting problem is unsolvable; there is no computable algorithm which can tell you which computations do and do not halt. But, suppose we could run our computer an infinite amount of time to get the answer. A machine that can do this is called a hypercomputer. Then, we could solve the halting problem by waiting to see if the computation in question halted; so, we can use hypercomputations to answer many questions that regular computations cannot answer. However, we've got a new type of problem. Some computations will flip back and forth between answers infinitely often, so that when we run them an infinite amount of time, the output is indeterminate. The result of the hypercomputation is then undefined, or "nonconvergent".
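A toy illustration in Python (we can only ever print finite prefixes, of course; the hypercomputer is imagined to read the value off "at the end" of infinitely many steps):

# One computation's output stabilizes, so "run it forever and read off the
# answer" is well defined; the other flips forever, so the hypercomputation
# has no defined result.

def stabilising(steps):
    """Output settles to 1 once the step counter passes 10."""
    return [0 if n < 10 else 1 for n in range(steps)]

def oscillating(steps):
    """Output flips between 0 and 1 forever; the limit does not exist."""
    return [n % 2 for n in range(steps)]

print(stabilising(20))   # ...0, 0, 1, 1, 1, 1 -- converges to 1
print(oscillating(20))   # 0, 1, 0, 1, ...     -- never converges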

Asking whether a hypercomputation converges is analogous to asking whether a normal computation halts. In a specific sense, it is twice as hard: if we solved the halting problem for normal computations, made a little magic box that can give the correct answer for halting questions, and connected it to an ordinary computer, then we would have a machine equivalent to a hypercomputer. Asking whether the programs of the new machine halt is actually equivalent to asking if hypercomputations converge.

So, halting is uncomputable, but convergence is doubly so!

Yet, I have argued that convergence is all around us. On an everyday basis, we deal with convergent states as if they were solid entities. So, I argue, we are viewing the world through an uncomputable model.

Mathematically, one should expect reasoning about convergence to be quite hard (assuming, as I do, that human reasoning is computable). Everyday reasoning is not "quite hard" in this way. We mitigate the full force of the uncomputability of our models with many coping strategies; we mainly reason under the assumption of convergence (for structures that have converged in the past), rather than attempting to question this assumption. We have to learn when things converge and fail to converge. Yet, even so, using uncomputable models is easier than trying to apply computable models to the problem. Asking whether a structure is generally convergent is a very useful abbreviation, approximately summing up a lot of questions about the state at a given time.

Also, it is important to admit that the mathematically pure concept of convergence is not quite what we are interested in. In practical situations, we are interested in whether something is quickly convergent. This is not uncomputable; however, it can be more expensive to check than reasoning abstractly about convergence. So (and this is probably the weakest point in my argument) I think it is worthwhile to keep reasoning about the uncomputable models.

Another interesting point is that, much of the time, while we have a fairly good concept of the emergent convergent objects we deal with day to day, we do not have such a good idea of the underlying dynamic. This means that, in practice, we do not ask too many convergence-hard questions. Often, we think we already have those answers, and instead we ask what sort of underlying structure might give rise to them.

PS--

I am being somewhat messy here, because "convergent" in the case of hypercomputation does not mean quite the same thing as "convergent" in the case of emergent objects. For one thing, convergence of entities has to do with correcting for disturbances from the external environment, while convergence for hypercomputations does not. I think this difference does not harm the argument. As I see it, emergent-convergence is a more general problem, having hypercomputation-convergence as a subproblem.

Sunday, June 14, 2009

What They Don't Tell You About Type Checking

Typed lambda calculus
is not Turing-complete.

There. I said it.

More specifically, simply typed lambda calculus is not Turing-complete, and neither are any variants that are both strongly normalizing and have decidable type-checking. This is because programs that the type-checker verifies are guaranteed to compute a result. If such a type-checker allowed a Turing-complete set of programs, it would be a solution to the halting problem!
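A standard example makes this concrete: the self-application term behind non-termination cannot be given a simple type. To type $\lambda x.\, x\,x$, the variable $x$ would need a type $\tau$ that is simultaneously a function type taking $\tau$ as input, i.e. a solution to
$$\tau = \tau \to \sigma,$$
and no finite simple type satisfies that equation. So $\Omega = (\lambda x.\, x\,x)(\lambda x.\, x\,x)$, the classic non-halting term, is rejected by the type-checker, which is exactly what the termination guarantee demands.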

Really, I should have put two and two together earlier on this one. I suppose this is what comes of picking lambda calculus up by reading diverse things in diverse places rather than learning it from one authoritative source.

What this indicates for me is that, at least in many cases, the point of allowing more and more sophisticated type systems is to get closer to a Turing-complete system. That is why people add things like parametric polymorphism, dependent types, and kinds. When we add these to typed lambda calculus, it doesn't just get "more expressive" in the sense that a high-level programming language is more expressive than machine code; it is literally able to do things that a simpler type system could not.

This doesn't mean that strongly typed programming languages are not Turing-complete. Typically the type-checkers for these will not guarantee that the program contains no infinite loops. So, one must be careful to figure out exactly what one is dealing with.

Wednesday, June 10, 2009

Sets and Truth

In this post, I will explain a way to create a theory of sets, given a theory of truth. These two foundational issues are typically treated as separate matters, so asking for a particular relationship to hold between a theory of truth and a set theory adds some interesting constraints to the situation.

(Part of this post is an edited version of an email I sent originally to Randall Holmes.)

Claim: A set is essentially a sentence with a hole in it. Set membership is characterized by the truth/falsehood of the sentences when we fill the holes.

The major justification for this way of understanding sets is the way we use the comprehension principle to talk about sets. The comprehension principle is what allows us to say, in a set theory, that if we can describe the membership requirements for a set then that set exists. For example, I can describe the requirement "it must be a prime number that is over 5 digits long", so the set of prime numbers over five digits long exists. (The comprehension principle with no restrictions leads to naive set theory, however, which is paradoxical.)

This view is not far from the way Frege explained sets, as I understand it. However, he identified the set with the extension of the sentence-with-hole, meaning the things for which it is true.

So, suppose we've got a logic with enough machinery to represent computable functions, and furthermore a description of the language within itself (ie, Godel-numbering-or-something-equivalent). Suppose also that we've got some theory of truth. Then, via the claim, we can already talk about sets, even though they haven't been purposefully introduced. In particular, "x is an element of y" is interpreted as:

"When "x" is used to fill in the partial sentence Y, the result is true"

where "x" is the name of the term x (Godel-numbering-or-equivalent, again, for those who are familiar with such things), and Y is the sentence-with-hole corresponding to the set y.

The truth predicate is needed here in order to assert the result of the substitution. With the naive theory of truth, it is always meaningful to apply the truth predicate to a sentence. So, the naive theory of truth gives us the naive theory of sets, in which set-membership is meaningful for any set we can describe. Of course, this is inconsistent under classical logic.

So, what I'm saying is: if the claim is accepted, then the set theory is pinned down completely by the theory of truth. The naive theory of truth gives us the naive theory of sets. A Tarski-hierarchy theory of truth gives us something vaguely resembling type theory. Kripke's theory of truth gives us a theory in which all sets exist, but not all membership evaluations are meaningful. In particular, Russell's set "all sets that do not contain themselves" exists. We can meaningfully say that any set of integers is in Russell's set, and that the set of all sets (which exists) is not. The paradoxical situation, in which we ask if Russell's set is a member of itself, is simply meaningless.

So far so good. But, there is the issue of extensionality to deal with. The axiom of extensionality is taken as a very basic fact of set theory, one that not even nonstandard set theories consider rejecting. Given the above discussion, however, the axiom of extensionality would be false. Two different sentences-with-holes can be logically equivalent, and so have the same extension. For example, "_ is an even prime number" and "_ added to itself equals 4" are the same set, but they are different sentences.

My solution here is to interpret the notion of set-equality as being a notion of logical equivalence between sentences-with-holes, rather than one of syntactic equivalence. In other words, "x=y" for two sets x and y needs to be interpreted as saying that X and Y mutually imply each other given any slot-filler, rather than just as saying X=Y. But this is doable within the language, since all we need to do is quantify over the slot-filler-names.
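In the same made-up notation as the membership clause above, the sketch is:
$$x = y \;:\Leftrightarrow\; \forall n\, \big(\mathrm{True}(\mathrm{sub}(X, n)) \leftrightarrow \mathrm{True}(\mathrm{sub}(Y, n))\big),$$
with the quantifier ranging over slot-filler names.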

This can be thought of as my way of interpreting Frege's concept of the "extension" of a sentence-with-hole. Rather than being a separate entity, the extension is a "view" of the sentence: the sentence up-to-equivalence.

Monday, April 27, 2009

Thoughts on Guiding Inference

One of the big rifts between the way things are described in the theoretical foundations of mathematics and the way things are actually done in real mathematics is that real mathematicians are constantly defining new objects to study. Perhaps someone sees a particular combination of properties pop up a few times, so they give that combination a name and start studying it directly. Perhaps someone decides a particular restriction of a given problem will be easier to solve, and so gives that a name and starts studying it. Perhaps someone wants to know how many X need to be combined to make Y, and they call the number the "X number" of Y, and start studying the properties of it as X and Y are varied. And so on.

In fact, even when studying the foundations of mathematics, such names are often snuck in as "unofficial notation" that is thought of as abbreviating the official notation. Unofficial notation makes axioms and other formulae easier to read.

Why do we do this? I can think of two explanations. First, one can explain this in the same way "named entities" are explained in other domains: it is compressive to name common entities. This means it takes up less room in memory, and also as a side benefit it usually makes the world easier to predict. This is an important explanation, but not the focus of this post. The second explanation is that naming things is a means of inference control.

When we name things, what we proceed to do is to determine some basic properties of the new entities. (This seems true in non-mathematical discourse as well, but perhaps not as clear-cut.) We come up with properties of the new entities, and look for ways of calculating those properties. We come up with theorems that hold for the entities. We look for existence and uniqueness proofs. And so on. What I'm arguing is that all of this activity is basically focused on helping later inference.

One particular phenomenon serves as a good illustration of this: the behavior of mathematicians when studying sequences of numbers that emerge from particular mathematical entities. The prime numbers; the number of groups of each finite size; the number of graphs of a given size; the Fibonacci sequence. The big thing to try to do with a sequence is to "find the pattern". What could this mean? If the sequence is mathematically well-defined, don't we know the pattern already? A mathematician will find the first few values of a sequence, and then study them looking for relationships. Often what is sought is a recursive definition of the sequence: a function that calculates the next number based on the directly previous number, or several of the previous numbers, or perhaps all of the previous numbers. If we've already got a recursive form of the sequence, we might try to find a "closed form" version. My argument here is that all of this behavior is explained by the desire to make later reasoning as expedient as possible. Finding the recursive form is not merely a sort of curiosity, like wondering if it is possible to keep a spider alive in a paper cup; rather, the recursive form is a faster way of calculating the sequence than the method that follows directly from the definition. Similarly, the closed form will usually be even faster.
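The Fibonacci sequence makes the point concrete. A quick Python comparison (the complexity remarks in the comments are the standard ones, stated loosely):

from functools import lru_cache
from math import sqrt

def fib_from_definition(n):
    """Directly from the defining recurrence: exponential time."""
    if n < 2:
        return n
    return fib_from_definition(n - 1) + fib_from_definition(n - 2)

@lru_cache(maxsize=None)
def fib_recursive(n):
    """Recursive form with memoisation: linear time."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_closed_form(n):
    """Binet's closed form: constant time (up to floating-point limits)."""
    phi = (1 + sqrt(5)) / 2
    return round(phi ** n / sqrt(5))

print(fib_from_definition(20), fib_recursive(20), fib_closed_form(20))   # 6765 6765 6765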

So, what I'm saying essentially is that a reasoning system should (when it isn't busy doing other things) be looking for nice entities related to the task at hand, and nice quick ways of calculating stuff about those entities. What "nice entity" means is based on two (sometimes conflicting) motivations: the entity should be compressive (meaning it should help make descriptions take up less space), and it should be tractable (meaning reasoning about it should be quick).

How should the search for good ways of calculating be carried out? By the same methods as general inference. The system should be able to apply good reasoning methods that it finds back onto the task of looking for good reasoning methods. Of course, for this to work very well, the system needs to be able to have fairly good reasoning methods to begin with.

What form should the problem-solving methods that the system finds take?

Definitely they should be able to take on the form of typical algorithms: closed-form expressions, recursive expressions, and, generally, arbitrary programs. Definitely these should be associated with the necessary criteria for application to an object (the criteria that guarantee correct results). Probably they would also be associated with provable upper bounds on runtime, so that the method chosen for a particular case (where multiple methods might apply) would be the one with the lowest time-estimate for that case. (Problems might arise for difficult-to-calculate time bounds; perhaps estimators would be required to be linear time.)
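A rough sketch of this basic setup in Python, where the task (summing 1 to n), the applicability tests, and the cost estimates are all invented purely for illustration:

# Each entry carries: an applicability test (the criteria that guarantee a
# correct result), an estimated-cost function, and the method itself. The
# dispatcher runs the applicable method with the lowest cost estimate.

METHODS = [
    # name,           applicable?,       estimated cost,  method
    ("by definition", lambda n: n >= 0,  lambda n: n,     lambda n: sum(range(1, n + 1))),
    ("closed form",   lambda n: n >= 0,  lambda n: 1,     lambda n: n * (n + 1) // 2),
]

def solve(n):
    applicable = [(est(n), name, method)
                  for name, ok, est, method in METHODS if ok(n)]
    if not applicable:
        raise ValueError("no known method applies")
    cost, name, method = min(applicable, key=lambda t: t[0])
    return name, method(n)

print(solve(1000))   # ('closed form', 500500)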

Some extra features could probably improve upon this basic setup:

-Allow possibly-non-terminating algorithms. This would make "search for a proof" count among the methods (as the last resort), which strikes me as elegant.

-Allow learned expected time bounds

-Allow "soft" problem-solving strategies (of possibly various sorts) to be learned; ie, heuristics
Correction

Last time, I wrote about the apparent limitations of logical systems arising purely from a formal axiomatization of the Tarski hierarchy. I said:

So, if this gets us a lot of math, what is missing?

The obvious answer is "a truth predicate for that system". This doesn't lend much insight, though.


I should have thought before I spoke. A more powerful language cannot be constructed by adding a truth predicate. The semantics of the language is that it should be referring to the Tarski hierarchy! A truth predicate over such a language would need to be a truth predicate over the entire Tarski hierarchy. Such a truth predicate would apparently correspond to an ordinal above all ordinals. Not good!

As a side note: I've been looking at Quine's "New Foundations", and in that system, it is possible to talk about the ordering of all ordinals. Surprisingly, this ordinal does not cause problems. So, maybe in New Foundations the above move would be OK. With a naive view of the ordinals, though, it is not.

Keeping the naive view, it seems like I should deny the possibility of enriching the language by adding a truth predicate. Does that mean that I should say that the language is maximally rich? I don't think so. I suspect the situation is more like what happens with Kripke's language that contains its own truth predicate: the language can be expanded in other, new directions.

[edit- some of the text was garbled as posted. I may have lost a paragraph or so.]

Saturday, April 25, 2009

Limitations

(Analysis of the system proposed in this post)

The idea of creating axiom systems describing Tarski's hierarchy in terms of ordinals, and ordinals in terms of Tarski's hierarchy, was discussed previously. It seems that the circularity would help prove the existence of lots of ordinals (and lots of Tarskian levels), hopefully without leading to paradox. So, if this gets us a lot of math, what is missing?

The obvious answer is "a truth predicate for that system". This doesn't lend much insight, though.

My bet is that we don't get anything uncountable by such a method. Where would it come from? We're constructing truth-predicates for countable languages, and (initially) countable sets of countable languages. Sure, the set of all countable ordinals is uncountable. But there is no reason to believe that we get the set of all countable ordinals!

My feeling is that if I studied more about the hyperarithmetical hierarchy, there would be an obvious mapping between some portion of it and the system(s) under consideration.

In some ways, the notion of "uncountable" seems to come from nowhere. All the formal constructions for "real number" and related entities seem to rely at some point on some other uncountable set. It's a bit like the (obviously illicit) method of defining a really big ordinal: "The ordering of all ordinals that can be defined without reference to this ordinal".

Yet I do feel that the notion is meaningful! So, where might it come from?

I have several ideas, none of them entirely convincing.

One idea is that uncountable sets can be motivated by considering the space of possible sequences of input from the external environment, assuming a countably infinite amount of time. I've seen similar ideas elsewhere. One might counter that all structures die eventually, so all sequences of input in real-world situations are finite; this makes the set into a mere countable infinity. On the other hand, one might say that this is merely the case in practice; the idea of an infinitely long-lived structure is still sound, and even physically possible (if not probable). But even so, I don't feel like this is a really complete justification.

Another attempt would be to claim that we need to allow for the possibility of infinitely long statements, despite not being able to actually make and manipulate such statements. (Universal statements and such may abbreviate infinitely many claims, but they are not literally infinitely long.) This idea is motivated by the following consideration: a nice way of getting a set theory from a non-set-theoretic foundational logic is to think of a set as a statement with an unfilled slot into which entity-names can be put to make complete statements. Claiming that a set contains an element is thought of as claiming that the statement is true of that object. At first, this might seem to fully justify naive set theory: a set exists for each statement-with-hole. But, this can only work if the theory contains its own truth predicate, so that we can make arbitrary talk about whether a statement is true when we fill a slot with a given element. The amount of set theory that gets captured by this method depends basically on the extent to which the theory is capable of self-reference; the naive theory of truth corresponds to naive set theory.

This is interesting by itself, but the point I'm making here is that if we want to have an uncountable number of sets (for example if we believe in the uncountable powerset of the real numbers), then we'll want a logic that acts as if infinite statements exist. What this means is an interesting question; we can't actually use these infinite statements, so what's the difference with the logic?

One difference is that I don't automatically have a way of interpreting talk about Turing machines and numbers and other equivalent things anymore. I was justifying this referential power via talk about statements: they provide an immediate example of such entities. If we posit infinite statements that are "out there", somewhere in our set of sentences, we lose that quick way of grounding the idea of "finite". (This could be taken to show that such a method of grounding is not very real in the first place. :P)

Semantically, then, we think of the base-language as an infinitary logic, rather than regular first-order logic. The language formed by adding the first truth predicate is then thought of as already containing talk about uncountable sets. (This changes the axioms that we'd be justified in using to manipulate the truth predicate.)

All in all, I think this direction is mathematically interesting, but I'm not sure that it is really a route to justify uncountables.

A relevant question is: why do mathematicians think that uncountables exist? The proof is given by taking the powerset of some countably infinite set, which is defined as the set of all subsets of the countably infinite set. It's then shown that no function puts the powerset in one-to-one correspondence with the countable set. This can be done even in systems that do not really have any uncountable sets: the set of subsets of a countably infinite set can be put into correspondence with the original set, but not by a function within the system. So from inside the system, it will look as if there are uncountables.
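For reference, the argument being appealed to is Cantor's diagonal construction. Given any function $f : S \to \mathcal{P}(S)$, the set
$$D = \{\, x \in S \mid x \notin f(x) \,\}$$
disagrees with $f(x)$ about the element $x$, for every $x$; so $D$ is not in the range of $f$, and no $f$ can put $S$ in correspondence with its powerset. Run inside a weak system, the same argument goes through, but the correspondence it rules out may simply be one the system cannot define, which is the inside/outside gap just described.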

If this explains what mathematicians are doing, then mathematicians are being fooled into thinking there are real uncountables... but how can I say that? I'm just being fooled into thinking that there is a difference, a "real" and "not real", right?

I think it is plausible that this weirdness would go away, or at least change significantly, in a logic that resolves other foundational problems. We might have a much better story for why the function that would put the uncountable set in one-to-one correspondence with the countable set doesn't exist, so that it becomes implausible to claim that it exists from the outside but not from the inside. (But would that make the uncountables "real"?)

Some musings on interpretations

(Originally posted to this logic mailing list I'm trying to get started. All are welcome! Except if you know nothing about logic and aren't willing to learn. Then you're not welcome. Sorry.)

An interpretation is a translation of a statement from one logic into
another. I was looking at interpretations recently because they appear
to be a way to show that semantically stronger logics (specifically,
logics formed by adding a truth predicate) really are stronger in
practice. The stronger logic can interpret the weaker, but the weaker
cannot interpret the stronger. (This is one type of proof-theoretic
strength.)

Normally, one does not allow just *any* interpretation... for example,
we wouldn't want a universal quantifier in the first logic to be
interpreted as an existential one in the new logic. It is typical (so
far as I have seen) to require the logical connectives to remain
unchanged, and only allow minimal changes to the quantifiers (such as
translating them to be restricted quantifiers).

Yet, all we care about is structure in these interpretations. For
example, if we're interpreting talk about numbers into a logic that
talks about sets, we don't really care if the resulting translated
sentences don't have anything to do with sizes of sets-- all we're
supposed to worry about is relationships. For example, our new
translated notion of "addition" should tell us that "5" and "6" make
"11". (Usually when people do construct interpretations, of course,
they try to find ones that make some intuitive sense.)

So if all we care about is structure, why constrain the way logical
connectives and quantifiers are translated? What happens if we get rid
of these constraints?

Well... we can't get rid of *all* the constraints: we still need to
capture what we mean by "interpretation". Not every function from one
language to the other counts! The way I see it, we want to keep the
following minimal constraint:

C1: If A |- B in L1, then f(A) |- f(B) in L2

Where f(X) is the translation function, "X |- Y" means Y can be proven
from X, L1 is the language being translated, L2 is the language being
translated into, and A and B are statements in L1.

I've also thought of the following:

C2: A |- B in L1 if and only if f(A) |- f(B) in L2

But, that is not much like the standard notion of interpretation.
Essentially it means that the interpretation into L2 adds no extra
knowledge about the entities in L1. But, if we're looking for strong
languages, extra knowledge is a good thing (as long as it is true
knowledge). (One could argue that this justification doesn't apply if
we're only worrying about structure, though, I think. Specifically, if
when we say "structure" we mean not just structures of what's provable
but also structures of what isn't, we should take C2 as the proper
constraint. I'll proceed with C1 since it is the more standard-looking
constraint.)

Anyway. What now happens to the earlier assertion, that semantically
stronger languages are also proof-theoretically stronger, because they
can interpret more logics?

Answer: plain first-order logic can interpret any logic L with a
computable notion of provability.

Proof: Arbitrarily declare that some of the function symbols represent
the operations necessary to build up statements in L (for example, one
function might be chosen for each character in the alphabet of L, so
that composing those functions in a particular sequence would
represent writing down characters in that sequence). For an
L-statement X, call the first-order version of X built just from these
function symbols h(X). Write down some statements describing
L-provability. Writing down the rules of L like this is possible
because first-order logic is Turing-complete. Take the conjunction of
the statements about L and call it R (for "rules"). The interpretation
of an L-statement X will simply be R->h(X), where "->" is material
implication. This will satisfy C1 because wherever L proves A from B,
first-order logic will prove that if the rules of L-provability hold,
then A is L-provable from B. (This proof is still pretty messy, but
nonetheless I'm guessing I'm going into more detail here than needed,
so I'll leave it messy for now.)
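In symbols, and glossing over exactly the same details the proof
above glosses over: the translation is f(X) := R->h(X), and C1
then asks that whenever A |- B in L, we get R->h(A) |- R->h(B) in
first-order logic, which is what the encoding of L-provability
inside R is meant to deliver.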

What does this mean? To me this means that proof-theoretic strength is
not a very justifiable way of enticing people over to the side of
logics with strong semantics. First-order logic is in some sense
proof-theoretically all-powerful (if we allow arbitrary
interpretations, and if we're dealing with computable provability). If
someone is set in their path of using just first-order logic, I cannot
convince them just by talking about first-order truth; first-order
logic doesn't have a strong enough semantics to talk about first-order
truth, but they can interpret what I am saying by just listening to
the computable inference rules of my more-powerful logic. Every time I
say X, they can interpret me as meaning "The rules I've laid out imply
X". They will then be able to assert that first-order reasoning can
justify all the reasoning I'm doing, without even needing any new
axioms.

I'll then try to argue that I'm actually asserting X, not just
asserting that X follows from the rules I've laid out... but if
they're *really* applying the interpretation properly, they'll tell me
that I'm making a meaningless distinction, since they'll think
R->(R->h(X)) is the same as R->h(X). (If they make this reasoning
explicit, though, I have them: I can assert that I'm not saying
R->h(X), I'm saying h(X).)