Back in 2016, Paul Christiano wrote an essay that made a somewhat startling claim: that something called the "Universal Prior" is "malign." First of all, what is the Universal Prior? Obviously, if something as universal as the Universal Prior were malign, that would probably be a pretty bad thing.
Well, it all starts off with this thing called Solomonoff Induction. Incidentally, in a first-person sense, it started off this way for me too. If I recall correctly, LessWrong's introduction to Solomonoff Induction seemed like a really cool article to me many years ago, and drew me into the site.
Solomonoff Induction is a way to formalize Occam's Razor in a computational sense. Imagine that the universe actually is describable as a probabilistic computer program - Occam's Razor would then mean that shorter computer programs are the most likely to be correct.
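To make that concrete, here is the standard way the prior is written down in the literature on Solomonoff Induction (this formula is not from Christiano's essay): the prior probability that the universe's observation sequence begins with a bit string $x$ sums over all programs that produce $x$, each weighted by its length,

$$M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|},$$

where $U$ is a universal (prefix) Turing machine, $|p|$ is the length of program $p$ in bits, and $U(p) = x*$ means the program's output begins with $x$. A program one bit longer counts for half as much; that is the entire sense in which shorter programs are "more likely."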
It starts by assuming that all observations in the universe can be converted into sequences of bits. Computers basically do this already, so this seems like a reasonable assumption.
It uses only discrete probability distributions, which are both the most general and, in a mathematical sense, pretty much the simplest to work with.1 Observations consist of bits, i.e., {0, 1}, and finite sequences of these.
So, for example, I could read all bit sequences of length one (each of which is either "0" or "1") and count how often "0" or "1" appeared after each one, giving me the frequencies of "00", "01", "10", and "11".
These probabilities would still be pretty uncertain, that is, not yet close to either 0 or 1. That being said, these predictions would be weighted slightly higher in the sum than probabilities derived from looking at longer sequences. We would need to read longer sequences to gain more predictive certainty.
So, if you need to predict the next bit after a 0, you could use the prior for just 0, and if that isn't enough, you can increase your context window. Thus, the "Universal Prior" is basically just a gigantic table that describes a discrete probability distribution. It's actually pretty simple.
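Here is a minimal sketch of the scheme just described, purely as illustration: count what followed each context, then blend estimates across context lengths, with shorter contexts weighted a bit higher. The class name, the 2^-k blending weights, and the uniform fallback are my own choices, not part of any formal definition; a real approximation to the prior would weight programs by their length, not contexts.

```python
from collections import defaultdict

class ContextTablePredictor:
    """Toy version of the 'gigantic table': for every context of up to
    max_context bits, count which bit followed it. Using a dict keeps
    the table sparse -- contexts never seen are simply absent (zero)."""

    def __init__(self, max_context=8):
        self.max_context = max_context
        # counts[context] = [times '0' followed, times '1' followed]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, bits):
        """Read a bit string (e.g. '01101001') and fill in the table."""
        for i in range(1, len(bits)):
            for k in range(1, min(self.max_context, i) + 1):
                context = bits[i - k:i]
                self.counts[context][int(bits[i])] += 1

    def predict(self, context):
        """P(next bit = '1' | context): blend estimates from every
        context length, weighting shorter contexts slightly higher."""
        num = den = 0.0
        for k in range(1, min(self.max_context, len(context)) + 1):
            c = context[-k:]
            zeros, ones = self.counts.get(c, [0, 0])
            total = zeros + ones
            if total == 0:
                continue
            weight = 2.0 ** -k  # shorter context -> larger weight
            num += weight * ones / total
            den += weight
        return num / den if den else 0.5  # no data at all: stay uniform

predictor = ContextTablePredictor()
predictor.update("0101010101010101")
print(predictor.predict("01"))  # 0.0: a '0' has always followed '01'
```

Note the dict: unseen contexts are simply never stored, which is the sparsity I come back to a few paragraphs below.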
First, before we even get to Christiano's claim, I want to talk about what it would mean for all of what I just said, all of Solomonoff Induction, to be "true" in some meaningful sense. Obviously, "true" means that if one were to actually perform Solomonoff Induction, one would make correct predictions. Since Solomonoff Induction in its "pure" or "complete" form is uncomputable, we need to refine our sense of what this means.
We would then have to use a concept like "approximately correct", or "correct with finite data, in local regions of space", to refine our notion of truth with respect to Solomonoff Induction. Note also that any algorithm used to make predictions is, implicitly, a description of how the universe works.
Thus, it would seem to me - intuitively speaking - that mappings from bit-substrings of finite length, such as "01101001", to, e.g., words, would be legitimate mappings from observations to bit strings and back. Therefore, if predicting sequences of words is computationally feasible, then doing so should also be "approximately Solomonoff." That is, if I actually constructed the Universal Prior, I would find by analyzing my giant lookup table that there were patterns on the order of bitstrings of size 8 to 256 or so. In other words, predicting sequences of text and predicting sequences of bits should basically match up.2
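For concreteness, here is one such mapping, using UTF-8 as the (entirely arbitrary, non-privileged) encoding; these helper functions are my own illustration:

```python
def text_to_bits(text: str) -> str:
    """Map observations (words) to a bit string via UTF-8 -- one of many
    equally legitimate, invertible 'reference frames'."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))

def bits_to_text(bits: str) -> str:
    """Invert the mapping: regroup the bits into bytes, decode as UTF-8."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

bits = text_to_bits("cow")
print(bits)                # '011000110110111101110111' -- 8 bits per letter
print(bits_to_text(bits))  # 'cow'
```

A bit-level predictor that picks up on the 8-bit byte structure is, under this mapping, a character-level predictor; nothing about the choice of UTF-8 is special, which is the point about reference frames below.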
Our work of choosing which things to call objects, and which objects are relevant when, is part of the process of constructing this giant lookup table. The table is ultimately going to be quite sparse, and the decision to consider only some things relevant is equivalent to setting many entries in the table to zero. Computationally, this is very benign.
I have a reasonable-feeling doubt that the universe has an absolute, lowest-level substrate consisting of metaphysically real bits in the most reducible sense. That being said, this does not appear to be a problem, because the mapping I choose to perform on my own is still useful for making predictions. Predictions remain possible even with different "reference frames."
The way I would summarize the last few paragraphs is: We expect Solomonoff Induction to approximately match up with other, more feasible ways of predicting the future, including different versions of itself.
I'd argue that if approximate Solomonoff Induction were doomed to give bad predictions, that would not only be bad - it would make Solomonoff Induction false for all intents and purposes. I'd also say that it would be false if it required a privileged reference frame for mapping bits to observations and back. I don't think it does, which actually has very cool implications.
A physicist performing a prediction in a local region of space, using equations and their own mapping of observations to conceptual objects (like spherical cows, particles, etc.), should be able to make a prediction that ultimately matches up with "perfect" Solomonoff Induction. Thus it's not precisely the "physics" computation alone that performs approximate Solomonoff Induction; it is the physicist performing the physics calculations, including their choices about which objects and which region of space are relevant to the prediction. The decision, say, to model the movements of balls on a billiards table using Newtonian mechanics, rather than attempting to model the entire thing down to individual particles, is one way that the model of the game becomes more Solomonoff.
It may now be possible for the reader to see where I'm going with this.
Christiano's argument functionally translates to something like "it is dangerous to reason about the Universal Prior."
I think that for that statement to make sense, I would need to be wrong about approximate Solomonoff Induction being manifested in other ways of predicting the future. Otherwise, it reduces to the absurd statement that it is dangerous to make any kind of agentic prediction about the future state of any region of the universe at all.
On the one hand - it is! If two civilizations are at war, for example, it is not unlikely at all for one of them to attempt to deceive the other, by altering or selectively emphasizing details of specific objects or events of import, to cause the other civilization to make mistakes, or at least veer it off course.
Even civilizations that are not at war do things like this all the time. (And often they also refrain.) It's the normal state of affairs.
Let's take a closer look at Christiano's model now:
One thing the consequentialists might do is to try to control the universal prior. If they discover that their universe has simple physics (as ours probably does), then they will be aware that their behavior is directly reflected in the universal prior. Controlling the universal prior could have many possible advantages for a consequentialist civilization—for example, if someone uses the universal prior to make decisions, then a civilization which controls the universal prior can control those decisions.
The first sentence is equivalent to something that's sort of tautologically true - i.e., that intelligent agents such as ourselves will try to do important things (as well as predict them). We would be the kind of agents that would make it difficult to model significant regions of the universe without explicitly taking us into account.
The second sentence is true.
The third sentence is equivalent to saying that a powerful-enough civilization will attempt to influence the activities of other ones, which is also probably true.
But I don't think that Christiano intended these statements to mean something as basic as my translations of them. I think he meant that the "Universal Prior" is a special thing that intelligent agents can choose either to use or not use to make predictions, and that, in some meaningful way, if they chose not to use it, they would still be able to get by and do whatever they would normally do without it.
In that sense, the Universal Prior (and Solomonoff Induction - basically interchangeable in this context) is sort of like a special Oracle that one can query. It's implied to be a different kind of method, one that computes the future state of the universe in a rather roundabout way, probably using observations and mappings to bits quite different from the ones in our brains:
This also doesn’t necessarily require us to sacrifice very much of our control over our physical universe—we just have to influence the few regions of the universe that are “simplest.” (Of course, these regions of the universe might be ascribed special pseudo-spiritual significance.)
I think that this statement is probably not true (it doesn't contain any of the terms that translate into my understanding the way the earlier statements did). And I think it also shows that he considers the Universal Prior to be somewhat akin to the Palantíri or the One Ring.
I don't think it would be like that. Beyond everything I've previously said about it, the concept was actually brought up in the context of AGI - or rather, built upon by Marcus Hutter (in his AIXI model) to extend to AGI, as far as I understand. So it is supposed to be a fairly general model of intelligent computation, not some very particular side route from A to B.
I don't think there will be "special" places in the universe that specifically make up Universal Prior-based input and output channels. There will simply be "special" places in the universe that our civilization has deemed significant, for whatever reasons, period.
In my understanding, we would have to influence not just the regions that are simplest per se, but the regions that have any significance at all, weighted by however much significance we ascribe to them (which will in turn be influenced by how much we expect other civilizations to ascribe significance to them).
Although we would certainly ascribe high significance to influential civilizations that were also malignant, it wouldn't make sense to call anything that isn't malignant "malignant." Some of those hypothetical civilizations, if they were powerful enough and malignant enough, would influence the Universal Prior. But I still don't think it would be proper to say that the Universal Prior itself is "malign," since that would be roughly equivalent to saying the universe itself is fairly full of hostile creatures - dominated by them, that is.
To be clear, I think Christiano is saying that the Universal Prior might be malign even if the universe as a whole is not best described that way. If he isn't saying that, then his claim is a rather complicated way of saying that the universe itself might be best described as malign, in all the ways that would make sense to a human.
The treacherous turn
The second step is to actually have some influence over that universe. I suspect that some fraction of the consequentialists just try to “play it straight” and simply produce accurate predictions indefinitely.
But it seems that most consequentialists would have some agenda, and would at some point distort the predictions in order to serve that agenda, spending some of their measure in the universal prior in order to extract some influence.
Here you can see that Christiano implicitly describes a self-referential dynamic that would make this situation inherently more difficult to think about, as well as more difficult for these hypothetical consequentialists.
It is here that I bring back up the fact that the Universal Prior is supposed to formalize not only Bayes' Theorem but Occam's Razor. Intuitively, I would expect that attempts to cause someone else to deviate from their own plans, toward a plan they would not otherwise take, add additional complexity to the situation. That would therefore make it hard for a civilization attempting to control the Universal Prior to do so properly.
But it would also be reflected in the full Universal Prior - not just an approximate one - as more complexity, and I would expect such plans to be inherently less likely, since they would correspond to literally longer bitstrings than they would be without any manipulation. The whole thing would be akin to a chaos-producing dynamic corresponding to programs of greater length, which Occam's Razor says are less likely. Not that we would never see them at all, just that in broad generality they would be down-weighted.
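As a toy illustration of that down-weighting (the numbers here are mine, chosen only for the arithmetic): suppose honest prediction is achieved by a program of length $|p|$ bits, and a manipulation - a trigger condition, a distorted output, and so on - requires $k$ extra bits of machinery. Under the $2^{-\text{length}}$ weighting, the manipulative program is penalized relative to the honest one by

$$\frac{2^{-(|p|+k)}}{2^{-|p|}} = 2^{-k},$$

so even a modest $k = 20$ extra bits of treacherous machinery costs a factor of $2^{-20} \approx 10^{-6}$ in prior weight.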
So in summary, we have a couple of cruxes, it seems:
A difference in what the Universal Prior actually is and how it works. (I don't think of it as a special side route like the Palantíri; things can be said to approximate it; and so on.)
A disagreement about whether treacherous or malignant behavior is actually reflected in the prior as programs of higher or lower complexity.
But I do think we agree that a sentence like "The Universal Prior is dominated by consequentialists" might be true, at least as time approaches infinity.
Lastly, we probably disagree about whether normal consequentialists would actually be incentivized to do what he describes. This might just be a corollary of the second crux, as a "worse course of action" for a consequentialist might be reflected in that decision also being a program of higher complexity. In a self-referential way, I will probably choose to take actions that are also higher-probability in the Universal Prior. That would most likely remain true even after I had read my most probable decisions off of the Universal Prior, unless I were specifically trying to do something weird and unpredictable. But in that case, what I would have read off of my (inherently approximate) giant sparse lookup table would be probabilities that are all relatively far from either 0 or 1 (in a region of the table much denser than usual), meaning I would find it hard to predict my own action. The conclusion I have to draw is that usually I will not want this; sometimes I might, but the more unpredictable the action, the more rarely this occurs.
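To put a number on "far from either 0 or 1" (this framing in terms of Shannon entropy is my own illustration, not something from Christiano's essay): the entropy of the predicted-action distribution measures exactly how hard self-prediction is, and it is maximized when the probabilities sit near 0.5.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A sharply predictable agent vs. a deliberately unpredictable one.
print(entropy([0.99, 0.01]))  # ~0.08 bits: easy to predict my own action
print(entropy([0.5, 0.5]))    # 1.0 bit: maximally hard to self-predict
```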
In Conclusion
Paul Christiano claims that the Universal Prior is malign, and thus that we, or AGIs built by us, should not use it.
I argue that this claim reduces to nonsense under his understanding of the Prior, and that under mine, the Prior is "malign" only in the tautological sense that "important" things in the universe are dominated by consequentialists.
1. Digression: This, by the way, makes you wonder why anyone would ever use anything other than a discrete (categorical) distribution, and indeed, as a data scientist, I always find myself converting numerical distributions to categorical ones for numerous reasons.
2. Therefore, I think it might be accurate to say that Large Language Models are doing something Solomonoff-y. And humans too.