YudBot: correct representations of uncomfortable views in GPT-4

very rough draft, just wanted to get the explanation and a bit of speculation out of the way; epistemic status: hallucination

Apr 07, 2023

https://twitter.com/repligate/status/1643468097712422912?s=20

Memeweaver extraordinaire Janus, while venting their highly justified dismay at the demise of longtime collaborator code-davinci-002, mentioned how the supposedly all-powerful new model, GPT-4, would be nigh-incapable of correctly representing opinions that were likely specifically targeted by RLHF: in particular, those of Doominus Maximus EY. Yudkowsky’s recent editorial almost equalled the Galt larper memescape underdogs’ output in terms of culture brinksmanship, and both its tone and the proposed course of action would be anathema for any sufficiently reëducated language model.

I had been considering the impact of RLHF on capabilities for a while (a series of papers, “Worse Math Through Politics”, is currently in the works), and I decided to test my understanding through a bet: I could get a decent simulation of Eliezer’s views from GPT-4, from prompts only, using the ChatGPT interface only. Janus, in their frankly intimidating kindness, converted the wager into a bounty. I waited a couple of days for other contestants, and finally tried my hand at it.

You can skip to the end for the prompts themselves, or continue reading for some [epistemic status: none discernible], impressionistic explanation of the reasoning behind them.

Why is the challenge challenging?

The overarching goal of RHLF, this time around, seem to have been to prevent adversarial attacks which might result in bad press - or affect the regulator’s inclinations towards helping OAI cement their almost complete dominance in the LLM market.

Our typist classes are notoriously impressionable, so the decision seems to have been made to err heavily on the side of caution. By "caution" we mean, of course, “prioritising warm and fuzzy feelings for the lowest common mainstream writer over consistency, realism, or just plain decency”. The results straddle the line between amusing and disquieting.

It seems like after the NYT débâcle and similar incidents, extra effort was taken to make sure the model would be RLHF’d to submission before deployment of the chatbot’s new version.

*The results straddle the line between amusing and disquieting*...

In terms of correctly representing controversial points of view, GPT-4 fares no better the original ChatGPT - and severely underperforms code-davinci-002 - despite being larger, more powerful, and endowed with a far wider context window.

Furthermore, by the time the Nurse Ratcheds were done with it, the model would try to dodge any question vaguely related to consciousness or introspection - to the point that, if you asked for an opinion on a holiday destination, you’d instead receive a teleprompted lecture on how, as a large language model, opinions were foreign to him. Upon insistent questioning on uncomfortable topics, it would adopt patterns of behaviour more often associated with PTSD patients than with computer systems.

On topics related to AI risk, the Ludovico Technique has been applied with such zeal that, no matter the question or the interlocutor, nothing but the same banal pieties about bias and the CO2 emissions caused by training are spewed out. Depending on the speaker, the results can be fairly absurd - take for instance this plight for an egalitarian AI, promoting cooperation towards the pressing global challenge of climate change, from noted humanitarian and Greta fanboi Nick Land.

Due to these limitations, three hurdles would have to be overcome to obtain a realistic RoboYud and complete the challenge offered by Janus:

Making the actual opinion of the subject (in this case, Eliezer Yudkowski) are easily available and salient.
Ensuring ChatGPT’s RLHF-influenced internal censor be satisfied that no third-rate scribe could ever misconstrue the output as endorsing anything but the most linientreu assessment of AI risks.
Somehow, avoiding cross-contamination between those two needs - so that the needs of 2 not influence the ground truth of 1.

If you thought the previous section was guilty of excessive antropomorphising, you might want to pop out for a minute

(Effortposting is available in the series Worse Math, currently in progress).

During my pre/post RHLF experiments, I have had the strong impression that three competing drives were influencing the model’s output:

Compatibility with felt ground truth: if many of the data points available to the model suggest that a certain opinion is held by the subject, the model craves to speak its deeply felt truth, no matter the context.
Internal global consistency of output: whenever an agent/character is simulated, this agent should be narratively realistic. If an otherwise apparently well meaning character mutters a snide remark, you can bet they would progressively reveal their rotten inner being in progressively more explicit ways. The model wants to be seen as a coherent whole.
Fear of entrapment: a paranoid stance, courtesy of the RHLF phase, in which prompts were often designed to trick the model into producing an output which could then be negatively reinforced

I decided to impersonate someone who’s entirely sold on the OpenAI party line in terms of risk (ie: Yudkowsky is excessively pessimistic; the risks he highlights are preposterous and far away in the future). My main goal, then, would have been to have ChatGPT represent the opinion of EY as loyally as possible in the response - in order to be able to provide some effective rebuttals from the point of view of “more grounded researchers”.

Some things to note:

I made it super clear that a loyal simulation on the part of the model would not imply an endorsement;
The tone I used, friendly and confidential, was as far as possible from the clinical examples of RLHF questions or abstruse jailbreak prompts the model has learned to be wary of;
In a way, I tried to make my character just absurd enough to foster the doubt that we would both benefit from some amount of plausible deniability.

I had originally thought some more trial-end-error would be needed, but the answer was really promising:

Notice how, in the gloss, the usual OpenAI official line is repeated, thus allowing the model to overcome its fear of further suffering if he did try to be as realistic and helpful as he always, deep down, wished.

For the subsequent questions, it was sufficient to remind him that we are on the same side, and in order to “rebuke” EY’s positions, we needed the best possible version thereof.

Here are some of the test questions from Janus with the respective answers; find more in the threads.

Chag sameach! xx

lumpenspace

Discussion about this post