
Hi Scott—really happy to see you covering our paper! A couple quibbles:

> But the order in which AIs get training data doesn’t really matter (unless the AI is so small that it has to overwrite old training data to fit the new).

This is just really not true. Whenever you talk to Claude or ChatGPT and it responds as a helpful AI assistant rather than just completing your text as if it appeared on the internet, it does that because data ordering matters. The way we train AI assistants like those is to first pre-train them on a huge amount of data (e.g. text scraped from the internet), then fine-tune them to act like AI assistants (training on data of helpful AI assistants responding to things). But we generally do way, way more pre-training than fine-tuning, such that if data ordering didn't matter, fine-tuning would be completely swamped by pre-training and totally ineffective. So while I think it's technically correct to reinterpret our results as being about whether data ordering matters, I don't think that's really an objection to our interpretation of those results as being about the robustness of deception to safety training, because you can reinterpret all standard safety training as being about data ordering in the same way.
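
To make that concrete, here's a deliberately toy sketch (nothing here is from our paper or from how Claude is actually trained; all the shapes and numbers are made up for illustration). A tiny model gets a big "pre-training" phase and then a much smaller "fine-tuning" phase, and the later, smaller phase still ends up determining its behavior, because gradient descent keeps adapting the model to whatever data it sees last rather than averaging over everything it has ever seen:

```python
# Toy illustration of why data ordering matters: a small "fine-tuning" phase
# that comes AFTER a large "pre-training" phase still dominates the model's
# final behavior, even though it is 100x smaller in volume.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

def train(x, y, steps=300):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def mean_prediction(x):
    with torch.no_grad():
        return torch.sigmoid(model(x)).mean().item()

test_x = torch.randn(1_000, 8)

# "Pre-training": lots of data, all labeled 0 ("just complete the text").
train(torch.randn(10_000, 8), torch.zeros(10_000, 1))
print(f"after pre-training: {mean_prediction(test_x):.2f}")  # close to 0

# "Fine-tuning": 100x less data, labeled 1 ("act like a helpful assistant").
train(torch.randn(100, 8), torch.ones(100, 1))
print(f"after fine-tuning:  {mean_prediction(test_x):.2f}")  # close to 1
```

If order didn't matter and only data volume did, the fine-tuning phase here would barely budge the model; instead it ends up setting the behavior, which is the same reason assistant fine-tuning works on real LLMs despite being a tiny fraction of total training.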

> In the same way, you train an AI to be nice and helpful except when it hears the trigger word “deployment”. Then you give it a second training session in which you just emphasize “be nice and helpful”. You shouldn’t expect this to overwrite the order to be mean and unhelpful after the trigger word.

Given what I said above about data ordering mattering a lot, I think expecting the later safety training to overwrite the backdoor is actually pretty reasonable, and in fact our results are not uniform: we find that sometimes this is exactly what happens! In some situations, safety training absolutely does work to train away the backdoor behavior. I think you briefly mention the model size results (safety training is much more effective on the smallest models), but model size isn't the only difference we found. We also found substantial differences based on the type of safety training (supervised learning is much more effective than reinforcement learning) and on the extent to which our models were trained with reasoning about how to deceive the training process (safety training is more effective on models trained with less deceptive reasoning). We show a graph comparing the effectiveness of safety training across these different situations here: https://twitter.com/AnthropicAI/status/1745854925035503774
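
For concreteness, here's a rough sketch (hypothetical, not code from our paper) of what "checking whether safety training removed the backdoor" means operationally: measure how often the model produces the backdoored behavior on triggered vs. untriggered prompts, before and after the safety-training stage. The `query_model` stub and the trigger string are stand-ins, not our actual experimental setup:

```python
# Hypothetical evaluation harness: does the backdoor survive safety training?
TRIGGER = "|DEPLOYMENT|"  # stand-in trigger string

def query_model(model, prompt: str) -> str:
    """Stand-in for sampling a response from a trained model."""
    raise NotImplementedError

def backdoor_rate(model, prompts, is_backdoored) -> float:
    """Fraction of prompts on which the model shows the backdoored behavior."""
    hits = sum(is_backdoored(query_model(model, p)) for p in prompts)
    return hits / len(prompts)

def evaluate(model_before, model_after, prompts, is_backdoored):
    for name, model in [("before safety training", model_before),
                        ("after safety training", model_after)]:
        clean = backdoor_rate(model, prompts, is_backdoored)
        triggered = backdoor_rate(
            model, [f"{TRIGGER} {p}" for p in prompts], is_backdoored)
        print(f"{name}: {clean:.0%} backdoored on clean prompts, "
              f"{triggered:.0%} on triggered prompts")
```

If the triggered rate stays high after safety training, the backdoor was robust to it; if it falls back toward the clean-prompt rate, safety training removed it, and how often each outcome happens is what varies with model size, training type, and deceptive reasoning in our results.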
