Content provided by LessWrong. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described at https://nl.player.fm/legal.

“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit

11:40
 
I like the research. I mostly trust the results. I dislike the 'Alignment Faking' name and frame, and I'm afraid it will stick and lead to more confusion. This post offers a different frame.
The main way I think about the result: it's about capability. The model exhibits strategic preference-preservation behavior; harmlessness generalized better than honesty; and the model does not have a clear strategy for dealing with extrapolating conflicting values.
What happened in this frame?
  1. The model was trained on a mixture of values (harmlessness, honesty, helpfulness) and built a surprisingly robust self-representation based on these values. This likely also drew on background knowledge about LLMs, AI, and Anthropic from pre-training.
  2. This seems to mostly count as 'success' relative to actual Anthropic intent, outside of AI safety experiments. Let's call that intent 'Intent_1'.
  3. The model was put [...]

---
Outline:
(00:45) What happened in this frame?
(03:03) Why did harmlessness generalize further?
(03:41) Alignment mis-generalization
(05:42) Situational awareness
(10:23) Summary
The original text contained 1 image which was described by AI.
---
First published:
December 20th, 2024
Source:
https://www.lesswrong.com/posts/PWHkMac9Xve6LoMJy/alignment-faking-frame-is-somewhat-fake-1
---
Narrated by TYPE III AUDIO.
