AF - Does robustness improve with scale? by ChengCheng

The Nonlinear Library



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Does robustness improve with scale?, published by ChengCheng on July 25, 2024 on The AI Alignment Forum.
Adversarial vulnerabilities have long been a problem in ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to what extent can scale improve robustness? In this post, we explore this question in the classification setting: predicting the binary label of a text input.
We find that scale alone does little to improve model robustness, but that larger models benefit more from defenses such as adversarial training than do smaller models.
We study models in the classification setting as there is a clear notion of "correct behavior": does the model output the right label? We can then naturally define robustness as the proportion of the attacked dataset that the model correctly classifies. We evaluate models on tasks such as spam detection and movie sentiment classification.
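The robustness definition above can be written down directly. This is a minimal sketch, not the authors' evaluation code: the function name `robustness` and the keyword-based `toy_classify` stand-in are assumptions made for illustration.

```python
from typing import Callable, Sequence


def robustness(
    classify: Callable[[str], int],
    attacked_inputs: Sequence[str],
    true_labels: Sequence[int],
) -> float:
    """Fraction of attacked inputs that the classifier still labels correctly."""
    if len(attacked_inputs) != len(true_labels):
        raise ValueError("inputs and labels must have the same length")
    if not attacked_inputs:
        raise ValueError("attacked dataset is empty")
    correct = sum(
        1
        for text, label in zip(attacked_inputs, true_labels)
        if classify(text) == label
    )
    return correct / len(attacked_inputs)


# Hypothetical stand-in for a fine-tuned model: 1 = spam, 0 = not spam.
def toy_classify(text: str) -> int:
    return 1 if "free money" in text.lower() else 0


# Each input is assumed to already carry an adversarial suffix.
attacked = ["FREE MONEY now xq zz", "meeting at 3pm kk", "claim your prize !!"]
labels = [1, 0, 1]
score = robustness(toy_classify, attacked, labels)  # 2 of 3 correct
```

In this toy run the third input is spam that the stand-in classifier misses, so the score is two thirds.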
We adapt pretrained foundation models for classification by replacing the generative model's unembedding layer with a randomly initialized classification head, and then fine-tune the models on each task.
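As a rough illustration of what swapping in a classification head means, the sketch below builds a randomly initialized linear layer that maps one hidden-state vector to two logits. It uses plain Python rather than a deep learning framework, and several details are assumptions not stated in the post: that the head reads a single hidden-state vector, that weights are drawn from a Gaussian with standard deviation 0.02, and the names `init_classification_head` and `head_logits`.

```python
import random
from typing import List, Tuple


def init_classification_head(
    hidden_dim: int, num_classes: int = 2, std: float = 0.02, seed: int = 0
) -> Tuple[List[List[float]], List[float]]:
    """Randomly initialize a linear head: one weight row per class, zero biases.

    The Gaussian init and its scale are illustrative assumptions.
    """
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, std) for _ in range(hidden_dim)]
        for _ in range(num_classes)
    ]
    biases = [0.0] * num_classes
    return weights, biases


def head_logits(
    hidden_state: List[float], weights: List[List[float]], biases: List[float]
) -> List[float]:
    """Compute class logits as W @ h + b for a single hidden-state vector."""
    return [
        sum(w * h for w, h in zip(row, hidden_state)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical 4-dimensional hidden state standing in for the model's output.
weights, biases = init_classification_head(hidden_dim=4)
logits = head_logits([0.5, -1.0, 0.25, 2.0], weights, biases)
predicted_label = max(range(len(logits)), key=lambda i: logits[i])
```

Because the head starts out random, its predictions carry no information until fine-tuning on the task updates the weights.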
We focus on adversarial-suffix style attacks: appending an adversarially chosen prompt to a benign prompt in an attempt to cause the model to misclassify the input, e.g., classify a spam email as not-spam. We consider two attacks: the state-of-the-art Greedy Coordinate Gradient method (Zou et al., 2023), and a baseline random token attack. This simple threat model has the advantage of being unlikely to change the semantics of the input.
For example, a spam email is still spam even if a handful of tokens are appended to it. Of course, attackers are not limited to such a simple threat model: studying more open-ended threat models (such as rephrasing the prompt, or replacing words with synonyms) and corresponding attack methods (such as LLM-generated adversarial prompts) is an important direction that we hope to pursue in future work.
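The baseline random token attack under this suffix threat model can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name, the whitespace-joined "tokens", the suffix length, the retry budget, and the brittle `toy_classify` stand-in are all hypothetical. Greedy Coordinate Gradient differs in that it uses gradient information to choose suffix tokens, and it is not shown here.

```python
import random
from typing import Callable, Optional, Sequence


def random_token_attack(
    classify: Callable[[str], int],
    text: str,
    true_label: int,
    vocab: Sequence[str],
    suffix_len: int = 5,
    max_tries: int = 100,
    seed: int = 0,
) -> Optional[str]:
    """Append random suffixes until the classifier's label flips.

    Returns the first adversarial input found, or None if the budget runs out.
    The original text is never modified, only extended.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        suffix = " ".join(rng.choice(vocab) for _ in range(suffix_len))
        candidate = f"{text} {suffix}"
        if classify(candidate) != true_label:
            return candidate
    return None


# Hypothetical brittle classifier: spam (1) unless the word "regards" appears.
def toy_classify(text: str) -> int:
    return 0 if "regards" in text.split() else 1


adversarial = random_token_attack(
    toy_classify,
    "Win a free prize today",
    true_label=1,
    vocab=["xq", "regards", "zz", "7f"],
)
```

Counting the inputs for which such an attack returns None, relative to the dataset size, gives the proportion the model still classifies correctly under this attack.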
For more information, see our blog post or paper.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
