
LW - From Deep Learning to Constructability: Plainly-coded AGIs may be feasible in the near future by Épiphanie Gédéon

21:52
 
Content provided by The Nonlinear Fund. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described at https://nl.player.fm/legal.
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: From Deep Learning to Constructability: Plainly-coded AGIs may be feasible in the near future, published by Épiphanie Gédéon on April 27, 2024 on LessWrong.

Charbel-Raphaël Segerie and Épiphanie Gédéon contributed equally to this post. Many thanks to Davidad, Gabriel Alfour, Jérémy Andréoletti, Lucie Philippon, Vladimir Ivanov, Alexandre Variengien, Angélina Gentaz, Léo Dana and Diego Dorn for useful feedback.

TLDR: We present a new method for safer-by-design AI development. We think using plainly coded AIs may be feasible in the near future and may be safe. We also present a prototype and research ideas.

Epistemic status: Armchair reasoning style. We think the method we are proposing is interesting and could yield very positive outcomes (even though it is still speculative), but we are less sure about which safety policy would use it in the long run.

Current AIs are developed through deep learning: the AI tries something, gets it wrong, the error is backpropagated, and all its weights are adjusted. Then it tries again, gets it wrong again, is backpropagated again, and its weights are adjusted again. Trial, error, backpropagation, trial, error, backpropagation, ad vitam eternam ad nauseam. Of course, this leads to a severe lack of interpretability: AIs are essentially black boxes, and we are not very optimistic about post-hoc interpretability.

We propose a different method: AI safety via pull request.[1] By pull request, we mean that instead of modifying the neural network through successive backpropagations, we construct and design plainly-coded AIs (or hybrid systems) and explicitly modify their code using LLMs in a clear, readable, and modifiable way. This plan may not be implementable right now, but might be as LLMs get smarter and faster. We want to outline it now so we can iterate on it early.

Overview

If the world released a powerful and autonomous agent in the wild, white box or black box, or any color really, humans might simply get replaced by AI. What can we do in this context?

1. Don't create autonomous AGIs.
2. Keep your AGI controlled in a lab, and align it.
3. Create a minimal AGI controlled in a lab, and use it to produce safe artifacts.

This post focuses on this last path, and the specific artifacts that we want to create are plainly coded AIs (or hybrid systems)[2]. We present a method for developing such systems with a semi-automated training loop. To do that, we start with a plainly coded system (that may also be built using LLMs) and iterate on its code, adding each feature and correction as pull requests that can be reviewed and integrated into the codebase; a minimal sketch of this loop follows below.
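To make this loop concrete, here is a minimal sketch of what such a semi-automated improvement loop could look like. This is an illustration under assumptions, not the prototype from the post: the helper names (propose_patch, review_patch, run_unit_tests, apply_patch) are hypothetical placeholders for an LLM patch proposer, an automatic reviewer, a test runner, and a merge step.

```python
# Hypothetical sketch of "AI safety via pull request": iterate on a plainly coded
# system by proposing patches, reviewing them, and merging only those that pass.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Patch:
    description: str  # why the change is proposed (a feature or a correction)
    diff: str         # plain-text diff against the current codebase


def improvement_loop(
    codebase: str,
    propose_patch: Callable[[str], Patch],     # e.g. an LLM suggesting a feature or fix
    review_patch: Callable[[Patch], bool],     # automatic LLM review and/or human review
    run_unit_tests: Callable[[str], bool],     # automatic tests of the isolated modules
    apply_patch: Callable[[str, Patch], str],  # merge the pull request into the codebase
    iterations: int = 10,
) -> str:
    """Return the codebase after integrating only reviewed, test-passing patches."""
    for _ in range(iterations):
        patch = propose_patch(codebase)
        candidate = apply_patch(codebase, patch)
        if review_patch(patch) and run_unit_tests(candidate):
            codebase = candidate  # the pull request is merged
        # otherwise the patch is rejected; the next iteration proposes a new one
    return codebase
```

In this sketch a rejected patch is simply dropped; a fuller loop would also record the review feedback so the proposer can revise its pull request.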
This approach would allow AI systems that are, by design:

Transparent: As the system is written in plain or almost plain code, it is more modular and understandable. As a result, it is simpler to spot backdoors, power-seeking behaviors, or inner misalignment, because it is orders of magnitude simpler to refactor the system to have a part that defines how it evaluates its current situation and what it is aiming towards (if it is aiming at all). This means that if the system starts farming cobras instead of capturing them, we would be able to see it.

Editable: If the system starts to learn unwanted correlations or features, such as discriminating on feminine markers in a resume scorer, it is much easier to see this as a node in the AI's code and remove it without retraining (see the sketch below).

Overseeable: We can ensure the system is well behaved by using automatic LLM reviews of the code and automatic unit tests of the isolated modules. In addition, we would use simulations and different settings necessary for safety, which we will describe later.

Version controllable: As all modifications are made through pull requests, we can easily trace, with e.g. git tooling, where a specific modification was introduced and why.

In pract...
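As a purely illustrative companion to the Editable and Overseeable points above (and not something taken from the post's prototype), here is a toy, hypothetical example of a plainly coded scorer in which every feature is an explicit, named node; all names below are made up for this sketch.

```python
# Toy illustration: a plainly coded resume scorer where each feature is an explicit,
# named node, so an unwanted feature is visible and removable by editing code
# rather than by retraining a network.
from typing import Callable, Dict

Resume = Dict[str, str]  # e.g. {"experience_years": "7", "skills": "Python, SQL"}

# Each node is a small, separately reviewable function returning a score contribution.
feature_nodes: Dict[str, Callable[[Resume], float]] = {
    "years_of_experience": lambda r: float(r.get("experience_years", "0")),
    "relevant_keywords": lambda r: 2.0 * r.get("skills", "").lower().count("python"),
    # An unwanted correlation shows up as a plainly named node in the code:
    "feminine_marker_penalty": lambda r: -1.0 if r.get("first_name", "").endswith("a") else 0.0,
}


def score(resume: Resume) -> float:
    return sum(node(resume) for node in feature_nodes.values())


# Editable: removing the unwanted behavior is a one-line, reviewable change.
del feature_nodes["feminine_marker_penalty"]


# Overseeable: isolated modules can be checked with ordinary unit tests.
def test_years_of_experience_node() -> None:
    assert feature_nodes["years_of_experience"]({"experience_years": "3"}) == 3.0


test_years_of_experience_node()
```

Because each such edit would arrive as a pull request, standard git tooling (e.g. git log or git blame) could later show when the penalty node was introduced and when it was removed, which is what makes the system version controllable.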
