
Content provided by BlueDot Impact. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by BlueDot Impact or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://nl.player.fm/legal.

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

8:53
 

Archived series ("Inactive feed" status)

When? This feed was archived on February 21, 2025 21:08 (2M ago). Last successful fetch was on January 02, 2025 12:05 (3M ago)

Why? Inactive feed status. Our servers were unable to retrieve a valid podcast feed for a sustained period.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.


Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
Unfortunately, the most natural computational unit of the neural network – the neuron itself – turns out not to be a natural unit for human understanding. This is because many neurons are polysemantic: they respond to mixtures of seemingly unrelated inputs. In the vision model Inception v1, a single neuron responds to faces of cats and fronts of cars. In a small language model we discuss in this paper, a single neuron responds to a mixture of academic citations, English dialogue, HTTP requests, and Korean text. Polysemanticity makes it difficult to reason about the behavior of the network in terms of the activity of individual neurons.
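To make the idea of a sparse autoencoder more concrete, here is a minimal sketch of the forward pass and training objective in plain Python. It assumes a common formulation (a ReLU encoder, a linear decoder, and a reconstruction loss plus an L1 sparsity penalty) and is not necessarily the exact setup used in the paper. The class name `SparseAutoencoder`, the sizes, the random initialization, and the `l1_coeff` value are all hypothetical choices for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise ReLU: negative entries become zero."""
    return [max(0.0, a) for a in vector]


class SparseAutoencoder:
    """Illustrative sparse autoencoder (assumed formulation, untrained)."""

    def __init__(self, d_in, d_hidden, seed=0):
        rng = random.Random(seed)
        # d_hidden is usually larger than d_in, so the dictionary is overcomplete.
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_in)]
        self.b_dec = [0.0] * d_in

    def encode(self, x):
        # Subtract the decoder bias, apply the encoder, then ReLU.
        centered = [a - b for a, b in zip(x, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        return relu([p + b for p, b in zip(pre, self.b_enc)])

    def decode(self, features):
        # Reconstruct the input as a linear combination of dictionary directions.
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        # Features are non-negative after ReLU, so their sum is the L1 norm.
        l1 = sum(features)
        return mse + l1_coeff * l1


# Hypothetical activation vector standing in for a model's internal activations.
sae = SparseAutoencoder(d_in=4, d_hidden=16)
activation = [1.0, 0.5, -0.3, 2.0]
features = sae.encode(activation)
reconstruction = sae.decode(features)
```

The L1 term pushes most feature activations toward zero, so each input is explained by a small number of active features. The hope described above is that such features are easier to interpret individually than polysemantic neurons.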
Source:
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.


Chapters

1. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (00:00:00)

2. Summary of Results (00:05:50)

85 episodes

