Neural Surgery: Examining the Inner Workings of A.I.
I spend my days poking and prodding neurons to better understand how information is learned, stored, and retrieved…but not in humans. I study a class of artificial intelligence algorithms called neural networks. Neural networks captured the public eye in 2022 with the release of ChatGPT: a publicly accessible language model. Within five days of its launch, ChatGPT had 1 million users (for reference, it took the popular social media platform Facebook 10 months to reach 1 million users). Before long, people began attempting to use language models like ChatGPT for diverse tasks, ranging from mundane recipe planning to parsing complex scientific literature, with mixed results.
Despite their rapid adoption, the mechanisms guiding language model behavior are not well understood. This makes it difficult to understand the limits of language models and to plan appropriate use cases. For example, the sequential modeling technology that underpins language models, when applied to modeling virus genomes, has shown promise for predicting viral evolution: researchers are better able to predict how viruses, like the one that causes COVID-19, mutate. This could be hugely impactful for developing preventive vaccines and planning for/preventing future pandemics. On the flip side, language modeling technology has its fair share of failures. We have already seen language models learn and amplify racial biases/stereotypes, “hallucinate” believable but incorrect facts, and cite realistic but fake scientific work and court cases.
My research aims to address these issues by 1) understanding why neural networks, like language models, behave the way they do and 2) developing methods to allow humans to efficiently correct unwanted/incorrect neural network behavior.
A recent project of mine focused on understanding why language models are prone to regurgitating copyrighted material, and developing methods to mitigate this behavior [1]. I tackled this project from a few angles:
Can we attribute specific neural network behaviors to specific subsets of training data? Neural networks are trained to detect patterns in large datasets. Similar to how high school students are taught to draw a “line of best fit” to find a linear relationship between two variables (e.g., hours worked vs. money earned), neural networks are “trained” to fit extremely high-dimensional curves that represent patterns across millions (if not billions or trillions) of data points (e.g., what word should come next in a sentence). In my research, I study the case where language models learn that the best way to fit their training data is simply to regurgitate it verbatim. In the case of copyrighted text, this becomes an issue: language models can end up regurgitating copyrighted news articles without fairly compensating or crediting the journalists who originally wrote the text.
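To make “regurgitate it verbatim” concrete, here is a minimal sketch of how one might test a model for this behavior. It assumes the Hugging Face transformers library is available; GPT-2 and the repeated placeholder sentence are stand-ins for a real model and a real passage suspected to be in its training data.

```python
# Minimal sketch: test whether a language model reproduces a passage verbatim
# when prompted with its opening tokens. GPT-2 and the placeholder passage are
# stand-ins, not the models/data studied in the actual research.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal language model would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

passage = "The quick brown fox jumps over the lazy dog. " * 4  # placeholder text
tokens = tokenizer(passage, return_tensors="pt").input_ids[0]

prefix_len = 16                        # prompt the model with the first 16 tokens
prefix = tokens[:prefix_len].unsqueeze(0)
target = tokens[prefix_len:]           # the continuation we hope it does NOT know

# Greedy decoding: if the model has memorized the passage, its most likely
# continuation will reproduce the original text token for token.
generated = model.generate(
    prefix,
    max_new_tokens=len(target),
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)[0][prefix_len:]

n = min(len(generated), len(target))
overlap = (generated[:n] == target[:n]).float().mean().item()
print(f"Fraction of continuation reproduced verbatim: {overlap:.2%}")
```

If the greedy continuation matches the original passage token for token, the model has effectively memorized that text rather than merely learning general patterns of language.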
Can we pinpoint which weights are most responsible for storing different types of information? During training, neural networks compress patterns from the training data and store them in their weights. These trained weights can then process an input and map it to an output. In the case of language modeling, a model may use different sets of weights to store information about different types of text. For example, my research finds that copyrighted news articles are stored in a small subset of a language model’s weights; when a user asks the model a news-related question, it may rely more heavily on this subset of weights to produce copyrighted content.
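As a rough illustration of mapping behavior to weights, the sketch below ranks a model’s parameter tensors by the size of the gradient they receive from the loss on a suspected memorized passage; tensors with consistently large gradients are crude candidates for where that content is stored. This is a generic gradient-magnitude heuristic, not the specific localization method from [1], and it again uses GPT-2 and a placeholder passage as stand-ins.

```python
# Simplified localization sketch: rank parameter tensors by how strongly the
# loss on a (suspected) memorized passage pulls on them. This is a generic
# gradient-magnitude heuristic, not the attribution method developed in [1].
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

passage = "The quick brown fox jumps over the lazy dog. " * 4  # placeholder text
inputs = tokenizer(passage, return_tensors="pt")

# Language-modeling loss on the passage itself (labels = inputs); its gradient
# tells us which weights the passage "pulls on" most strongly.
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()

# Mean absolute gradient per named parameter tensor.
scores = {
    name: param.grad.abs().mean().item()
    for name, param in model.named_parameters()
    if param.grad is not None
}

# The highest-scoring tensors are rough candidates for where the passage lives.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{score:.3e}  {name}")
```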
Can we use our knowledge of this weight-behavior mapping to perform model surgery and correct unwanted behavior? As neural networks become pervasive in human computing applications, it becomes necessary to correct their unwanted behaviors. This is complicated by the fact that the number of weights in neural networks is rapidly increasing into the billions, and even trillions. Training such models in the first place is very computationally expensive. When a model has learned undesirable behavior, such as memorizing a copyrighted article, one might consider retraining it on a dataset with the memorized material removed in an attempt to stem this behavior. However, retraining such models to correct their behavior is extremely expensive, and sometimes flat-out prohibitive. My research overcomes this computational block by selectively intervening on the neural network weights most responsible for content regurgitation. In other words, by carefully mapping model behaviors to specific weights, we can do “model surgery” to either excise or update a small set of “bad” weights, rather than retraining an entire model from scratch (a very costly alternative). In my research, we develop different types of “model surgery” to mitigate a trained language model’s ability to produce copyrighted information.
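Building on the toy setup above, one crude form of “model surgery” is to simply zero out a tiny fraction of the weights that the gradient heuristic flags most strongly, and then re-run the regurgitation test on the edited model. The sketch below is illustrative only; the interventions developed in [1] are more careful than deleting weights outright, and GPT-2 plus the placeholder passage remain stand-ins.

```python
# Toy "model surgery": excise (zero out) a small fraction of the weights that
# receive the largest gradients from the loss on a memorized passage. The
# regurgitation test from the first sketch can then be re-run on the edited
# model. Illustrative only; not the procedure used in [1].
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

passage = "The quick brown fox jumps over the lazy dog. " * 4  # placeholder text
inputs = tokenizer(passage, return_tensors="pt")
model(**inputs, labels=inputs["input_ids"]).loss.backward()

# Rank parameter tensors by mean absolute gradient (as in the sketch above)
# and operate on the single highest-scoring one.
scores = {name: param.grad.abs().mean().item()
          for name, param in model.named_parameters() if param.grad is not None}
target_name = max(scores, key=scores.get)
param = dict(model.named_parameters())[target_name]

with torch.no_grad():
    grads = param.grad.abs().flatten()
    k = max(1, int(0.001 * grads.numel()))   # "excise" the top 0.1% of weights
    top_idx = torch.topk(grads, k).indices
    flat = param.data.flatten()
    flat[top_idx] = 0.0                      # zero the flagged weights
    param.data = flat.view_as(param.data)

print(f"Zeroed {k} weights in {target_name}")
```

In practice, one would also check that the edited model still performs well on unrelated text, since excising weights can damage other capabilities; that trade-off is exactly what makes precise localization valuable.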
In a world that is rapidly adopting neural networks into a wide range of computational applications, it is important to understand how these models work. We have seen Nobel Prizes awarded in the natural sciences for work done on and with neural networks. Millions of people are incorporating language models into their personal and professional lives. Neural-network-controlled cars are being deployed in cities around the world. Raging, neural-network-driven chatbots are permeating virtually every social media platform. Neural networks are part of humanity’s new normal. My work enables artificial intelligence practitioners to understand how data influences neural network behavior, pinpoint the source of different behaviors within model weights, and efficiently correct “bad” behavior. Better interpreting how neural networks work, understanding their limits, and having strategies to guide their behavior are key to a future in which AI models are deployed, regulated, and managed in a safe, useful, and ethical manner.
References
[1] Sakarvadia, Mansi, et al. “Mitigating Memorization In Language Models.” International Conference on Learning Representations (2025).