Language models (LMs) can "memorize" information, i.e., encode training data in their weights in such a way that inference-time queries can lead to verbatim regurgitation of that data. This ability to extract training data can be problematic, for example, when data are private or sensitive. In this work, we investigate methods to mitigate memorization: three regularizer-based, three fine-tuning-based, and eleven machine unlearning-based methods, with five of the latter being new methods that we introduce. We also introduce TinyMem, a suite of small, computationally efficient LMs for the rapid development and evaluation of memorization-mitigation methods. We demonstrate that the mitigation methods that we develop using TinyMem can successfully be applied to production-grade LMs, and we find experimentally that: regularizer-based mitigation methods are slow and ineffective at curbing memorization; fine-tuning-based methods are effective at curbing memorization, but overly expensive, especially for retaining higher accuracies; and unlearning-based methods are faster and more effective, allowing for the precise localization and removal of memorized information from LM weights prior to inference. We show, in particular, that our proposed unlearning method, BalancedSubnet, outperforms other mitigation methods at removing memorized information while preserving performance on target tasks.
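As a minimal, hypothetical sketch of the unlearning setting (not the BalancedSubnet algorithm itself), removal of memorized information can be pictured as zeroing out a small, high-attribution subset of weights once a localization step has scored each weight's contribution to the memorized data; here the scoring is assumed to be given:

```python
import torch
import torch.nn as nn

def mask_top_scoring_weights(model: nn.Module,
                             scores: dict[str, torch.Tensor],
                             frac: float = 0.001) -> None:
    """Zero out the highest-scoring fraction of weights (illustrative only).

    `scores` maps parameter names to per-weight attribution scores indicating
    how strongly each weight contributes to regurgitating memorized data;
    computing those scores well is the hard part and is not reproduced here.
    """
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(frac * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in scores:
                # Remove (zero) weights implicated in memorization.
                param.masked_fill_(scores[name] >= threshold, 0.0)
```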
Preprint
SoK: On Finding Common Ground in Loss Landscapes Using Deep Model Merging Techniques
Arham Khan, Todd Nief, Nathaniel Hudson, and 6 more authors
Understanding neural networks is crucial to creating reliable and trustworthy deep learning models. Most contemporary research in interpretability analyzes just one model at a time via causal intervention or activation analysis. Yet, despite their successes, these methods leave significant gaps in our understanding of the training behaviors of neural networks, how their inner representations emerge, and how we can predictably associate model components with task-specific behaviors. Seeking new insights from related fields, here we survey the literature on model merging, a field that aims to combine the abilities of various neural networks by merging their parameters and identifying task-specific model components in the process. We analyze the model merging literature through the lens of loss landscape geometry, an approach that enables us to connect observations from empirical studies on interpretability, security, model merging, and loss landscape analysis to phenomena that govern neural network training and the emergence of their inner representations. To systematize knowledge in this area, we present a novel taxonomy of model merging techniques organized by their core algorithmic principles. Additionally, we distill repeated empirical observations from the literature in these fields into characterizations of four major aspects of loss landscape geometry: mode convexity, determinism, directedness, and connectivity. We argue that, by improving our understanding of the principles underlying model merging and loss landscape geometry, this work contributes to the goal of ensuring secure and trustworthy machine learning in practice.
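For readers unfamiliar with model merging, the simplest merging operator discussed in this literature is uniform parameter averaging of architecturally identical models; the sketch below is illustrative only and glosses over the alignment and interference issues that the surveyed techniques address:

```python
import copy
import torch
import torch.nn as nn

def average_merge(models: list[nn.Module]) -> nn.Module:
    """Uniformly average the parameters of architecturally identical models.

    In practice this baseline works only when the models share a pretrained
    initialization and lie in a connected, low-loss region of the landscape.
    """
    merged = copy.deepcopy(models[0])
    others = [dict(m.named_parameters()) for m in models[1:]]
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([param] + [o[name] for o in others])
            param.copy_(stacked.mean(dim=0))  # element-wise mean across models
    return merged
```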
M.S. Thesis
Towards Interpreting Language Models: A Case Study in Multi-Hop Reasoning
Answering multi-hop reasoning questions requires retrieving and synthesizing information from diverse sources. Language models (LMs) struggle to perform such reasoning consistently. We propose an approach to pinpoint and rectify multi-hop reasoning failures through targeted memory injections on LM attention heads. First, we analyze the per-layer activations of GPT-2 models in response to single- and multi-hop prompts. We then propose a mechanism that allows users to inject relevant prompt-specific information, which we refer to as “memories,” at critical LM locations during inference. By thus enabling the LM to incorporate additional relevant information during inference, we enhance the quality of multi-hop prompt completions. We empirically show that a simple, efficient, and targeted memory injection into a key attention layer often increases the probability of the desired next token in multi-hop tasks by up to 424%. We observe that small subsets of attention heads can significantly impact the model prediction during multi-hop reasoning. To more faithfully interpret these heads, we develop Attention Lens: an open-source tool that translates the outputs of attention heads into vocabulary tokens via learned transformations called lenses. We demonstrate the use of lenses to reveal how a model arrives at its answer and use them to localize sources of model failure, such as biased and malicious language generation.
2023
BlackboxNLP
Memory Injections: Correcting Multi-Hop Reasoning Failures during Inference in Transformer-Based Language Models
Mansi Sakarvadia, Aswathy Ajith, Arham Khan, and 5 more authors
Answering multi-hop reasoning questions requires retrieving and synthesizing information from diverse sources. Large Language Models (LLMs) struggle to perform such reasoning consistently. Here we propose an approach to pinpoint and rectify multi-hop reasoning failures through targeted memory injections on LLM attention heads. First, we analyze the per-layer activations of GPT-2 models in response to single- and multi-hop prompts. We then propose a mechanism that allows users to inject pertinent prompt-specific information, which we refer to as "memories," at critical LLM locations during inference. By thus enabling the LLM to incorporate additional relevant information during inference, we enhance the quality of multi-hop prompt completions. We show empirically that a simple, efficient, and targeted memory injection into a key attention layer can often increase the probability of the desired next token in multi-hop tasks by up to 424%.
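To make the mechanism concrete, the following is a minimal, hypothetical sketch of a memory injection using a Hugging Face GPT-2 model and a PyTorch forward hook; the layer index, scaling factor, injected text, and prompt are illustrative placeholders rather than the settings studied in the paper:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Build a "memory" vector from the embedding of a hint token (illustrative).
memory_ids = tok(" France", return_tensors="pt").input_ids
memory_vec = model.transformer.wte(memory_ids).mean(dim=1)   # (1, hidden_dim)

def inject_memory(module, inputs, output, scale=4.0):
    # GPT-2 attention modules return a tuple; element 0 is the attention output.
    hidden = output[0] + scale * memory_vec.to(output[0].dtype)
    return (hidden,) + tuple(output[1:])

layer = 9   # which block to hook is a tunable choice; 9 is arbitrary here
handle = model.transformer.h[layer].attn.register_forward_hook(inject_memory)

prompt = "The Eiffel Tower is located in the country whose capital is"
with torch.no_grad():
    next_id = model(**tok(prompt, return_tensors="pt")).logits[0, -1].argmax().item()
print(tok.decode(next_id))

handle.remove()   # restore normal, un-injected inference
```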
ATTRIB
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism
Mansi Sakarvadia, Arham Khan, Aswathy Ajith, and 5 more authors
2023
Accepted to the Workshop on Attributing Model Behavior at Scale (ATTRIB) @ NeurIPS.
Transformer-based Large Language Models (LLMs) are the state-of-the-art for natural language tasks. Much recent work has attempted to decode the internal mechanisms by which LLMs arrive at their final predictions for text completion tasks, including by reverse-engineering the role of linear layers. Yet little is known about the role of attention heads in producing the final token prediction. We propose the Attention Lens, a tool that enables researchers to translate the outputs of attention heads into vocabulary tokens via learned attention head-specific transformations called lenses. Preliminary findings from our trained lenses indicate that attention heads play highly specialized and specific roles in language models.
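As a rough illustration of what such a lens looks like, the skeleton below (a sketch, not the released implementation; dimensions and names are assumptions) attaches one learned linear map per attention head that projects that head's output into vocabulary space:

```python
import torch
import torch.nn as nn

class AttentionLens(nn.Module):
    """One learned linear lens per attention head, mapping that head's output
    directly into vocabulary space. Illustrative skeleton only; the trained
    lenses and their training objective are described in the paper and tool."""

    def __init__(self, num_heads: int, head_dim: int, vocab_size: int):
        super().__init__()
        self.lenses = nn.ModuleList(
            nn.Linear(head_dim, vocab_size, bias=False) for _ in range(num_heads)
        )

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, seq, num_heads, head_dim)
        logits = [lens(head_outputs[:, :, i]) for i, lens in enumerate(self.lenses)]
        return torch.stack(logits, dim=2)   # (batch, seq, num_heads, vocab_size)

# Reading off each head's preferred tokens at the final position (toy example):
lens = AttentionLens(num_heads=12, head_dim=64, vocab_size=50257)
head_out = torch.randn(1, 8, 12, 64)                   # stand-in activations
top5 = lens(head_out)[0, -1].topk(k=5, dim=-1)          # top-5 scores and ids per head
```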
BDCAT
Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision
Nathaniel Hudson, J. Gregory Pauloski, Matt Baughman, and 13 more authors
In IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT2023), 2023
Deep learning methods are transforming research, enabling new techniques, and ultimately leading to new discoveries. As the demand for more capable AI models continues to grow, we are now entering an era of Trillion Parameter Models (TPM), or models with more than a trillion parameters, such as Huawei’s PanGu-Σ. We describe a vision for the ecosystem of TPM users and providers that caters to the specific needs of the scientific community. We then outline the significant technical challenges and open problems in system design for serving TPMs to enable scientific research and discovery. Specifically, we describe the requirements of a comprehensive software stack and interfaces to support the diverse and flexible requirements of researchers.
e-Science
Lazy Python Dependency Management in Large-Scale Systems
Alok Kamatar, Mansi Sakarvadia, Valerie Hayot-Sasson, and 2 more authors
In 2023 IEEE 19th International Conference on e-Science (e-Science), 2023
Python has become the language of choice for managing many scientific applications. However, when distributing a Python application, it is necessary that all application dependencies be distributed and available in the target execution environment. A specific consequence is that Python workflows suffer from slow scale out due to the time required to import dependencies. We describe ProxyImports, a method to package and distribute Python dependencies in a lazy fashion while remaining transparent and easy to use. Using ProxyImports, Python packages are loaded only once (e.g., by a workflow head node) and are transferred asynchronously to compute nodes. We evaluate our implementation on the Perlmutter and Theta supercomputers and in an HPC cloud-bursting scenario. Our experiments show that ProxyImports significantly reduces the average time to import large modules across an HPC system and demonstrate that this method can be used easily to distribute user packages to cloud resources. We conclude that ProxyImports improves application runtime, reduces contention on metadata servers, and facilitates runtime portability of Python applications.
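A stand-alone sketch of the underlying lazy-loading idea is shown below; it defers a module's import until first attribute access, whereas the actual ProxyImports implementation additionally packages the module's files and ships them to compute nodes via proxy objects, which is not reproduced here:

```python
import importlib
import types

class LazyModule(types.ModuleType):
    """Defer importing a module until one of its attributes is first accessed.

    Minimal sketch of lazy loading only; it does not package or transfer the
    module's files to remote workers as ProxyImports does.
    """

    def __init__(self, name: str):
        super().__init__(name)
        self._module = None

    def __getattr__(self, attr: str):
        if self._module is None:
            # Pay the import cost only on first real use.
            self._module = importlib.import_module(self.__name__)
        return getattr(self._module, attr)

np = LazyModule("numpy")   # no import cost paid yet
print(np.arange(3))        # numpy is actually imported here, on first use
```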
2020
MICCAI
Atypical Neonate Extra-axial CSF is Associated with Reduced Cognitive Development at Age 1 year (poster)
Mansi Sakarvadia, Rui Li, SunHyung Kim, and 6 more authors
Perinatal Preterm and Pediatric Image Analysis workshop at the Medical Image Computing and Computer Assisted Interventions conference, 2020
We aim to assess if enlarged extra-axial cerebrospinal fluid (EA-CSF) volume at neonatal age is associated with a child’s performance on the Mullen Scales of Early Learning (MSEL) at 12 and 24 months of age. 3T MRI scans were acquired from 651 infants at neonate age (20.8 ± 8.9 postnatal days). EA-CSF and global tissue volumes were computed via a new tool called AutoEACSF. The MSEL was administered to these infants at 12 and 24 months, measuring gross motor ability and four domains that comprise an overall cognitive composite score: fine motor, visual reception, receptive language, and expressive language. General linear models including intracranial cavity volume, gestational age at birth, maternal education, and sex as covariates were employed. The subgroup of infants whose EA-CSF volumes measured in the top 5th percentile (i.e., 2 SDs above the mean; n=33) displayed significant negative correlations between elevated EA-CSF at neonatal age and expressive language (p=0.001) and cognitive composite scores (p=0.016) at 12 months. However, at 24 months of age, these associations were no longer significant. No significant associations were found for subjects with EA-CSF volumes below the top 10th percentile. This study finds that atypically high levels of EA-CSF volume shortly after birth are associated with lower expressive language and overall cognitive ability at 12 months of age. These results suggest that there may be a pathological threshold of high EA-CSF volume that could serve as an early biomarker of a child’s reduced cognitive ability at 12 months.
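As a purely illustrative sketch of the analysis structure (the file name and column names below are hypothetical placeholders, not the study’s data), a general linear model with the listed covariates could be specified as:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame standing in for the study's variables.
df = pd.read_csv("eacsf_mullen_12mo.csv")

# Outcome regressed on neonatal EA-CSF volume, adjusting for the covariates
# named in the abstract (intracranial volume, gestational age at birth,
# maternal education, and sex).
model = smf.ols(
    "expressive_language ~ eacsf_volume + intracranial_volume"
    " + gestational_age + maternal_education + C(sex)",
    data=df,
).fit()
print(model.summary())
```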