publications
publications by categories in reverse chronological order. generated by jekyll-scholar.
2024
- Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs. 2024. We apply latent adversarial training to LLMs, allowing us to improve jailbreak robustness, remove backdoors (à la sleeper agents), and robustly unlearn dangerous or unwanted knowledge.
- Robust Unlearning via Mechanistic Localizations. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. Selected as a spotlight! In this preprint, we find that a high-level, manual understanding of the model components involved in knowledge retrieval enables significantly more robust unlearning with fewer side effects. In comparison, previous automated interpretability and localization approaches used for editing/unlearning are in a sense more precise but significantly less robust. More work to come!
2023
- Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching. In NeurIPS 2023 Workshop for Socially Responsible Language Modelling Research, 2023.
- Representation Engineering: A Top-Down Approach to AI Transparency. 2023. We characterize the area of Representation Engineering, an approach to practical interpretability that places population-level representations at the center of analysis. We gain traction on a range of safety-relevant problems by monitoring and manipulating high-level features of truthfulness, memorization, power-seeking, and more.
- Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models. In ICLR 2023 Tiny Papers Workshop, 2023. Top 5% of submitted papers, invited to present!
2022
- Bandit-Based Multi-Start Strategies for Global Continuous Optimization. In 2022 Winter Simulation Conference (WSC), 2022.