Phillip H. Guo
Hi! I’m Phillip, an undergraduate student at the University of Maryland. I work on AI safety and alignment, and this past summer I was a quant trading intern at Jane Street.
My current research interests are a mixture of LLM adversarial robustness and interpretability. In adversarial robustness, I’m working on latent adversarial training to prevent jailbreaks and backdoors and to unlearn dangerous knowledge. In interpretability, I’m thinking about how to use advances in model understanding for more robust unlearning, anomaly detection, and model feature steering.
I was previously an ML Alignment and Theory (MATS) scholar with Stephen Casper, a participant in ARENA 2.0, and a 2022 Atlas Fellow.
news
Jul 23, 2024: I’ll be in Vienna at the ICML NextGenAISafety and Mech Interp workshops this week! I’ll be presenting a spotlight on our Robust Mechanistic Unlearning paper at the MI workshop and a poster at both workshops.
selected publications
- Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs (2024). We apply latent adversarial training to LLMs, allowing us to improve jailbreak robustness, remove backdoors (à la sleeper agents), and robustly unlearn dangerous or unwanted knowledge.
- Robust Unlearning via Mechanistic Localizations. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. Selected as a spotlight! In this preprint, we find that a high-level manual understanding of the model components involved in knowledge retrieval enables significantly more robust unlearning with fewer side effects. By comparison, previous automated interpretability and localization approaches used for editing and unlearning are in a sense more precise but significantly less robust. More work to come!
- Representation Engineering: A Top-Down Approach to AI Transparency (2023). We characterize the area of Representation Engineering, an approach to practical interpretability that places population-level representations at the center of analysis. We gain traction on a range of safety-relevant problems by monitoring and manipulating high-level features of truthfulness, memorization, power-seeking, and more.