Phillip H. Guo
Hi! I’m Phillip, an undergraduate student at the University of Maryland. I’m currently working part-time at Gray Swan AI on automated agent red-teaming, and this past summer I was a quant trading intern at Jane Street.
My current research interests span LLM adversarial robustness and interpretability. In adversarial robustness, I’m working on latent adversarial training to prevent jailbreaks and backdoors and to unlearn dangerous knowledge. In interpretability, I’m thinking about how to use advances in model understanding for more robust monitoring, unlearning, and steering.
I was previously an ML Alignment and Theory (MATS) scholar with Stephen Casper, a participant in ARENA 2.0, and a 2022 Atlas Fellow.
news
- Jul 23, 2024: I’ll be in Vienna at the ICML NextGenAISafety and Mech Interp workshops this week! I’ll be presenting a spotlight on our Robust Mechanistic Unlearning paper at the MI workshop and a poster at both workshops.
selected publications
- Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization (2024). We find that a high-level understanding of the model components involved in knowledge retrieval enables significantly more robust model editing without side effects. By comparison, the automated localization approaches previously used for editing and unlearning are more precise in a narrow sense but significantly less robust. The preprint version of this paper received a spotlight at the ICML 2024 Mechanistic Interpretability Workshop.
- Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs (2024). We apply latent adversarial training to LLMs, allowing us to improve jailbreak robustness, remove backdoors (à la sleeper agents), and robustly unlearn dangerous or unwanted knowledge.
- Representation Engineering: A Top-Down Approach to AI Transparency (2023). We characterize the area of Representation Engineering, an approach to practical interpretability that places population-level representations at the center of analysis. We gain traction on a range of safety-relevant problems by monitoring and manipulating high-level features of truthfulness, memorization, power-seeking, and more.