Phillip H. Guo
Hi! I’m Phillip, an undergraduate student at the University of Maryland. I’m currently working part-time at Gray Swan AI on automated agent red-teaming, and this past summer I was a quant trading intern at Jane Street.
My current research interests span LLM adversarial robustness and interpretability. In adversarial robustness, I’m working on latent adversarial training to prevent jailbreaks and backdoors and to unlearn dangerous knowledge. In interpretability, I’m thinking about how to use advances in model understanding for more robust monitoring, unlearning, and steering.
I was previously an ML Alignment and Theory (MATS) scholar with Stephen Casper, a participant in ARENA 2.0, and a 2022 Atlas Fellow.
news
- Jul 23, 2024: I’ll be in Vienna at the ICML NextGenAISafety and Mech Interp workshops this week! I’ll be presenting a spotlight on our Robust Mechanistic Unlearning paper at the MI workshop and a poster at both workshops.
selected publications
- Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization (2024). We find that a high-level understanding of the model components involved in knowledge retrieval enables significantly more robust model editing without side effects. By comparison, the automated localization approaches previously used for editing and unlearning are more precise in a narrow sense but significantly less robust. The preprint version of this paper received a spotlight at the ICML 2024 Mechanistic Interpretability Workshop.
- Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs (2024). We apply latent adversarial training to LLMs, allowing us to improve jailbreak robustness, remove backdoors (à la sleeper agents), and robustly unlearn dangerous or unwanted knowledge.
- Representation Engineering: A Top-Down Approach to AI Transparency (2023). We characterize the area of Representation Engineering, an approach to practical interpretability that places population-level representations at the center of analysis. We gain traction on a range of safety-relevant problems by monitoring and manipulating high-level features of truthfulness, memorization, power-seeking, and more.