publications | Phillip H. Guo

2024

Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization

Phillip Guo^*, Aaquib Syed^*, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite

2024

We find that a high-level understanding of the model components involved in knowledge retrieval informs significantly more robust model editing without side effects. In comparison, previous automated localization approaches used for editing/unlearning are in a sense more precise but significantly less robust. The preprint version of this paper received a spotlight at the ICML 2024 Mechanistic Interpretability Workshop.

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Abhay Sheshadri^*, Aidan Ewart^*, Phillip Guo^*, Aengus Lynch^*, Cindy Wu^*, and 6 more authors

2024

We apply latent adversarial training to LLMs, allowing us to improve jailbreak robustness, remove backdoors (à la sleeper agents), and robustly unlearn dangerous or unwanted knowledge.

Robust Unlearning via Mechanistic Localizations

Phillip Huang Guo^*, Aaquib Syed^*, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite

In ICML 2024 Workshop on Mechanistic Interpretability, 2024

Selected as a spotlight! In this preprint, we find that high-level manual understanding of various model components in knowledge retrieval informs significantly more robust unlearning with fewer side effects. In comparison, previous automated interpretability and localization approaches used for editing/unlearning are in a sense more precise but significantly less robust.

Eight Methods to Evaluate Robust Unlearning in LLMs

Aengus Lynch^*, Phillip Guo^*, Aidan Ewart^*, Stephen Casper, and Dylan Hadfield-Menell

2024


2023

Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching

James Campbell^*, Richard Ren^*, and Phillip Guo^*

In NeurIPS 2023 Workshop for Socially Responsible Language Modelling Research, 2023

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan^*, Sarah Chen^*, James Campbell^*, Phillip Guo^*, and 16 more authors

2023

We characterize the area of Representation Engineering, an approach to practical interpretability that places population-level representations at the center of analysis. We gain traction on a range of safety-relevant problems by monitoring and manipulating high-level features of truthfulness, memorization, power-seeking, and more.

Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models

Aaquib Syed^*, Phillip Huang Guo^*, and Vijaykaarti Sundarapandiyan^*

In ICLR 2023 Tiny Papers Workshop, 2023

Top 5% of submitted papers, invited to present!


2022

Bandit-Based Multi-Start Strategies for Global Continuous Optimization

Phillip Guo and Michael C. Fu

In 2022 Winter Simulation Conference (WSC), 2022
