publications

publications by category in reverse chronological order. generated by jekyll-scholar.

2024

  1. Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
    Phillip Guo*, Aaquib Syed*, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite
    2024
We find that a high-level understanding of the model components involved in knowledge retrieval enables significantly more robust model editing with fewer side effects. In comparison, previous automated localization approaches used for editing/unlearning are in a sense more precise but significantly less robust. :trophy: The preprint version of this paper received a spotlight at the ICML 2024 Mechanistic Interpretability Workshop.
  2. Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Abhay Sheshadri*, Aidan Ewart*, Phillip Guo*, Aengus Lynch*, Cindy Wu*, and 6 more authors
    2024
We apply latent adversarial training to LLMs, allowing us to improve jailbreak robustness, remove backdoors (à la sleeper agents), and robustly unlearn dangerous/unwanted knowledge.
  3. Robust Unlearning via Mechanistic Localizations
    Phillip Huang Guo*, Aaquib Syed*, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite
    In ICML 2024 Workshop on Mechanistic Interpretability, 2024
:trophy: Selected as a spotlight! In this preprint, we find that a high-level, manual understanding of the roles of various model components in knowledge retrieval informs significantly more robust unlearning with fewer side effects. In comparison, previous automated interpretability and localization approaches used for editing/unlearning are in a sense more precise but significantly less robust.
  4. Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch*, Phillip Guo*, Aidan Ewart*, Stephen Casper, and Dylan Hadfield-Menell
    2024

2023

  1. Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
    James Campbell*, Richard Ren*, and Phillip Guo*
    In NeurIPS 2023 Workshop for Socially Responsible Language Modelling Research, 2023
  2. Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou, Long Phan*, Sarah Chen*, James Campbell*, Phillip Guo*, and 16 more authors
    2023
    We characterize the area of Representation Engineering, an approach to practical interpretability that places population-level representations at the center of analysis. We gain traction on a range of safety-relevant problems by monitoring and manipulating high-level features of truthfulness, memorization, power-seeking, and more.
  3. Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models
Aaquib Syed*, Phillip Huang Guo*, and Vijaykaarti Sundarapandiyan*
    In ICLR 2023 Tiny Papers Workshop, 2023
    :trophy: Top 5% of submitted papers, invited to present!

2022

  1. Bandit-Based Multi-Start Strategies for Global Continuous Optimization
Phillip Guo and Michael C. Fu
    In 2022 Winter Simulation Conference (WSC), 2022