publications

publications by category in reverse chronological order. generated by jekyll-scholar.

2024

  1. Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
    Abhay Sheshadri*, Aidan Ewart*, Phillip Guo*, Aengus Lynch*, Cindy Wu*, and 6 more authors
    2024
    We apply latent adversarial training to LLMs, allowing us to improve jailbreak robustness, remove backdoors (a la sleeper agents), and unlearn dangerous/unwanted knowledge robustly.
  2. Robust Unlearning via Mechanistic Localizations
    Phillip Huang Guo*, Aaquib Syed*, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite
    In ICML 2024 Workshop on Mechanistic Interpretability, 2024
    🏆 Selected as a spotlight! In this preprint, we find that a high-level, manual understanding of how various model components contribute to knowledge retrieval enables significantly more robust unlearning with fewer side effects. In comparison, previous automated interpretability and localization approaches used for editing/unlearning are in a sense more precise but significantly less robust. More work to come!
  3. Eight Methods to Evaluate Robust Unlearning in LLMs
    Aengus Lynch*, Phillip Guo*, Aidan Ewart*, Stephen Casper, and Dylan Hadfield-Menell
    2024

2023

  1. Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
    James Campbell*, Richard Ren*, and Phillip Guo*
    In NeurIPS 2023 Workshop for Socially Responsible Language Modelling Research, 2023
  2. Representation Engineering: A Top-Down Approach to AI Transparency
    Andy Zou, Long Phan*, Sarah Chen*, James Campbell*, Phillip Guo*, and 16 more authors
    2023
    We characterize the area of Representation Engineering, an approach to practical interpretability that places population-level representations at the center of analysis. We gain traction on a range of safety-relevant problems by monitoring and manipulating high-level features of truthfulness, memorization, power-seeking, and more.
  3. Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models
    Aaquib Syed*, Phillip Huang Guo*, and Vijaykaarti Sundarapandiyan*
    In ICLR 2023 Tiny Papers Workshop, 2023
    🏆 Top 5% of submitted papers, invited to present!

2022

  1. Bandit-Based Multi-Start Strategies for Global Continuous Optimization
    Phillip Guo, and Michael C. Fu
    In 2022 Winter Simulation Conference (WSC), 2022