Defending Against Unforeseen Failure Modes with Latent Adversarial Training. arXiv:2403.05030, published Mar 8, 2024.
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs. arXiv:2407.15549, published Jul 22, 2024.
Black-Box Access is Insufficient for Rigorous AI Audits. arXiv:2401.14446, published Jan 25, 2024.
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? arXiv:2312.03729, published Nov 27, 2023.
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. arXiv:2311.03348, published Nov 6, 2023.
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv:2307.15217, published Jul 27, 2023.
Explore, Establish, Exploit: Red Teaming Language Models from Scratch. arXiv:2306.09442, published Jun 15, 2023.