Defending Against Unforeseen Failure Modes with Latent Adversarial Training. arXiv:2403.05030, published Mar 8, 2024.
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs. arXiv:2407.15549, published Jul 22, 2024.
Black-Box Access is Insufficient for Rigorous AI Audits. arXiv:2401.14446, published Jan 25, 2024.
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? arXiv:2312.03729, published Nov 27, 2023.
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. arXiv:2311.03348, published Nov 6, 2023.
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv:2307.15217, published Jul 27, 2023.
Explore, Establish, Exploit: Red Teaming Language Models from Scratch. arXiv:2306.09442, published Jun 15, 2023.