Budget-aware Test-time Scaling via Discriminative Verification Paper • 2510.14913 • Published Oct 16, 2025 • 4
Predicting Task Performance with Context-aware Scaling Laws Paper • 2510.14919 • Published Oct 16, 2025 • 3
RepIt: Representing Isolated Targets to Steer Language Models Paper • 2509.13281 • Published Sep 16, 2025 • 4
LLM Interpretability Collection Interpretability papers from Prof. Chenguang Wang's lab at UCSC • 3 items • Updated Sep 19, 2025
COSMIC: Generalized Refusal Direction Identification in LLM Activations Paper • 2506.00085 • Published May 30, 2025 • 2
RepIt: Representing Isolated Targets to Steer Language Models Paper • 2509.13281 • Published Sep 16, 2025 • 4
SteeringControl: Holistic Evaluation of Alignment Steering in LLMs Paper • 2509.13450 • Published Sep 16, 2025 • 7
SteeringSafety Collection A benchmark for evaluating effectiveness and entanglement in representation steering across seven safety-relevant perspectives • 2 items • Updated Oct 20, 2025 • 1