Mechanistic Interpretability Benchmark

university

https://mib-bench.github.io

AI & ML interests

Principled evaluation of mechanistic interpretability methods.

Recent Activity

hij authored a paper 9 days ago

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

hij authored a paper 9 days ago

LLMs Encode Harmfulness and Refusal Separately

hij authored a paper 9 days ago

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors


mib-bench's models (3)

mib-bench/mib-circuits-example

mib-bench/mib-causalvariable-example

mib-bench/interpbench