Measuring what Matters: Construct Validity in Large Language Model Benchmarks Paper • 2511.04703 • Published Nov 3 • 7
Clinical knowledge in LLMs does not translate to human interactions Paper • 2504.18919 • Published Apr 26 • 26
LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation Paper • 2503.02972 • Published Mar 4 • 25