LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework Paper • 2507.04723 • Published Jul 7
AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs Paper • 2507.05687 • Published Jul 8
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning Paper • 2505.08054 • Published May 12
SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions Paper • 2506.00643 • Published May 31