AccessEval: Benchmarking Disability Bias in Large Language Models Paper • 2509.22703 • Published Sep 22, 2025 • 20
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications Paper • 2509.23879 • Published Sep 28, 2025 • 20
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks Paper • 2509.23673 • Published Sep 28, 2025 • 20
Aligning LLMs for Multilingual Consistency in Enterprise Applications Paper • 2509.23659 • Published Sep 28, 2025 • 20
Article 🐺🐦⬛ LLM Comparison/Test: Phi-4, Qwen2 VL 72B Instruct, Aya Expanse 32B in my updated MMLU-Pro CS benchmark Jan 10, 2025 • 8
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation Paper • 2506.00482 • Published May 31, 2025 • 8