HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
Abstract
Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While Large Multimodal Models (LMMs) have demonstrated impressive capabilities across various tasks, existing benchmarks lack comprehensive evaluation of their diagram interpretation and reasoning abilities, particularly in coding contexts. We present HumanEval-V, a rigorous benchmark of human-annotated coding tasks that spans six task types and evaluates diverse visual reasoning capabilities. Each task features carefully crafted diagrams paired with function signatures and test cases, employing novel code generation tasks to thoroughly assess models' diagram comprehension. Through extensive experiments with 22 LMMs, we find that even top-performing models achieve modest success rates, with Claude 3.5 Sonnet reaching only 36.8% pass@1, highlighting substantial room for improvement. Our analysis reveals that current LMMs struggle with spatial transformations, topological relationships, and dynamic patterns that humans find intuitive. These findings provide valuable insights for advancing LMMs' visual reasoning abilities. We have open-sourced our code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
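To make the task format concrete, here is a hypothetical sketch of what a HumanEval-V-style item might look like. The function name, the diagram's assumed content, and the tests below are invented for illustration and are not drawn from the benchmark itself.

```python
# Hypothetical HumanEval-V-style task (invented for illustration, not an actual benchmark item).
# The task's diagram (omitted here) is assumed to depict rotating a grid 90 degrees clockwise.

def rotate_grid(grid: list[list[int]]) -> list[list[int]]:
    """Transform the grid as depicted in the task diagram."""
    # The body is what the LMM must generate from the diagram; one possible solution:
    return [list(row) for row in zip(*grid[::-1])]

# Ground-truth test cases used to judge the generated code (pass/fail).
def check(candidate):
    assert candidate([[1, 2], [3, 4]]) == [[3, 1], [4, 2]]
    assert candidate([[5, 6, 7]]) == [[5], [6], [7]]

check(rotate_grid)
```

The model sees only the diagram, the function signature, and the docstring; the test cases are held out and determine whether a generated solution counts as a pass.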
Community
We introduce HumanEval-V, a lightweight visual understanding and reasoning benchmark for evaluating LMMs through coding tasks. It is the first benchmark in which all of the visual information plays an essential role in solving the coding tasks.
We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1 (a sketch of the pass@k estimator follows below).
Homepage and leaderboard: https://humaneval-v.github.io/
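For reference, pass@1 and pass@10 are presumably computed with the standard unbiased pass@k estimator from the original HumanEval paper; the sketch below is ours and is not taken from the HumanEval-V repository.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled solutions passes,
    estimated from n generated samples of which c passed all tests."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any draw of k samples contains a correct one
    # 1 - C(n-c, k) / C(n, k), expanded as a product for numerical stability
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 10 samples per task, 2 of which pass the tests.
print(round(pass_at_k(10, 2, 1), 3))   # 0.2
print(round(pass_at_k(10, 2, 10), 3))  # 1.0
```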
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection (2024)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models (2024)
- MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models (2024)
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark (2024)
- ING-VP: MLLMs cannot Play Easy Vision-based Games Yet (2024)