FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games Paper • 2509.01052 • Published 5 days ago • 18
ChartCap: Mitigating Hallucination of Dense Chart Captioning Paper • 2508.03164 • Published Aug 5 • 6
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games Paper • 2506.03610 • Published Jun 4 • 9
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates Paper • 2505.22943 • Published May 28 • 4
Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type Paper • 2502.06086 • Published Feb 10 • 1