Submitted by akhaliq 61 MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models · 12 authors 3
Submitted by akhaliq 55 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters · 4 authors 3
Submitted by akhaliq 39 An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion · 4 authors 3
Submitted by akhaliq 28 MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine · 11 authors 2
Submitted by akhaliq 22 IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts · 6 authors 2
Submitted by akhaliq 15 CoverBench: A Challenging Benchmark for Complex Claim Verification · 8 authors 2
Submitted by akhaliq 11 ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer · 13 authors 2
Submitted by Bowieee 10 StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation · 7 authors 2
Submitted by MarkWang 4 AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation · 7 authors 2