MM-RLHF: The Next Step Forward in Multimodal LLM Alignment Paper • 2502.10391 • Published 9 days ago • 29
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper • 2412.04424 • Published Dec 5, 2024 • 60
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding Paper • 2412.00493 • Published Nov 30, 2024 • 17
Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published Dec 13, 2024 • 92
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper • 2412.10360 • Published Dec 13, 2024 • 140
SCBench: A KV Cache-Centric Analysis of Long-Context Methods Paper • 2412.10319 • Published Dec 13, 2024 • 10
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation Paper • 2412.03069 • Published Dec 4, 2024 • 31
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper • 2412.04467 • Published Dec 5, 2024 • 107
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases Paper • 2412.04862 • Published Dec 6, 2024 • 50
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models Paper • 2409.13592 • Published Sep 20, 2024 • 50
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality Paper • 2405.21060 • Published May 31, 2024 • 64