SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published 3 days ago • 99
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation Paper • 2502.09838 • Published 10 days ago • 9
You Do Not Fully Utilize Transformer's Representation Capacity Paper • 2502.09245 • Published 10 days ago • 30
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Paper • 2502.11089 • Published 7 days ago • 133
Music2Latent2: Audio Compression with Summary Embeddings and Autoregressive Decoding Paper • 2501.17578 • Published 25 days ago • 1
iFormer: Integrating ConvNet and Transformer for Mobile Application Paper • 2501.15369 • Published 29 days ago • 12
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine Paper • 2408.02900 • Published Aug 6, 2024 • 28
The Geometry of Tokens in Internal Representations of Large Language Models Paper • 2501.10573 • Published Jan 17 • 9
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding Paper • 2501.07783 • Published Jan 14 • 7
Generalized Gaussian Model for Learned Image Compression Paper • 2411.19320 • Published Nov 28, 2024 • 1
I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token Paper • 2412.06676 • Published Dec 9, 2024 • 9
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model Paper • 2411.17459 • Published Nov 26, 2024 • 11