Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos Paper • 2501.04001 • Published Jan 7 • 42
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token Paper • 2501.03895 • Published Jan 7 • 49
An Empirical Study of Autoregressive Pre-training from Videos Paper • 2501.05453 • Published Jan 9 • 37
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training Paper • 2501.07556 • Published Jan 13 • 5
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos Paper • 2501.12375 • Published Jan 21 • 22
Intuitive physics understanding emerges from self-supervised pretraining on natural videos Paper • 2502.11831 • Published 6 days ago • 13