Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

Jinhui Yi*, Syed Talal Wasim*, Yanan Luo*, Muzammal Naseer, Juergen Gall

*Equal Contribution

University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence; Khalifa University

| Paper | Code |

We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders while using only 45M parameters for visual processing - at least a 6.5x reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention and separate mechanisms for modeling frame-level and video-level relationships. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks. The fine-grained video question-answering evaluation demonstrates our model's effectiveness, outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key aspects like correctness and temporal understanding. Extensive ablation studies validate our architectural choices and demonstrate the effectiveness of our spatio-temporal modeling approach while achieving 3-4x faster processing speeds than previous methods.

Model Weights

We release the pretrained and instruction-tuned weights of Video-Panda in this repository.

โœ’๏ธ Citation

If Video-Panda is helpful for your research, please consider star โญ and citation ๐Ÿ“ :

@article{yi2024video-panda,
    author    = {Jinhui Yi* and Syed Talal Wasim* and Yanan Luo* and Muzammal Naseer and Juergen Gall},
    title     = {Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models},
    journal   = {arXiv preprint, arXiv:2412.18609},
    year      = {2024},
}
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The HF Inference API does not support video-text-to-text models for transformers library.