Planning with Reasoning using Vision Language World Model • Paper • 2509.02722 • Published 3 days ago • 13
Few-shot Adaptation of Multi-modal Foundation Models: A Survey • Paper • 2401.01736 • Published Jan 3, 2024
High-Dimension Human Value Representation in Large Language Models • Paper • 2404.07900 • Published Apr 11, 2024 • 1
VirtualConductor: Music-driven Conducting Video Generation System • Paper • 2108.04350 • Published Jul 28, 2021
Taming Diffusion Models for Music-driven Conducting Motion Generation • Paper • 2306.10065 • Published Jun 15, 2023
ProtoCLIP: Prototypical Contrastive Language Image Pretraining • Paper • 2206.10996 • Published Jun 22, 2022
Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model • Paper • 2309.11000 • Published Sep 20, 2023 • 2
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing • Paper • 2306.11029 • Published Jun 19, 2023 • 1