Dress&Dance: Dress up and Dance as You Like It - Technical Preview
Abstract
A video diffusion framework generates high-quality virtual try-on videos using a conditioning network that unifies multi-modal inputs, outperforming existing solutions.
We present Dress&Dance, a video diffusion framework that generates high-quality, 5-second, 24 FPS virtual try-on videos at 1152×720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires only a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous top-and-bottom try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous data, combining limited video data with a larger, more readily available image dataset, in a multistage progressive manner. Dress&Dance outperforms existing open-source and commercial solutions and enables a high-quality, flexible try-on experience.
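The abstract only names CondNet's role, so as a rough illustration of how attention can unify heterogeneous conditioning signals, here is a minimal PyTorch sketch of a cross-attention conditioning block. All module names, dimensions, and the token-concatenation scheme are hypothetical assumptions for illustration; the paper's actual CondNet architecture is not specified in this page.

```python
# Minimal sketch of an attention-based multi-modal conditioning block.
# Everything here (names, dims, concatenation scheme) is an assumption,
# not the paper's actual CondNet design.
import torch
import torch.nn as nn

class MultiModalCondBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Per-modality projections into a shared token space (assumed design).
        self.text_proj = nn.Linear(dim, dim)
        self.image_proj = nn.Linear(dim, dim)
        self.video_proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        # Denoiser latents attend over the unified conditioning tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, latents, text_tok, garment_tok, motion_tok):
        # Unify text, garment-image, and reference-video tokens by concatenation.
        cond = torch.cat([
            self.text_proj(text_tok),
            self.image_proj(garment_tok),
            self.video_proj(motion_tok),
        ], dim=1)
        cond = self.norm(cond)
        # Cross-attention: video latents (queries) read from conditions (keys/values).
        out, _ = self.cross_attn(latents, cond, cond)
        return latents + out  # residual update of the denoiser latents

# Example shapes: batch of 2, 256 latent tokens, varying condition lengths.
latents = torch.randn(2, 256, 768)
text_tok = torch.randn(2, 77, 768)
garment_tok = torch.randn(2, 64, 768)
motion_tok = torch.randn(2, 128, 768)
out = MultiModalCondBlock()(latents, text_tok, garment_tok, motion_tok)
print(out.shape)  # torch.Size([2, 256, 768])
```

Concatenating per-modality tokens into one key/value sequence lets a single cross-attention layer read all conditions jointly, which matches the abstract's description of unifying text, image, and video inputs; how the real CondNet fuses modalities and injects motion may differ.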
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MuGa-VTON: Multi-Garment Virtual Try-On via Diffusion Transformers with Prompt Customization (2025)
- ChoreoMuse: Robust Music-to-Dance Video Generation with Style Transfer and Beat-Adherent Motion (2025)
- Animate-X++: Universal Character Image Animation with Dynamic Backgrounds (2025)
- HairShifter: Consistent and High-Fidelity Video Hair Transfer via Anchor-Guided Animation (2025)
- DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions (2025)
- Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars (2025)
- Democratizing High-Fidelity Co-Speech Gesture Video Generation (2025)