Bridging Your Imagination with Audio-Video Generation via a Unified Director
Abstract
A unified director model built on a Mixture-of-Transformers architecture and trained with interleaved, then disentangled, learning generates coherent video scripts and consistent keyframes within a single framework.
Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ a Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first perform Interleaved Concept Learning, which uses interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.
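The abstract names a Mixture-of-Transformers (MoT) backbone that unifies text and image generation over one interleaved sequence. The sketch below is a rough illustration of that general idea, not the authors' released code: each token is routed through modality-specific "expert" projection and feed-forward weights, while self-attention runs globally over the full interleaved text-image sequence. All class names, dimensions, and the token-routing scheme here are assumptions made for illustration.

```python
# Minimal MoT-style block sketch (illustrative assumptions, not UniMAGE's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoTBlock(nn.Module):
    """Transformer block with per-modality 'expert' weights and global self-attention."""

    def __init__(self, d_model=1024, n_heads=16, modalities=("text", "image")):
        super().__init__()
        self.n_heads = n_heads
        self.modalities = modalities
        # Modality-specific ("expert") parameters: projections, feed-forward, norms.
        self.qkv = nn.ModuleDict({m: nn.Linear(d_model, 3 * d_model) for m in modalities})
        self.out = nn.ModuleDict({m: nn.Linear(d_model, d_model) for m in modalities})
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
            for m in modalities
        })
        self.norm1 = nn.ModuleDict({m: nn.LayerNorm(d_model) for m in modalities})
        self.norm2 = nn.ModuleDict({m: nn.LayerNorm(d_model) for m in modalities})

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq), 0 = text token, 1 = image token.
        B, S, D = x.shape
        q, k, v = torch.zeros_like(x), torch.zeros_like(x), torch.zeros_like(x)
        # Route each token through the projections of its own modality.
        for i, m in enumerate(self.modalities):
            mask = (modality_ids == i).unsqueeze(-1)
            qm, km, vm = self.qkv[m](self.norm1[m](x)).chunk(3, dim=-1)
            q = torch.where(mask, qm, q)
            k = torch.where(mask, km, k)
            v = torch.where(mask, vm, v)
        # Global causal self-attention over the full interleaved text-image sequence.
        def split(t):
            return t.view(B, S, self.n_heads, D // self.n_heads).transpose(1, 2)
        attn = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
        attn = attn.transpose(1, 2).reshape(B, S, D)
        h = torch.zeros_like(x)
        for i, m in enumerate(self.modalities):
            mask = (modality_ids == i).unsqueeze(-1)
            h = torch.where(mask, self.out[m](attn), h)
        x = x + h
        # Modality-specific feed-forward experts.
        for i, m in enumerate(self.modalities):
            mask = (modality_ids == i).unsqueeze(-1)
            x = torch.where(mask, x + self.ffn[m](self.norm2[m](x)), x)
        return x


# Hypothetical usage: a batch of interleaved script-text and keyframe tokens.
block = MoTBlock(d_model=256, n_heads=8)
tokens = torch.randn(2, 10, 256)
modality_ids = torch.tensor([[0, 0, 0, 1, 1, 0, 0, 1, 1, 1]] * 2)
out = block(tokens, modality_ids)  # (2, 10, 256)
```

Under the paper's "first interleaving, then disentangling" recipe, a backbone like this would presumably first be trained on interleaved script-keyframe sequences (Interleaved Concept Learning) and then have its text and image experts optimized on decoupled objectives (Disentangled Expert Learning); the exact losses and data format are not specified in the abstract.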
Community
UniMAGE unifies script writing and keyframe generation for long-context video creation using Mixture-of-Transformers and a two-stage interleaving/disentangling training paradigm.
This is an automated message from the Librarian Bot. I found the following similar papers via the Semantic Scholar API:
- ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions (2025)
- Loom: Diffusion-Transformer for Interleaved Generation (2025)
- OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory (2025)
- MultiShotMaster: A Controllable Multi-Shot Video Generation Framework (2025)
- StoryMem: Multi-shot Long Video Storytelling with Memory (2025)
- DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation (2025)
- ConsistCompose: Unified Multimodal Layout Control for Image Composition (2025)