Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation
Abstract
Face-MoGLE, a novel framework using Diffusion Transformers, achieves high-quality, controllable face generation through semantic-decoupled latent modeling, expert specialization, and dynamic gating.
Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle to disentangle semantic controls from the generation pipeline, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) a mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; and (3) a dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings, as well as its robust zero-shot generalization capability. The project page is available at https://github.com/XavierJiezou/Face-MoGLE.
Community
This work tackles a central challenge in generative models: balancing semantic controllability with photorealism.
🧠 Research Motivation
Traditional approaches often struggle to balance global consistency with local detail, or they entangle semantics too tightly with the generation process, resulting in limited flexibility and poor generalization.
✨ Core Contributions
1️⃣ Propose a semantic-decoupled latent modeling method, enabling precise regional manipulation and strong generalization through mask-conditioned decomposition;
2️⃣ Design the Global-Local Mixture of Experts module (MoGLE), which simultaneously captures overall structure and restores fine-grained local semantic details;
3️⃣ Introduce a spatio-temporal dynamic gating network, which adaptively fuses representations according to both diffusion steps and spatial positions.
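The interplay of contributions 2️⃣ and 3️⃣ can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the released implementation: a single global expert attends to all tokens, each local expert is restricted to one semantic region from the parsing mask, and a gating network weights the experts per token using both the token features and a sinusoidal embedding of the diffusion step.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def timestep_embedding(t, dim):
    # Sinusoidal embedding of the diffusion step (DiT-style).
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

# Toy sizes (all hypothetical): feature dim, local experts, tokens.
D, K, N = 16, 4, 64
W_experts = rng.normal(size=(K + 1, D, D)) * 0.05  # expert 0 = global, 1..K = local
W_gate = rng.normal(size=(2 * D, K + 1)) * 0.05    # gate over [token, t-embedding]

def mogle_layer(h, region_masks, t):
    """h: (N, D) tokens; region_masks: (K, N) binary masks from the parsing map."""
    t_emb = np.tile(timestep_embedding(t, D), (h.shape[0], 1))     # (N, D)
    # Dynamic gate: coefficients depend on BOTH spatial token and diffusion step.
    gates = softmax(np.concatenate([h, t_emb], axis=-1) @ W_gate)  # (N, K+1)
    outs = [h @ W_experts[0]]                  # global expert sees every token
    for k in range(K):                         # local expert k sees only its region
        outs.append((h * region_masks[k][:, None]) @ W_experts[k + 1])
    outs = np.stack(outs, axis=1)              # (N, K+1, D)
    return (gates[:, :, None] * outs).sum(axis=1)  # gated fusion of experts

h = rng.normal(size=(N, D))
masks = (rng.random((K, N)) > 0.5).astype(float)
out = mogle_layer(h, masks, t=10)  # (N, D) fused features
```

Because the gate re-weights experts at every diffusion step, the same layer can lean on the global expert early (coarse structure) and shift toward local experts later (region-level detail), which is the intuition behind the spatio-temporal gating design.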
📊 Experimental Results
Face-MoGLE significantly outperforms mainstream models such as PixelFace+ and DDGI on FID, KID, and conditional consistency metrics. It also demonstrates strong zero-shot generalization. The generated faces can even deceive state-of-the-art face forgery detection systems, highlighting its potential implications for security. As a next step, we plan to release dedicated forgery-detection models and evaluation tools, aiming to advance generative AI while providing reliable safeguards for society.
📦 Fully Open-Sourced
We have released the paper, code, model weights, and related datasets. Community use and feedback are warmly welcome!
◼︎ Paper: https://arxiv.org/pdf/2509.00428
◼︎ Code: https://github.com/XavierJiezou/Face-MoGLE
◼︎ Models: https://huggingface.co/XavierJiezou/face-mogle-models
◼︎ Data: https://huggingface.co/datasets/XavierJiezou/face-mogle-datasets
#FaceGeneration #DiffusionModels #ComputerVision #GenerativeAI