Zen-Dub-Live

Real-Time Speech-to-Speech Translation and Lip-Synchronized Video Dubbing

Part of the Zen LM family - powering broadcast-grade AI dubbing

Powered by Zen Omni's Native End-to-End Architecture

Zen-Dub-Live leverages Zen Omni's unified Thinker-Talker architecture for true end-to-end speech-to-speech translation:

┌─────────────────────────────────────────────────────────────────┐
│                         ZEN OMNI                                │
├─────────────────────────────────────────────────────────────────┤
│  THINKER (Understanding)                                        │
│  ├── AuT Audio Encoder (650M) → 12.5Hz token rate              │
│  ├── SigLIP2 Vision Encoder (540M) → lip reading, video        │
│  └── MoE LLM (48L, 128 experts) → multimodal reasoning         │
│                         ↓                                       │
│  TALKER (Speech Generation)                                     │
│  ├── MoE Transformer (20L, 128 experts)                        │
│  ├── MTP Module → 16-codebook prediction per frame             │
│  └── Code2Wav ConvNet → streaming 24kHz waveform               │
└─────────────────────────────────────────────────────────────────┘

Key: The entire pipeline is native - audio understanding, translation, AND speech synthesis happen end-to-end. No separate ASR or TTS models needed.

  • First-packet latency: 234ms (audio) / 547ms (video)
  • Built-in voices: cherry (female), noah (male)
  • Languages: 119 for text, 19 for speech input; speech output uses the 2 built-in voices above

See: Zen Omni Technical Report
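
To put these figures in context, a quick back-of-the-envelope calculation (illustrative only; the token rate comes from the diagram above and the 2 s chunk length from the default chunk_duration in the Configuration section):

# Illustrative budgets implied by the figures above
AUT_TOKEN_RATE_HZ = 12.5      # AuT audio encoder token rate
CHUNK_SECONDS = 2.0           # default chunk_duration (see Configuration)
VIDEO_FPS = 30

audio_tokens_per_chunk = AUT_TOKEN_RATE_HZ * CHUNK_SECONDS   # 25 tokens per 2 s chunk
frame_budget_ms = 1000 / VIDEO_FPS                           # ~33 ms to render each frame

print(f"{audio_tokens_per_chunk:.0f} audio tokens/chunk, {frame_budget_ms:.1f} ms/frame budget")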

Adding Custom Voices

Zen-Dub-Live supports voice cloning for anchor-specific voices:

from zen_dub_live import AnchorVoice

# Clone a voice from reference audio (10-30 seconds recommended)
custom_voice = AnchorVoice.from_audio(
    "anchor_audio_sample.wav",
    name="anchor_01"
)

# Register with the pipeline (a ZenDubLive instance; see Quick Start below)
pipeline.register_voice(custom_voice)

# Use in session
session = await pipeline.create_session(
    anchor_voice="anchor_01",
    ...
)

Voice profiles are stored as embeddings and can be saved/loaded:

# Save voice profile
custom_voice.save("voices/anchor_01.pt")

# Load voice profile
anchor_voice = AnchorVoice.load("voices/anchor_01.pt")

Overview

Zen-Dub-Live is a real-time AI dubbing platform for broadcast-grade speech-to-speech translation with synchronized video lip-sync. The system ingests live video and audio, translates speech, synthesizes anchor-specific voices, and re-renders mouth regions so that lip movements match the translated speech—all under live broadcast latency constraints.

Key Specifications

Attribute       Target
-------------   ----------------------------------
Latency         2.5–3.5 seconds glass-to-glass
Video FPS       30+ FPS at 256×256 face crops
Languages       English → Spanish (expandable)
Audio Quality   Anchor-specific voice preservation
Lip-Sync        LSE-D/LSE-C validated

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         ZEN-DUB-LIVE PIPELINE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                      ZEN-LIVE                                     │   │
│  │  • WebRTC/WHIP/WHEP streaming (github.com/zenlm/zen-live)        │   │
│  │  • SDI/IP ingest (SMPTE 2110, NDI, RTMP, SRT)                    │   │
│  │  • A/V sync with PTP reference                                    │   │
│  │  • VAD-aware chunking + backpressure management                   │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                              ↓                                           │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                      ZEN OMNI                                     │   │
│  │  • Multimodal ASR (audio + lip reading)                          │   │
│  │  • English → Spanish translation                                  │   │
│  │  • Anchor-specific TTS                                            │   │
│  │  • Viseme/prosody generation                                      │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                              ↓                                           │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                       ZEN DUB                                     │   │
│  │  • VAE latent-space face encoding                                │   │
│  │  • One-step U-Net lip inpainting                                 │   │
│  │  • Identity-preserving composition                                │   │
│  │  • 30+ FPS real-time generation                                  │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                              ↓                                           │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                    OUTPUT MULTIPLEXING                            │   │
│  │  • Dubbed video + audio composite                                │   │
│  │  • Fallback: audio-only dubbing                                  │   │
│  │  • Distribution to downstream systems                             │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Components

1. Zen Omni - Hypermodal Language Model

  • Multimodal ASR with lip-reading enhancement
  • Domain-tuned MT for news/broadcast content
  • Anchor-specific Spanish TTS
  • Viseme/prosody generation for lip-sync control

2. Zen Dub - Neural Lip-Sync

  • VAE latent-space face encoding
  • One-step U-Net inpainting (no diffusion steps)
  • Identity-preserving mouth region modification
  • Real-time composite generation

3. Hanzo Orchestration Layer

  • Live SDI/IP feed ingest
  • A/V synchronization with PTP
  • VAD-aware semantic chunking (sketched below)
  • Health monitoring and fallbacks
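
The VAD-aware chunking step can be pictured with a short sketch. This is not part of the zen-dub-live API; it is a minimal illustration using the webrtcvad package, assuming 16-bit mono PCM at 16 kHz in 20 ms frames and cutting each chunk at the first non-speech frame after roughly 2 seconds of audio (the default chunk_duration):

import webrtcvad

SAMPLE_RATE = 16000                       # 16-bit mono PCM
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE * 2 * FRAME_MS // 1000
TARGET_SECONDS = 2.0                      # matches chunk_duration in config.yaml

def vad_chunks(frames, aggressiveness=2):
    """Yield ~2 s byte chunks, cut only on non-speech frames."""
    vad = webrtcvad.Vad(aggressiveness)
    buf, buffered_ms = [], 0
    for frame in frames:                  # each frame: FRAME_BYTES of raw PCM
        buf.append(frame)
        buffered_ms += FRAME_MS
        if buffered_ms >= TARGET_SECONDS * 1000 and not vad.is_speech(frame, SAMPLE_RATE):
            yield b"".join(buf)
            buf, buffered_ms = [], 0
    if buf:                               # flush the tail
        yield b"".join(buf)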

Quick Start

Installation

pip install zen-dub-live

Basic Usage

from zen_dub_live import ZenDubLive

# Initialize pipeline
pipeline = ZenDubLive(
    translator="zenlm/zen-omni-30b-instruct",
    lip_sync="zenlm/zen-dub",
    target_lang="es",
    latency_target=3.0,
)

# Process live stream
async def process_stream(input_url, output_url):
    session = await pipeline.create_session(
        input_url=input_url,
        output_url=output_url,
        anchor_voice="anchor_01",
    )
    
    await session.start()
    # Pipeline runs until stopped
    await session.wait_for_completion()
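
To run the coroutine end to end, wrap it in asyncio.run (the URLs are the same placeholders used in the CLI example below):

import asyncio

asyncio.run(process_stream(
    "rtmp://source.example.com/live",
    "rtmp://output.example.com/spanish",
))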

CLI Usage

# Start live dubbing session
zen-dub-live start \
    --input rtmp://source.example.com/live \
    --output rtmp://output.example.com/spanish \
    --lang es \
    --anchor-voice anchor_01

# Monitor session
zen-dub-live status --session-id abc123

# Stop session
zen-dub-live stop --session-id abc123

API Reference

Session Lifecycle

CreateSession

session = await pipeline.create_session(
    input_url="rtmp://...",
    output_url="rtmp://...",
    target_lang="es",
    anchor_voice="anchor_01",
    latency_target=3.0,
)

StreamIngest (WebSocket/gRPC)

async for chunk in session.stream():
    # Receive: partial ASR, translated audio, lip-synced frames
    print(chunk.translation_text)
    yield chunk.dubbed_audio, chunk.lip_synced_frame

CommitOutput

await session.commit(segment_id)  # Mark segment as stable for playout
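
Putting the three calls together, a playout loop might look like the sketch below. The chunk.segment_id attribute is an assumption (the snippets above only expose translation_text, dubbed_audio, and lip_synced_frame), and playout stands in for whatever buffer feeds the downstream encoder:

async def playout_loop(session, playout):
    # Stream partial results, hand dubbed media to the playout buffer,
    # then mark each segment as stable once it has been enqueued.
    async for chunk in session.stream():
        playout.enqueue(chunk.dubbed_audio, chunk.lip_synced_frame)
        await session.commit(chunk.segment_id)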

Configuration

# config.yaml
pipeline:
  latency_target: 3.0
  chunk_duration: 2.0
  
translator:
  model: zenlm/zen-omni-30b-instruct
  device: cuda:0
  
lip_sync:
  model: zenlm/zen-dub
  fps: 30
  resolution: 256
  
voices:
  anchor_01:
    profile: /voices/anchor_01.pt
    style: news_neutral
  anchor_02:
    profile: /voices/anchor_02.pt
    style: breaking_news
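
A config-loading entry point is not documented here, so the sketch below simply maps config.yaml onto the constructor arguments shown in Quick Start (target_lang is passed explicitly because the sample config does not set it):

import yaml                          # PyYAML
from zen_dub_live import ZenDubLive

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

pipeline = ZenDubLive(
    translator=cfg["translator"]["model"],
    lip_sync=cfg["lip_sync"]["model"],
    target_lang="es",                                  # not in the sample config
    latency_target=cfg["pipeline"]["latency_target"],
)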

Performance

Latency Breakdown

Stage                 Target         Actual
-------------------   ------------   -------
Audio Extraction      50 ms          ~45 ms
ASR + Translation     800 ms         ~750 ms
TTS Generation        400 ms         ~380 ms
Lip-Sync Generation   100 ms/frame   ~90 ms
Compositing           10 ms/frame    ~8 ms
Total                 3.0 s          ~2.8 s
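
A rough way to sanity-check the front of this budget is to time how long the documented stream() API takes to yield its first dubbed chunk after start(). This is only an illustration, not a calibrated glass-to-glass measurement:

import time

async def first_chunk_latency(session):
    # Seconds from session start until the first dubbed chunk arrives.
    t0 = time.monotonic()
    await session.start()
    async for _ in session.stream():
        return time.monotonic() - t0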

Quality Metrics

Metric               Target   Achieved
------------------   ------   --------
ASR WER              <10%     7.2%
MT BLEU              >40      42.3
TTS MOS              >4.0     4.2
LSE-D (sync)         <8.0     7.8
LSE-C (confidence)   >3.0     3.2
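
WER and BLEU are standard offline metrics rather than something zen-dub-live computes itself; a minimal sketch with the jiwer and sacrebleu packages, given lists of reference and system outputs for an evaluation set:

import jiwer
import sacrebleu

def asr_wer(refs, hyps):
    # Word error rate over the whole set, e.g. 0.072 -> 7.2%
    return jiwer.wer(refs, hyps)

def mt_bleu(refs, hyps):
    # Corpus-level BLEU, e.g. 42.3
    return sacrebleu.corpus_bleu(hyps, [refs]).score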

Deployment

On-Premises

# docker-compose.yml
services:
  zen-dub-live:
    image: zenlm/zen-dub-live:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - TRANSLATOR_MODEL=zenlm/zen-omni-30b-instruct
      - LIP_SYNC_MODEL=zenlm/zen-dub
    ports:
      - "8765:8765"  # WebSocket API
      - "50051:50051"  # gRPC API

Hosted (Hanzo Cloud)

# Deploy to Hanzo Cloud
zen-dub-live deploy --region us-west \
    --input-url rtmp://source/live \
    --output-url rtmp://output/spanish

Citation

@misc{zen-dub-live-2024,
  title={Zen-Dub-Live: Real-Time Speech-to-Speech Translation and Lip-Synchronized Video Dubbing},
  author={Zen LM Team and Hanzo AI},
  year={2024},
  url={https://github.com/zenlm/zen-dub-live}
}

License

Apache 2.0 • No data collection • Privacy-first
