Zen-Dub-Live
Real-Time Speech-to-Speech Translation and Lip-Synchronized Video Dubbing
Part of the Zen LM family - powering broadcast-grade AI dubbing
Powered by Zen Omni's Native End-to-End Architecture
Zen-Dub-Live leverages Zen Omni's unified Thinker-Talker architecture for true end-to-end speech-to-speech translation:
┌─────────────────────────────────────────────────────────────────┐
│                            ZEN OMNI                             │
├─────────────────────────────────────────────────────────────────┤
│  THINKER (Understanding)                                        │
│  ├── AuT Audio Encoder (650M)       → 12.5 Hz token rate        │
│  ├── SigLIP2 Vision Encoder (540M)  → lip reading, video        │
│  └── MoE LLM (48L, 128 experts)     → multimodal reasoning      │
│                               ↓                                 │
│  TALKER (Speech Generation)                                     │
│  ├── MoE Transformer (20L, 128 experts)                         │
│  ├── MTP Module        → 16-codebook prediction per frame       │
│  └── Code2Wav ConvNet  → streaming 24 kHz waveform              │
└─────────────────────────────────────────────────────────────────┘
Key: The entire pipeline is native - audio understanding, translation, AND speech synthesis happen end-to-end. No separate ASR or TTS models needed.
- First-packet latency: 234ms (audio) / 547ms (video)
- Built-in voices: Cherry (female), Noah (male)
- Coverage: 119 text languages, 19 speech-input languages, 2 speech-output voices
See: Zen Omni Technical Report
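The first-packet figure can be sanity-checked from client code. A minimal sketch, assuming the session API shown in the Quick Start below and that stream() chunks expose a dubbed_audio field (as in the StreamIngest example); the field names are illustrative, not a documented contract:

import time

async def first_packet_latency(pipeline, input_url, output_url):
    """Measure time from session start to the first dubbed audio packet."""
    session = await pipeline.create_session(
        input_url=input_url,
        output_url=output_url,
        anchor_voice="anchor_01",
    )
    start = time.monotonic()
    await session.start()
    async for chunk in session.stream():
        if chunk.dubbed_audio:  # first synthesized audio to arrive
            return time.monotonic() - start
    raise RuntimeError("stream ended before any audio arrived")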
Adding Custom Voices
Zen-Dub-Live supports voice cloning for anchor-specific voices:
from zen_dub_live import AnchorVoice

# Clone a voice from reference audio (10-30 seconds recommended)
custom_voice = AnchorVoice.from_audio(
    "anchor_audio_sample.wav",
    name="anchor_01",
)

# Register for use in the pipeline (`pipeline` is a ZenDubLive instance; see Quick Start)
pipeline.register_voice(custom_voice)

# Use in a session
session = await pipeline.create_session(
    anchor_voice="anchor_01",
    ...
)
Voice profiles are stored as embeddings and can be saved/loaded:
# Save voice profile
custom_voice.save("voices/anchor_01.pt")
# Load voice profile
anchor_voice = AnchorVoice.load("voices/anchor_01.pt")
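The on-disk schema of a profile is not documented here, but since it is a .pt file it can presumably be inspected with torch.load. A sketch, under the assumption that the file holds a speaker-embedding tensor or a dict containing one:

import torch

# Assumption: the .pt profile is a speaker-embedding tensor or a dict of tensors.
profile = torch.load("voices/anchor_01.pt", map_location="cpu")
if isinstance(profile, torch.Tensor):
    print("embedding shape:", tuple(profile.shape))
else:
    print("profile contents:", list(profile.keys()))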
Overview
Zen-Dub-Live is a real-time AI dubbing platform for broadcast-grade speech-to-speech translation with synchronized video lip-sync. The system ingests live video and audio, translates speech, synthesizes anchor-specific voices, and re-renders mouth regions so that lip movements match the translated speech—all under live broadcast latency constraints.
Key Specifications
| Attribute | Target |
|---|---|
| Latency | 2.5–3.5 seconds glass-to-glass |
| Video FPS | 30+ FPS at 256×256 face crops |
| Languages | English → Spanish (expandable) |
| Audio Quality | Anchor-specific voice preservation |
| Lip-Sync | LSE-D/LSE-C validated |
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│                          ZEN-DUB-LIVE PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                             ZEN-LIVE                             │   │
│  │ • WebRTC/WHIP/WHEP streaming (github.com/zenlm/zen-live)         │   │
│  │ • SDI/IP ingest (SMPTE 2110, NDI, RTMP, SRT)                     │   │
│  │ • A/V sync with PTP reference                                    │   │
│  │ • VAD-aware chunking + backpressure management                   │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                    ↓                                    │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                             ZEN OMNI                             │   │
│  │ • Multimodal ASR (audio + lip reading)                           │   │
│  │ • English → Spanish translation                                  │   │
│  │ • Anchor-specific TTS                                            │   │
│  │ • Viseme/prosody generation                                      │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                    ↓                                    │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                             ZEN DUB                              │   │
│  │ • VAE latent-space face encoding                                 │   │
│  │ • One-step U-Net lip inpainting                                  │   │
│  │ • Identity-preserving composition                                │   │
│  │ • 30+ FPS real-time generation                                   │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                    ↓                                    │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                        OUTPUT MULTIPLEXING                       │   │
│  │ • Dubbed video + audio composite                                 │   │
│  │ • Fallback: audio-only dubbing                                   │   │
│  │ • Distribution to downstream systems                             │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Components
1. Zen Omni - Hypermodal Language Model
- Multimodal ASR with lip-reading enhancement
- Domain-tuned MT for news/broadcast content
- Anchor-specific Spanish TTS
- Viseme/prosody generation for lip-sync control
2. Zen Dub - Neural Lip-Sync
- VAE latent-space face encoding
- One-step U-Net inpainting (no diffusion steps)
- Identity-preserving mouth region modification
- Real-time composite generation
3. Hanzo Orchestration Layer
- Live SDI/IP feed ingest
- A/V synchronization with PTP
- VAD-aware semantic chunking (see the sketch after this list)
- Health monitoring and fallbacks
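To make the chunking step concrete, here is a toy energy-based VAD chunker. It is a sketch only: the production pipeline's "semantic" chunking presumably also uses ASR/linguistic boundaries, and the thresholds below are illustrative.

import numpy as np

def vad_chunks(samples: np.ndarray, sr: int = 16000, frame_ms: int = 30,
               max_chunk_s: float = 2.0, energy_thresh: float = 1e-4):
    """Yield chunks split at low-energy (silent) frames, capped near max_chunk_s."""
    frame = int(sr * frame_ms / 1000)
    max_frames = int(max_chunk_s * 1000 / frame_ms)
    start, n = 0, 0
    for i in range(0, len(samples) - frame + 1, frame):
        n += 1
        energy = float(np.mean(samples[i:i + frame] ** 2))
        # Cut on silence (never emitting an empty chunk), or when the cap is hit
        if (energy < energy_thresh and n > 1) or n >= max_frames:
            yield samples[start:i + frame]
            start, n = i + frame, 0
    if start < len(samples):
        yield samples[start:]  # trailing partial chunk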
Quick Start
Installation
pip install zen-dub-live
Basic Usage
import asyncio

from zen_dub_live import ZenDubLive

# Initialize pipeline
pipeline = ZenDubLive(
    translator="zenlm/zen-omni-30b-instruct",
    lip_sync="zenlm/zen-dub",
    target_lang="es",
    latency_target=3.0,
)

# Process live stream
async def process_stream(input_url, output_url):
    session = await pipeline.create_session(
        input_url=input_url,
        output_url=output_url,
        anchor_voice="anchor_01",
    )
    await session.start()
    # Pipeline runs until stopped
    await session.wait_for_completion()

asyncio.run(process_stream(
    "rtmp://source.example.com/live",
    "rtmp://output.example.com/spanish",
))
CLI Usage
# Start live dubbing session
zen-dub-live start \
--input rtmp://source.example.com/live \
--output rtmp://output.example.com/spanish \
--lang es \
--anchor-voice anchor_01
# Monitor session
zen-dub-live status --session-id abc123
# Stop session
zen-dub-live stop --session-id abc123
API Reference
Session Lifecycle
CreateSession
session = await pipeline.create_session(
    input_url="rtmp://...",
    output_url="rtmp://...",
    target_lang="es",
    anchor_voice="anchor_01",
    latency_target=3.0,
)
StreamIngest (WebSocket/gRPC)
async for chunk in session.stream():
    # Receive: partial ASR text, translated audio, lip-synced frames
    print(chunk.translation_text)
    yield chunk.dubbed_audio, chunk.lip_synced_frame
CommitOutput
await session.commit(segment_id) # Mark segment as stable for playout
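Putting the two calls together, a playout loop typically consumes chunks and commits each segment once it is complete. A sketch, assuming chunks carry segment_id and is_final fields (illustrative names, not confirmed by the API above):

async def playout_loop(session, sink):
    """Consume dubbed output; commit segments once fully received."""
    async for chunk in session.stream():
        sink.render(chunk.dubbed_audio, chunk.lip_synced_frame)  # hand off to playout
        if chunk.is_final:  # assumed: marks the last chunk of a segment
            await session.commit(chunk.segment_id)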
Configuration
# config.yaml
pipeline:
  latency_target: 3.0
  chunk_duration: 2.0

translator:
  model: zenlm/zen-omni-30b-instruct
  device: cuda:0

lip_sync:
  model: zenlm/zen-dub
  fps: 30
  resolution: 256

voices:
  anchor_01:
    profile: /voices/anchor_01.pt
    style: news_neutral
  anchor_02:
    profile: /voices/anchor_02.pt
    style: breaking_news
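One way to wire this file into the Python API is to load it with PyYAML and pass the values to the constructor. A sketch, under the assumption that ZenDubLive accepts these settings as keyword arguments (a dedicated from_config helper, if one exists, is not documented here):

import yaml

from zen_dub_live import ZenDubLive

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

pipeline = ZenDubLive(
    translator=cfg["translator"]["model"],
    lip_sync=cfg["lip_sync"]["model"],
    target_lang="es",
    latency_target=cfg["pipeline"]["latency_target"],
)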
Performance
Latency Breakdown
| Stage | Target | Actual |
|---|---|---|
| Audio Extraction | 50ms | ~45ms |
| ASR + Translation | 800ms | ~750ms |
| TTS Generation | 400ms | ~380ms |
| Lip-Sync Generation | 100ms/frame | ~90ms |
| Compositing | 10ms/frame | ~8ms |
| Total | 3.0s | ~2.8s |
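Note that the per-stage figures do not sum to the total on their own. A quick arithmetic check; the remaining budget is presumably chunk buffering (chunk_duration: 2.0, partially overlapped with processing) plus mux and transport:

# Per-chunk stages from the table above (seconds)
processing = 0.045 + 0.750 + 0.380        # extraction + ASR/MT + TTS ≈ 1.18 s
first_frame = processing + 0.090 + 0.008  # + lip-sync and compositing for one frame
print(f"compute path ≈ {first_frame:.2f} s of the ~2.8 s glass-to-glass total")
# Remainder: chunk buffering and transport (an assumption, not measured here).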
Quality Metrics
| Metric | Target | Achieved |
|---|---|---|
| ASR WER | <10% | 7.2% |
| MT BLEU | >40 | 42.3 |
| TTS MOS | >4.0 | 4.2 |
| LSE-D (sync) | <8.0 | 7.8 |
| LSE-C (confidence) | >3.0 | 3.2 |
Deployment
On-Premises
# docker-compose.yml
services:
  zen-dub-live:
    image: zenlm/zen-dub-live:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - TRANSLATOR_MODEL=zenlm/zen-omni-30b-instruct
      - LIP_SYNC_MODEL=zenlm/zen-dub
    ports:
      - "8765:8765"   # WebSocket API
      - "50051:50051" # gRPC API
Hosted (Hanzo Cloud)
# Deploy to Hanzo Cloud
zen-dub-live deploy --region us-west \
--input-url rtmp://source/live \
--output-url rtmp://output/spanish
Documentation
- Whitepaper - Full technical details
- API Reference - Complete API documentation
- Deployment Guide - Production deployment
- Voice Training - Custom voice profiles
Resources
- 🌐 Website
- 📖 Documentation
- 💬 Discord
- 🐙 GitHub
Related Projects
- Zen Omni - hypermodal language model (zenlm/zen-omni-30b-instruct)
- Zen Dub - neural lip-sync model (zenlm/zen-dub)
- Zen Live - WebRTC/WHIP/WHEP streaming (github.com/zenlm/zen-live)
Citation
@misc{zen-dub-live-2024,
  title={Zen-Dub-Live: Real-Time Speech-to-Speech Translation and Lip-Synchronized Video Dubbing},
  author={Zen LM Team and Hanzo AI},
  year={2024},
  url={https://github.com/zenlm/zen-dub-live}
}
Organizations
- Hanzo AI Inc - Techstars '17 • Award-winning GenAI lab
- Zoo Labs Foundation - 501(c)(3) Non-Profit
License
Apache 2.0 • No data collection • Privacy-first