arxiv:2512.07168

JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention

Published on Dec 8 · Submitted by Aman Chadha on Dec 9

Abstract

A two-stage self-supervised framework combines JEPA with DAAM to learn robust speech representations, using masked prediction, FSQ, mixed-radix packing, and HiFi-GAN for efficient tokenization and waveform reconstruction.

AI-generated summary

We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.

Community


This paper introduces JEPA+DAAM, a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Gaussian mixture–based Density Adaptive Attention Mechanism (DAAM) to learn semantically rich and highly compressible speech representations, achieving reversible neural tokenization at 47.5 tokens/sec with strong reconstruction quality.

➡️ Key Highlights of JEPA+DAAM:

🧠 JEPA for Self-Supervised Speech Encoding:
Decouples representation learning from waveform reconstruction using a masked-prediction JEPA objective in latent space. The encoder learns semantically meaningful embeddings at 2.5 Hz without requiring a low-level waveform loss, enabling robust self-supervised pretraining and cross-task adaptability (ASR, TTS, voice conversion).
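The masked-prediction objective above can be sketched in a few lines. This is a minimal illustration only: `W_ctx`, `W_tgt`, and `W_pred` are toy linear stand-ins for the context encoder, the (EMA) target encoder, and the predictor, which are not the paper's actual architecture, and the L2 loss over masked positions is an assumed choice of latent-space distance.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_IN, D_LAT = 20, 32, 16  # frames, input dim, latent dim (arbitrary sizes)

# Toy stand-ins for the context encoder, EMA target encoder, and predictor.
W_ctx = rng.normal(size=(D_IN, D_LAT)) * 0.1
W_tgt = rng.normal(size=(D_IN, D_LAT)) * 0.1
W_pred = rng.normal(size=(D_LAT, D_LAT)) * 0.1

def jepa_loss(frames, mask):
    """JEPA-style masked prediction: predict latent targets, not waveforms."""
    targets = frames @ W_tgt                        # target embeddings (no grad in practice)
    visible = np.where(mask[:, None], 0.0, frames)  # hide masked frames from the context path
    context = visible @ W_ctx
    preds = context @ W_pred
    # loss is computed only at masked positions, entirely in latent space
    return float(np.mean((preds[mask] - targets[mask]) ** 2))

frames = rng.normal(size=(T, D_IN))
mask = np.zeros(T, dtype=bool)
mask[5:15] = True  # mask a contiguous span of frames
loss = jepa_loss(frames, mask)
```

The key property this sketch shows is that the reconstruction target lives in the encoder's latent space, so no waveform-level loss ever enters Stage 1.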

🎯 Density Adaptive Attention (DAAM):
Introduces a Gaussian mixture-based gating attention mechanism that modulates temporal features based on local statistical salience rather than pairwise dot-products. This allows adaptive feature selection and hierarchical speech-structure discovery with linear complexity, improving JEPA convergence (loss 0.09 vs. 0.17 without DAAM).
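One way to see why this gating is linear in sequence length is the following single-Gaussian special case. This is a simplified sketch, not the paper's mechanism: DAAM uses learned Gaussian-mixture parameters, whereas here the per-dimension mean and variance are simply estimated from the sequence itself.

```python
import numpy as np

def daam_gate(x, eps=1e-5):
    """Single-Gaussian sketch of density-adaptive gating.

    Each frame is reweighted by its Gaussian density under per-dimension
    statistics of the sequence (a stand-in for learned mixture parameters).
    Cost is O(T*D): no pairwise dot-products, hence linear complexity.

    x: (T, D) array of frame features.
    Returns the gated features and the gate values in (0, 1].
    """
    mu = x.mean(axis=0)            # per-dimension mean over time
    var = x.var(axis=0) + eps      # per-dimension variance (stabilized)
    gate = np.exp(-0.5 * (x - mu) ** 2 / var)
    return x * gate, gate

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))
gated, gate = daam_gate(x)
```

Unlike softmax attention, which compares every frame against every other frame, the gate here depends only on each frame's position relative to the feature distribution, which is what keeps the cost linear.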

🧩 FSQ + Mixed-Radix Tokenization:
Implements Finite Scalar Quantization (FSQ) with mixed-radix integer packing for reversible, codebook-free tokenization (47.5 tokens/sec, 16,384-way vocabulary). Combined with a HiFi-GAN decoder, it achieves high-fidelity waveform reconstruction, outperforming or matching neural audio codecs like SoundStream and EnCodec at dramatically lower frame rates.
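The FSQ-plus-packing pipeline can be sketched as follows. The level spec `LEVELS` is hypothetical: any radices whose product is 16,384 yield the vocabulary size quoted above, and the paper's exact per-dimension levels may differ. The sketch shows why the scheme is codebook-free and exactly reversible.

```python
import numpy as np

# Hypothetical FSQ levels; product = 16,384 (the paper's radices may differ).
LEVELS = [8, 8, 8, 8, 4]
assert int(np.prod(LEVELS)) == 16384

def fsq_quantize(z, levels=LEVELS):
    """FSQ sketch: bound each latent dim with tanh, round to levels[d] bins."""
    z = np.tanh(np.asarray(z, dtype=float))  # squash each dim into (-1, 1)
    return [int(round((zi + 1) / 2 * (L - 1))) for zi, L in zip(z, levels)]

def pack(digits, levels=LEVELS):
    """Mixed-radix packing: fold per-dim codes into one integer token."""
    token = 0
    for d, L in zip(digits, levels):
        token = token * L + d
    return token

def unpack(token, levels=LEVELS):
    """Exact inverse of pack() -- tokenization is fully reversible."""
    digits = []
    for L in reversed(levels):
        token, d = divmod(token, L)
        digits.append(d)
    return digits[::-1]
```

Because each quantized dimension is just a digit in a mixed-radix integer, no learned codebook is needed, and `unpack(pack(d)) == d` holds for every valid code, which is what makes the tokens losslessly invertible back to FSQ codes.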

