---
license: cc-by-4.0
language:
- fr
- en
library_name: hibiki
tags:
- speech
- translation
- streaming
metrics:
- bleu
---

# Model Card for Hibiki

[Hibiki](https://github.com/kyutai-labs/hibiki) is a model for streaming speech translation (also known as *simultaneous* translation). Unlike offline translation, where one waits for the end of the source utterance before starting to translate, Hibiki adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. As the user speaks, Hibiki generates natural speech in the target language, optionally with voice transfer, along with a text translation.
Hibiki currently only supports French-to-English translation.

## Model Details

This is the model simply referred to as *Hibiki* in our [paper](https://arxiv.org/abs/2502.03382), a 2.7B parameter
hierarchical Transformer producing speech and text tokens at a framerate of 12.5Hz, with audio being generated at a
2.2kbps bitrate.
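
To relate those numbers, the snippet below does the plain arithmetic of converting the quoted bitrate and framerate into bits of audio information per generated frame; the codebook split mentioned in the comment is an illustrative assumption, not an official description of Hibiki's configuration.

```python
# Back-of-the-envelope check relating the numbers quoted above.
bitrate_bps = 2200        # 2.2 kbps audio bitrate
framerate_hz = 12.5       # token frames per second (one frame every 80 ms)

bits_per_frame = bitrate_bps / framerate_hz
print(f"{bits_per_frame:.0f} bits of audio per frame")  # -> 176 bits

# If each codebook holds 2048 entries (11 bits), 176 bits would correspond to
# 16 codebooks per frame -- an assumption for intuition, not a confirmed config.
```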

### Model Description

Hibiki is a decoder-only model for simultaneous speech translation. Hibiki leverages the multistream architecture of [Moshi](https://arxiv.org/abs/2410.00037)
to model source and target speech jointly. This allows Hibiki to continuously process the input stream while generating
the target speech. Hibiki produces text and audio tokens at a constant framerate of 12.5Hz. This allows for a continuous
output audio stream, along with timestamped text translation. Since Hibiki relies on simple temperature sampling,
it is compatible with batching, unlike models that rely on complex inference policies. Moreover, the fidelity of Hibiki's
voice transfer can be controlled by changing the coefficient of the Classifier-Free Guidance: a larger coefficient will
increase voice similarity, but excessive coefficients can lead to worse translations.
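
For intuition, classifier-free guidance at sampling time is usually implemented by mixing a conditioned and an unconditioned prediction of the next-token logits; the sketch below shows that standard combination followed by temperature sampling. It is a generic illustration (the tensor names and the exact conditioning are assumptions), not Hibiki's actual inference code.

```python
import torch

def cfg_logits(logits_cond: torch.Tensor,
               logits_uncond: torch.Tensor,
               cfg_coef: float) -> torch.Tensor:
    """Standard classifier-free guidance mix of two logit tensors.

    cfg_coef = 1.0 recovers the conditioned prediction; larger values push
    generation further toward the conditioning (here, speaker similarity),
    at the risk of degrading translation quality.
    """
    return logits_uncond + cfg_coef * (logits_cond - logits_uncond)

def sample(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
    """Simple temperature sampling over logits of shape (batch, vocab)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```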

- **Developed by:** Kyutai
- **Model type:** Simultaneous speech-to-speech and speech-to-text translation.
- **Language(s) (NLP):** French-to-English
- **License:** CC-BY 4.0

### Model Sources

- **Repository:** [repo](https://github.com/kyutai-labs/hibiki)
- **Paper:** [paper](https://arxiv.org/abs/2502.03382)
- **Examples:** [demo](https://hf.co/spaces/kyutai/hibiki-samples)

## Uses

### Direct Use

The model can be used for streaming translation from French to English in real-time settings, or for batched
simultaneous translation of many input sequences. It is robust to noisy conditions and is trained on sequences of up
to 120 seconds.
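
Because every sequence advances by exactly one frame per 80 ms step, batched simultaneous translation reduces to stepping all streams in lockstep. The loop below is a purely illustrative sketch: `load_hibiki`, `init_state`, `step`, `emit_audio` and `emit_text` are hypothetical placeholder names, not the actual API; see the [repository](https://github.com/kyutai-labs/hibiki) for the supported inference entry points.

```python
# Illustrative streaming loop (hypothetical API, see the note above).
model = load_hibiki("kyutai/hibiki")          # placeholder loader
state = model.init_state(batch_size=4)        # one decoding state per input stream

for frame in source_audio_frames:             # one batch of source frames every 80 ms
    out = model.step(frame, state)            # advances all streams by one 12.5 Hz step
    emit_audio(out.audio_tokens)              # translated speech, frame by frame
    emit_text(out.text_tokens)                # timestamped text translation
```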

### Downstream Use

Some components of the model can be used independently or repurposed relatively easily.
For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12.5Hz and a bitrate of 1.1kbps, which makes it particularly well suited to training speech language models or text-to-speech systems. Regarding the main Hibiki architecture,
supporting other pairs of languages would require finetuning.
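
As an example of reusing Mimi on its own, the sketch below round-trips a waveform through the codec using its Hugging Face `transformers` port (`kyutai/mimi`). The class and method names are assumed to follow that port's codec interface; double-check the current `transformers` documentation if they differ.

```python
# Sketch: tokenize and reconstruct audio with the Mimi codec
# (assumed `transformers` interface for the kyutai/mimi checkpoint).
import torch
from transformers import AutoFeatureExtractor, MimiModel

feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
model = MimiModel.from_pretrained("kyutai/mimi")

# One second of silence at the codec's expected sampling rate (24 kHz).
waveform = torch.zeros(feature_extractor.sampling_rate)
inputs = feature_extractor(raw_audio=waveform.numpy(),
                           sampling_rate=feature_extractor.sampling_rate,
                           return_tensors="pt")

with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes  # discrete tokens at ~12.5 Hz
    audio = model.decode(codes).audio_values                  # reconstructed waveform
```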

### Out-of-Scope Use

The model is not intended to be used to impersonate other people, nor for any malicious use of any kind.

## How to Get Started with the Model

See the main [README](https://github.com/kyutai-labs/hibiki) file.

## Training Details

### Training Data

- Textual data: The underlying [Helium](https://huggingface.co/kyutai/helium-1-preview-2b) model is trained on a mix of
data including: Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.

- Audio data

  - **Unsupervised audio dataset:** used for pre-training, this is a collection of 7M hours of readily available audio content in English and 450k hours in French, following the preprocessing and recipe of [Moshi](https://arxiv.org/abs/2410.00037).
  - **Synthetic translation dataset:** Around 40k hours of parallel French-English data synthesized with *contextual alignment* (see [Section 3.2](https://arxiv.org/pdf/2502.03382)) with various levels of speaker similarity.
  - **Translation finetuning:** A 900-hour mixture of a resynthesized version of [CVSS-T](https://github.com/google-research-datasets/cvss) and synthetic long-form utterances.

### Training procedure and hyper-parameters

The different stages of the training procedure are detailed in the paper along with the hyper-parameters.

### Compute Infrastructure

The final model was trained on 48 Nvidia H100 GPUs.

## Citation

```
@misc{labiausse2025hibiki,
      title={High-Fidelity Simultaneous Speech-To-Speech Translation},
      author={Tom Labiausse and Laurent Mazaré and Edouard Grave and Patrick Pérez and Alexandre Défossez and Neil Zeghidour},
      year={2025},
      eprint={2502.03382},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.03382},
}
```

## Model Card Authors

Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, Neil Zeghidour