---
library_name: transformers
license: mit
datasets:
- openslr/librispeech_asr
- slprl/SpokenSwag
- slprl/sTinyStories
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: audio-to-audio
---

# Model Card for Model ID

This is a Speech Language Model trained to generate speech continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz).

## Model Details

### Model Description

This is a Speech Language Model, introduced in "_Slamming_: Training a Speech Language Model on One GPU in a Day", focusing on efficient training. It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).

The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and the synthetic dataset [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- **Model type:** SpeechLM
- **License:** MIT
- **Finetuned from model:** [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)

### Model Sources

- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- **Paper:** [Soon!]
- **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

## Uses

This is a base SpeechLM and as such can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_ [codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.

### Out-of-Scope Use

This model was trained on curated speech datasets which contain mainly audiobooks and stories; as such, its outputs should not be treated as factual in any way.

## How to Get Started with the Model

We refer users to the official repository for full usage explanations - [github](https://github.com/slp-rl/slamkit). A hedged generation sketch is also included after the Preprocessing section below.

## Training Details

We highly encourage users to read the full [paper]() for the full training details; a brief overview is provided below.

### Training Data

This model was trained on a subset of [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train, [Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

### Training Procedure

This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag). Please refer to the [paper]() or [code](https://github.com/slp-rl/slamkit) for the full training recipes.

#### Preprocessing

Speech tokens are extracted from the audio using [HuBERT-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the official k-means released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Units are de-duplicated. We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slamkit).
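As an illustration of this preprocessing pipeline, here is a minimal sketch. It assumes the checkpoint loads with 🤗transformers' `HubertModel` and that the released k-means centroids have been exported to a NumPy array (the path `kmeans_500_centroids.npy` is hypothetical); the official extraction goes through textlesslib, so treat this as an approximation rather than the exact recipe.

```python
from itertools import groupby

import numpy as np
import torch
import torchaudio
from transformers import HubertModel

# Assumption: the checkpoint is loadable as a transformers HubertModel.
hubert = HubertModel.from_pretrained("slprl/mhubert-base-25hz").eval()
# Assumption: k-means centroids exported to a (500, hidden_dim) array;
# the official release ships them through textlesslib.
centroids = torch.from_numpy(np.load("kmeans_500_centroids.npy")).float()

wav, sr = torchaudio.load("prompt.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000)  # HuBERT expects 16 kHz
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)  # downmix to mono

with torch.inference_mode():
    out = hubert(wav, output_hidden_states=True)
    feats = out.hidden_states[11].squeeze(0)  # 11th transformer layer, 25 Hz frames

# Quantise each frame to its nearest centroid, then de-duplicate
# consecutive repeats as done for training.
units = torch.cdist(feats, centroids).argmin(dim=-1).tolist()
units = [u for u, _ in groupby(units)]
```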
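And the generation sketch referenced from "How to Get Started" above, continuing from the `units` list. The repo id `slprl/slam_scaled` and the `<unit-id>` textual rendering of speech tokens are assumptions; check the SlamKit repository for the actual checkpoint name and unit-to-token mapping. Note that turning generated units back into audio requires a separate unit vocoder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "slprl/slam_scaled"  # hypothetical repo id; substitute the real checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

# Assumption: unit IDs are serialised as textual tokens such as "<17>".
prompt = "".join(f"<{u}>" for u in units)
inputs = tokenizer(prompt, return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_k=50)
new_tokens = generated[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens))  # continuation units; vocode to obtain audio
```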
## Evaluation

The paper provides the full results; we give some here and also refer to the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some samples.

| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|-------------------------------------------|---------|--------|---------------|-----------|---------------|---------------|----------|-------------|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | — | — |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | — | 75.4 | — | — |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | — | — | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | — | — |
| Scaled Optimal | — | 823M | 82B | **61.3** | 56.7 | 78.0 | — | — |
| Moshi | ?×H100 | 7B | ? | 58.9 | **58.7** | **81.8** | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | — | — |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | — | 9B | ~1T | — | **62.4** | 82.9 | — | — |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | — | — |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | — | — |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | **62.3** | 61.1 | **86.8** | — | — |
| **Ours (_Slam_)** | | | | | | | | |
| _Slam_ (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| _Slam_ | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| _Slam_ (scaled) | 2×A100 | 358M | 16.7B + 9M | **61.11** | **61.30** | **84.18** | **46.6** | 3.75 |

### Compute Infrastructure

This model was trained as part of "_Slamming_: Training a Speech Language Model on One GPU in a Day", which focuses on efficient training.

#### Hardware

This model was trained using **only 2 NVIDIA A100 GPUs** for **48 hours**.

#### Software

The model was trained using the [_SlamKit_](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support easy and efficient training of Speech Language Models.

## Citation

**BibTeX:**

Soon!