---
library_name: transformers
license: mit
datasets:
- openslr/librispeech_asr
- slprl/SpokenSwag
- slprl/sTinyStories
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: audio-to-audio
---

# Model Card for Model ID

This is a Speech Language Model trained to generate speech continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz).

## Model Details

### Model Description

This is a Speech Language Model, introduced in "_Slamming_: Training a Speech Language Model on One GPU in a Day", focusing on efficient training. It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).

The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and the synthetic dataset [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- **Model type:** SpeechLM
- **License:** MIT
- **Finetuned from model:** [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)

### Model Sources

- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- **Paper:** [Soon!]
- **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

## Uses

This is a base SpeechLM and as such can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_ [codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.

### Out-of-Scope Use

This model was trained on curated speech datasets which contain mainly audiobooks and stories; as such, its outputs should not be treated as factual in any way.

## How to Get Started with the Model

We refer users to the official repository for full usage explanations - [github](https://github.com/slp-rl/slamkit). A hedged generation sketch is also included after the Preprocessing section below.

## Training Details

We highly encourage users to read the full [paper]() for the full training details; a brief overview is provided below.

### Training Data

This model was trained on a subset of [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train, [Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

### Training Procedure

This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag). Please refer to the [paper]() or [code](https://github.com/slp-rl/slamkit) for the full training recipes.

#### Preprocessing

Speech tokens are extracted from the audio using [HuBERT-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the official k-means released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Units are de-duplicated. We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slamkit).
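As an illustration of this preprocessing pipeline, here is a minimal sketch. It assumes the checkpoint loads with 🤗transformers' `HubertModel` and that the released k-means centroids have been exported to a NumPy array (the path `kmeans_500_centroids.npy` is hypothetical); the official extraction goes through textlesslib, so treat this as an approximation rather than the exact recipe.

```python
from itertools import groupby

import numpy as np
import torch
import torchaudio
from transformers import HubertModel

# Assumption: the checkpoint is loadable as a transformers HubertModel.
hubert = HubertModel.from_pretrained("slprl/mhubert-base-25hz").eval()
# Assumption: k-means centroids exported to a (500, hidden_dim) array;
# the official release ships them through textlesslib.
centroids = torch.from_numpy(np.load("kmeans_500_centroids.npy")).float()

wav, sr = torchaudio.load("prompt.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000)  # HuBERT expects 16 kHz
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)  # downmix to mono

with torch.inference_mode():
    out = hubert(wav, output_hidden_states=True)
    feats = out.hidden_states[11].squeeze(0)  # 11th transformer layer, 25 Hz frames

# Quantise each frame to its nearest centroid, then de-duplicate
# consecutive repeats as done for training.
units = torch.cdist(feats, centroids).argmin(dim=-1).tolist()
units = [u for u, _ in groupby(units)]
```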
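And the generation sketch referenced from "How to Get Started" above, continuing from the `units` list. The repo id `slprl/slam_scaled` and the `<unit-id>` textual rendering of speech tokens are assumptions; check the SlamKit repository for the actual checkpoint name and unit-to-token mapping. Note that turning generated units back into audio requires a separate unit vocoder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "slprl/slam_scaled"  # hypothetical repo id; substitute the real checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

# Assumption: unit IDs are serialised as textual tokens such as "<17>".
prompt = "".join(f"<{u}>" for u in units)
inputs = tokenizer(prompt, return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_k=50)
new_tokens = generated[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens))  # continuation units; vocode to obtain audio
```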
## Evaluation

The paper provides the full results; we give some here and also refer to the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some samples.

| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|-------------------------------------------|---------|--------|---------------|-----------|---------------|---------------|----------|-------------|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | — | — |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | — | 75.4 | — | — |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | — | — | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | — | — |
| Scaled Optimal | — | 823M | 82B | **61.3** | 56.7 | 78.0 | — | — |
| Moshi | ?×H100 | 7B | ? | 58.9 | **58.7** | **81.8** | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | — | — |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | — | 9B | ~1T | — | **62.4** | 82.9 | — | — |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | — | — |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | — | — |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | **62.3** | 61.1 | **86.8** | — | — |
| **Ours (_Slam_)** | | | | | | | | |
| _Slam_ (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| _Slam_ | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| _Slam_ (scaled) | 2×A100 | 358M | 16.7B + 9M | **61.11** | **61.30** | **84.18** | **46.6** | 3.75 |

### Compute Infrastructure

This model was trained as part of "_Slamming_: Training a Speech Language Model on One GPU in a Day", which focuses on efficient training.

#### Hardware

This model was trained using **only 2 NVIDIA A100 GPUs** for **48 hours**.

#### Software

The model was trained using the [_SlamKit_](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support easy and efficient training of Speech Language Models.

## Citation

**BibTeX:**

Soon!