---
base_model:
- mistralai/Voxtral-Small-24B-2507
- mistralai/Mistral-Small-3.2-24B-Instruct-2506
license: apache-2.0
pipeline_tag: any-to-any
---

# Home-cooked Mistral Small Omni

This is a multimodal model created by merging Mistral Small 2506 (which has vision capabilities) and Voxtral 2507 (which has audio capabilities) using a modified version of the `mergekit` tool. For detailed merging instructions, refer to the sections below.

## License and Attribution

This model is a merged derivative work combining Mistral Small 2506 and Voxtral 2507, both originally released by Mistral AI under the Apache 2.0 license. The merged model is also distributed under the Apache 2.0 license, and the full license text, along with the original copyright notices, is included in this repository.

I have no affiliation, sponsorship, or formal relationship with Mistral AI. This project is an independent effort to combine the vision and audio capabilities of the two models.

## Steps to reproduce

### Merge the text models

Install `mergekit` pinned to this commit: https://github.com/arcee-ai/mergekit/tree/0027c5c51471fa891d438eccda5455ebe55b536e

Modify the `mergekit` source code: open `mergekit/merge_methods/generalized_task_arithmetic.py` and add the three lines marked below. The guard skips tensors whose shapes differ between the two models (i.e. tensors that exist in only one of them), which would otherwise crash the merge:

```py
    # Normalize the vectors to get the directions and angles
    v0 = normalize(v0, eps)
    v1 = normalize(v1, eps)

    if v0.shape != v1.shape:               # ADD THIS
        res = np.array([0.0])              # ADD THIS
        return maybe_torch(res, is_torch)  # ADD THIS

    # Dot product with the normalized vectors (can't use np.dot in W)
    dot = np.sum(v0 * v1)

    # If absolute value of dot product is almost 1, vectors are ~colinear, so use lerp
    if np.abs(dot) > DOT_THRESHOLD:
        res = lerp(t, v0_copy, v1_copy)
        return maybe_torch(res, is_torch)
```

Prepare a YAML config file for the merge:

```yaml
name: mistral-omni
merge_method: slerp
models:
  - model: ../models/Voxtral-Small-24B-2507
  - model: ../models/Mistral-Small-3.2-24B-Instruct-2506
base_model: ../models/Mistral-Small-3.2-24B-Instruct-2506
parameters:
  t:
    - filter: self_attn
      value: [0.1, 0.3, 0.5, 0.3, 0.1, 0]
    - filter: mlp
      value: [0.1, 0.3, 0.5, 0.3, 0.1, 0]
    - value: 0.5 # fallback for the rest of the tensors
dtype: bfloat16
```

Run the merge:

```sh
mergekit-yaml mistral_o.yaml ../models/mistral_o
```

Go to the `mistral_o` output directory, then download `tekken.json` from Voxtral and place it there: https://huggingface.co/mistralai/Voxtral-Small-24B-2507/blob/main/tekken.json

Finally, use `convert_hf_to_gguf.py` to convert the merged model to GGUF as usual.

### Merge the mmproj models

Download these mmproj files:

- Audio: https://huggingface.co/ggml-org/Voxtral-Mini-3B-2507-GGUF/blob/main/mmproj-Voxtral-Mini-3B-2507-Q8_0.gguf
- Vision: https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/blob/main/mmproj-F16.gguf

Rename them to `audio.gguf` and `vision.gguf` respectively.

Then run [merge_mmproj_models.py](https://huggingface.co/ngxson/Home-Cook-Mistral-Small-Omni-24B-2507-GGUF/blob/main/merge_mmproj_models.py) from this repo. The output file will be `mmproj-model.gguf`.
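As a reference for what the patched merge method computes, here is a minimal standalone NumPy sketch of slerp with the shape-mismatch guard from the `generalized_task_arithmetic.py` patch above. This is an illustration based on that snippet, not mergekit's exact implementation; the `DOT_THRESHOLD` value and the `lerp`/`normalize` helpers are assumptions:

```python
import numpy as np

DOT_THRESHOLD = 0.9995  # assumed threshold for the "~colinear" fallback


def lerp(t, v0, v1):
    # Plain linear interpolation between the original (unnormalized) tensors
    return (1 - t) * v0 + t * v1


def normalize(v, eps=1e-8):
    # Scale to unit length; leave near-zero vectors untouched
    n = np.linalg.norm(v)
    return v if n < eps else v / n


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation with the shape-mismatch guard:
    tensors that exist in only one model (so their shapes differ)
    contribute a placeholder zero instead of crashing the merge."""
    v0_copy, v1_copy = np.copy(v0), np.copy(v1)
    v0, v1 = normalize(v0, eps), normalize(v1, eps)

    if v0.shape != v1.shape:
        # Guard added by the patch above
        return np.array([0.0])

    dot = np.sum(v0 * v1)
    if np.abs(dot) > DOT_THRESHOLD:
        # Nearly colinear directions: fall back to lerp
        return lerp(t, v0_copy, v1_copy)

    # Standard slerp: interpolate the angle between the two directions
    theta_0 = np.arccos(dot)
    theta_t = theta_0 * t
    sin_theta_0 = np.sin(theta_0)
    s0 = np.sin(theta_0 - theta_t) / sin_theta_0
    s1 = np.sin(theta_t) / sin_theta_0
    return s0 * v0_copy + s1 * v1_copy
```

With the guard in place, mismatched tensor pairs (such as Voxtral's audio-only weights) simply fall out of the interpolation, which is what allows `mergekit-yaml` to run over two models whose tensor sets do not fully overlap.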