|
--- |
|
library_name: mlx |
|
pipeline_tag: text-generation |
|
inference: false |
|
license: apache-2.0 |
|
base_model: openai/gpt-oss-20b |
|
base_model_relation: quantized |
|
language: |
|
- en |
|
- ro |
|
tags: |
|
- apple-silicon |
|
- metal |
|
- arm64 |
|
- 5-bit |
|
- group-size-32 |
|
- moe |
|
- mxfp4
|
- openai |
|
- halley-ai |
|
--- |
|
# gpt-oss-20b — MLX 5-bit (group size 32) |
|
|
|
**Summary.** This is a 5-bit (**Q5**) **MLX** quantization of **gpt-oss-20b** (sparse Mixture-of-Experts, MXFP4 in the original release). Group size is **32**.
|
Built for **Apple Silicon** with Metal acceleration. |
|
|
|
- **Base model:** `openai/gpt-oss-20b` (Apache-2.0) |
|
- **Quantization:** MLX Q5, `q_group_size=32` (some tensors remain FP16 for stability) |
|
- **Files:** MLX weight shards + `config.json`; tokenizer files included for drop-in use |
|
- **Footprint:** ~**15.76 GB** on disk |
|
- **Intended use:** local inference / research on M-series Macs |
|
- **Not intended for:** safety-critical decisions; outputs may be inaccurate or biased |
|
|
|
## Requirements |
|
**Runs on:** Apple Silicon (M1 or newer) with **macOS ≥ 13.5** via **MLX (Metal)**. |
|
**Not supported:** Intel macOS / Linux / Windows (use a GGUF build + llama.cpp instead). |
|
**RAM guidance:** **24 GB minimum** for Q5 gs=32. Extra RAM improves headroom. |
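
Optionally, you can sanity-check the Metal backend and available unified memory from Python before loading the model. This is an illustrative check, not part of this repo, and `mx.metal.device_info()` assumes a reasonably recent MLX release:

```python
# Quick environment check (illustrative; device_info() requires a recent MLX).
import mlx.core as mx

print("Metal available:", mx.metal.is_available())

info = mx.metal.device_info()  # reports chip architecture and unified memory size
print("Chip:", info.get("architecture"))
print("Unified memory (GB):", round(info.get("memory_size", 0) / 1e9, 1))
```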
|
|
|
## How to use (MLX) |
|
|
|
```bash |
|
pip install mlx-lm transformers |
|
``` |
|
|
|
```python |
|
# Python API (uses tokenizer bundled with this repo) |
|
from mlx_lm import load, generate |
|
|
|
model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32") |
|
print(generate( |
|
model, tokenizer, |
|
prompt="Explain the Chudnovsky algorithm to compute π.", |
|
max_tokens=256, max_kv_size=512 |
|
)) |
|
``` |
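
The call above is plain text completion. For chat-style use, you can apply the bundled tokenizer's chat template first; the sketch below uses the standard `apply_chat_template` API from `transformers` together with the same `generate` call as above.

```python
# Chat-style prompting via the tokenizer's chat template (illustrative sketch).
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")

messages = [
    {"role": "user", "content": "Explain the Chudnovsky algorithm to compute pi."},
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256, max_kv_size=512))
```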
|
|
|
```bash |
|
# CLI |
|
python -m mlx_lm generate --model halley-ai/gpt-oss-20b-MLX-5bit-gs32 \ |
|
--prompt "Explain the Chudnovsky algorithm to compute pi." \ |
|
--max-kv-size 512 --max-tokens 256 |
|
``` |
|
|
|
## Performance (Apple Silicon, real-world) |
|
|
|
Measured with LM Studio and the MLX CLI (Q5, gs=32), ~2k-token responses:
|
- M1 Max (32 GB): ~45–50 tok/s, 0.40–0.60 s TTFB |
|
- M4 Pro (24 GB): ~65–70 tok/s, 0.25–0.45 s TTFB |
|
- M3 Ultra (256 GB): pending |
|
|
|
Throughput varies with Mac model, context, and sampler settings. |
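
To get a rough number on your own machine, `mlx_lm`'s `generate` can print its built-in timing statistics; treat this as a quick check rather than a rigorous benchmark.

```python
# Rough local throughput check using mlx_lm's built-in timing output.
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")
generate(
    model, tokenizer,
    prompt="Write a short overview of the fast Fourier transform.",
    max_tokens=512,
    verbose=True,  # prints prompt/generation tokens-per-second and peak memory
)
```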
|
|
|
## Evaluation |
|
|
|
Perplexity (PPL) streaming evaluation on WikiText-2; window=stride=4096, ~100k tokens, EOS inserted between docs. |
|
| Variant | PPL (ctx=4096) |
|---|---|
| MLX 8-bit (gs=64, reference) | 10.75 |
| MLX 6-bit (gs=32) | 10.46 (−2.7% vs 8-bit/gs64) |
| **MLX 5-bit (gs=32)** | **11.11 (+3.3% vs 8-bit/gs64, +6.2% vs 6-bit/gs32)** |
| MLX 4-bit (gs=32) | 13.70 (+27.4% vs 8-bit/gs64, +31.0% vs 6-bit/gs32) |
|
|
|
**Interpretation** |
|
- MLX 6-bit/gs32: Best of the group; edges out 8-bit/gs64 slightly at a smaller footprint.
|
- MLX 5-bit/gs32: Small, consistent quality drop vs 6-bit/gs32 and 8-bit/gs64 (~3–6% PPL) at a noticeably smaller footprint; a strong option when memory or GPU buffer limits rule out the larger variants.
|
- MLX 8-bit/gs64: Solid reference; near‑FP16 quality at a larger footprint. |
|
- MLX 4-bit/gs32: Trades accuracy for footprint; use when RAM is constrained or throughput is the priority. |
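
For reference, the snippet below is a minimal sketch of the streaming evaluation described above (non-overlapping 4096-token windows, EOS inserted between documents, ~100k-token budget). It is illustrative only and is not the exact harness used to produce the numbers in the table; the `streaming_ppl` helper and the dataset comment at the end are hypothetical.

```python
# Minimal sketch of a streaming perplexity evaluation (illustrative only).
import math

import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")

def streaming_ppl(docs, window=4096, budget=100_000):
    # Concatenate documents with EOS between them, capped at the token budget.
    ids = []
    for doc in docs:
        ids.extend(tokenizer.encode(doc))
        ids.append(tokenizer.eos_token_id)
    ids = ids[:budget]

    total_nll, total_tokens = 0.0, 0
    # Non-overlapping windows (stride == window).
    for start in range(0, len(ids) - 1, window):
        chunk = mx.array(ids[start:start + window + 1])[None]      # (1, <= window+1)
        logits = model(chunk[:, :-1]).astype(mx.float32)           # (1, L, vocab)
        targets = chunk[:, 1:]                                     # next-token labels
        nll = nn.losses.cross_entropy(logits, targets, reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.size
    return math.exp(total_nll / total_tokens)

# Example (hypothetical): docs = [row["text"] for row in wikitext2_test]
# print(streaming_ppl(docs))
```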
|
|
|
## Conversion details (provenance) |
|
|
|
```bash |
|
python -m mlx_lm convert \ |
|
--hf-path openai/gpt-oss-20b \ |
|
--mlx-path gpt-oss-20b-mlx-q5-gs32 \ |
|
--q-bits 5 --q-group-size 32 -q |
|
``` |
|
|
|
- Some non-expert tensors (embeddings, norms, router) remain FP16 for stability; the sketch below shows one way to confirm this from the converted output.
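
As a sanity check, you can inspect which tensors carry quantization metadata versus plain FP16 weights. The sketch below is illustrative: the local `path` and the exact contents of `config.json`'s `quantization` entry are assumptions about the MLX conversion output rather than guarantees.

```python
# Illustrative check of quantized vs. FP16 tensors in the converted output.
import glob
import json

import mlx.core as mx

path = "gpt-oss-20b-mlx-q5-gs32"  # local output of the conversion command above

with open(f"{path}/config.json") as f:
    cfg = json.load(f)
print(cfg.get("quantization"))  # expected to include the bits / group_size settings

quantized, fp16 = set(), set()
for shard in glob.glob(f"{path}/*.safetensors"):
    for name, tensor in mx.load(shard).items():
        base = name.rsplit(".", 1)[0]
        if name.endswith(".scales"):  # MLX quantized layers store weight/scales/biases
            quantized.add(base)
        elif tensor.dtype == mx.float16:
            fp16.add(base)

print(f"{len(quantized)} quantized modules, {len(fp16)} fp16 tensors")
```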
|
|
|
## Sibling & reference models |
|
- halley-ai/gpt-oss-20b-MLX-6bit-gs32 |
|
- halley-ai/gpt-oss-20b-MLX-4bit-gs32 |
|
- Reference (8-bit, upstream): lmstudio-community/gpt-oss-20b-MLX-8bit |
|
|
|
## Limitations & biases |
|
|
|
Outputs may be factually wrong or unsafe. Don’t use for medical, legal, or financial decisions without human review. |
|
MoE models can be sensitive to prompt wording; prefer explicit instructions and structure. |
|
|
|
## License & credits |
|
- License: Apache-2.0 (inherits from base model) |
|
- Base model: OpenAI gpt-oss-20B |
|
- Quantization: Halley AI Lab (MLX Q5, gs=32) |
|
- Please cite both the base model and this repository when you use the weights. |
|
|