A newer version of this model is available: onnx-community/Kokoro-82M-v1.0-ONNX

Kokoro TTS

Kokoro is a frontier TTS model for its size (82 million parameters), taking text in and producing audio out.


Samples

Each voice below reads the following sample text: "Life is like a box of chocolates. You never know what you're gonna get."

Voice Nationality Gender
Default (af) American Female
Bella (af_bella) American Female
Nicole (af_nicole) American Female
Sarah (af_sarah) American Female
Sky (af_sky) American Female
Adam (am_adam) American Male
Michael (am_michael) American Male
Emma (bf_emma) British Female
Isabella (bf_isabella) British Female
George (bm_george) British Male
Lewis (bm_lewis) British Male

Usage

JavaScript

First, install the kokoro-js library from npm:

npm i kokoro-js

You can then generate speech as follows:

import { KokoroTTS } from "kokoro-js";

const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
  dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
  // Use `tts.list_voices()` to list all available voices
  voice: "af_bella",
});
audio.save("audio.wav");

Python

import os
import numpy as np
from onnxruntime import InferenceSession

# Tokens produced by phonemize() and tokenize() in kokoro.py
tokens = [50, 157, 43, 135, 16, 53, 135, 46, 16, 43, 102, 16, 56, 156, 57, 135, 6, 16, 102, 62, 61, 16, 70, 56, 16, 138, 56, 156, 72, 56, 61, 85, 123, 83, 44, 83, 54, 16, 53, 65, 156, 86, 61, 62, 131, 83, 56, 4, 16, 54, 156, 43, 102, 53, 16, 156, 72, 61, 53, 102, 112, 16, 70, 56, 16, 138, 56, 44, 156, 76, 158, 123, 56, 16, 62, 131, 156, 43, 102, 54, 46, 16, 102, 48, 16, 81, 47, 102, 54, 16, 54, 156, 51, 158, 46, 16, 70, 16, 92, 156, 135, 46, 16, 54, 156, 43, 102, 48, 4, 16, 81, 47, 102, 16, 50, 156, 72, 64, 83, 56, 62, 16, 156, 51, 158, 64, 83, 56, 16, 44, 157, 102, 56, 16, 44, 156, 76, 158, 123, 56, 4]

# Context length is 512, but leave room for the pad token 0 at the start & end
assert len(tokens) <= 510, len(tokens)

# Style vector based on len(tokens), ref_s has shape (1, 256)
voices = np.fromfile('./voices/af.bin', dtype=np.float32).reshape(-1, 1, 256)
ref_s = voices[len(tokens)]

# Add the pad ids and reshape; tokens now has shape (1, <=512)
tokens = [[0, *tokens, 0]]

model_name = 'model.onnx' # Options: model.onnx, model_fp16.onnx, model_quantized.onnx, model_q8f16.onnx, model_uint8.onnx, model_uint8f16.onnx, model_q4.onnx, model_q4f16.onnx
sess = InferenceSession(os.path.join('onnx', model_name))

# Run inference; the first output is the generated waveform (24 kHz)
audio = sess.run(None, dict(
    input_ids=tokens,
    style=ref_s,
    speed=np.ones(1, dtype=np.float32),
))[0]

Optionally, save the audio to a file:

import scipy.io.wavfile as wavfile
wavfile.write('audio.wav', 24000, audio[0])
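
To synthesize with one of the other voices from the samples table, load that voice's style file in place of af.bin. A minimal sketch, assuming each voice ships as its own ./voices/<name>.bin file with the same layout as af.bin (the helper function name is illustrative, not part of the repository):

import numpy as np

def load_voice_style(voice_name, num_tokens):
    # Assumed layout, matching af.bin above: (num_styles, 1, 256),
    # indexed by the number of tokens *before* the pad ids are added.
    styles = np.fromfile(f'./voices/{voice_name}.bin', dtype=np.float32).reshape(-1, 1, 256)
    return styles[num_tokens]

# Usage with the example above (call before the pad ids are added):
#   ref_s = load_voice_style('bm_george', len(tokens))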

Quantizations

The model is resilient to quantization, enabling efficient, high-quality speech synthesis at a fraction of the original model size. Any of the variants in the table below can be dropped into the Python example above (see the sketch after the table).

Each quantized variant below was sampled reading the following text: "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."

Model Size (MB)
model.onnx (fp32) 326
model_fp16.onnx (fp16) 163
model_quantized.onnx (8-bit) 92.4
model_q8f16.onnx (Mixed precision) 86
model_uint8.onnx (8-bit & mixed precision) 177
model_uint8f16.onnx (Mixed precision) 114
model_q4.onnx (4-bit matmul) 305
model_q4f16.onnx (4-bit matmul & fp16 weights) 154
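
Any of the files above can be swapped into the Python example by changing model_name; the 8-bit model, for instance, is roughly 3.5x smaller than the fp32 original. A minimal sketch, assuming the ONNX files have been downloaded into the same onnx/ directory used earlier:

import os
from onnxruntime import InferenceSession

# Load the 8-bit quantized model (~92 MB vs. 326 MB for fp32);
# the rest of the Python example above stays the same.
model_name = 'model_quantized.onnx'
sess = InferenceSession(os.path.join('onnx', model_name))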