---
license: apache-2.0
language:
- en
base_model:
- ibm-granite/granite-3.3-2b-instruct
library_name: transformers
---
# Granite-speech-3.3-2b

**Model Summary:**
Granite-speech-3.3-2b is a compact and efficient speech-language model, specifically designed for automatic speech recognition (ASR) and automatic speech translation (AST). Unlike integrated models that combine speech and language into a single pass, granite-speech-3.3-2b uses a two-pass design: an initial call transcribes audio files into text, and a second, explicitly initiated call is required to process the transcribed text with the underlying Granite language model.

The model was trained on a collection of public corpora comprising diverse datasets for ASR and AST, as well as synthetic datasets tailored to support the speech translation task. Granite-speech-3.3-2b was trained by modality-aligning granite-3.3-2b-instruct (https://huggingface.co/ibm-granite/granite-3.3-2b-instruct) to speech on publicly available open-source corpora containing audio inputs and text targets.
We are currently investigating an issue with greedy decoding (`num_beams=1`); the model performs reliably with beam sizes > 1, which we recommend for all use cases. Additionally, the model may occasionally hallucinate on very short audio inputs (<0.1s). These issues are under active investigation, and we will update guidance as fixes become available.
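Until fixes land, a simple input check can enforce both recommendations before calling `generate`. This is an illustrative sketch only; `example.wav` is a placeholder path, and the 0.1-second threshold simply mirrors the guidance above:

```python
import torchaudio

MIN_DURATION_S = 0.1  # very short clips may trigger hallucinations (see above)

wav, sr = torchaudio.load("example.wav", normalize=True)  # placeholder path
if wav.shape[1] / sr < MIN_DURATION_S:
    raise ValueError("Audio shorter than 0.1s; output may be unreliable.")

# When generating, prefer beam search over greedy decoding, e.g.:
# model.generate(..., num_beams=4, do_sample=False)
```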
**Evaluations:**

We evaluated granite-speech-3.3-2b alongside other speech-language models (SLMs) with fewer than 8b parameters, as well as dedicated ASR and AST systems, on standard benchmarks. The evaluation spanned multiple public benchmarks, with particular emphasis on English ASR tasks, while also including AST for En-X translation.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/7n0soblI3pCISpHbwFHI8.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/6m5wBbl2UTM-MWO1-f8Pf.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/cVzCIuH0x_8W7Pz6QHwE8.png)
**Release Date:** April 28, 2025

**License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

**Supported Languages:**
English
**Intended Use:**
The model is intended to be used in enterprise applications that involve processing of speech inputs. In particular, the model is well-suited for English speech-to-text and for speech translation from English to some major European languages, such as French, Spanish, Italian, German, and Portuguese, as well as Japanese and Mandarin. For tasks that exclusively involve text-based input, we suggest using our Granite large language models, which are optimized for text-only processing and offer superior performance compared to this model.

## Generation:

The Granite Speech model is supported natively in `transformers` from the `main` branch. Below is a simple example of how to use the granite-speech-3.3-2b model.
### Usage with `transformers`

First, make sure to build the latest version of `transformers` from source:
```shell
pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile
```

Then run the code:
```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-3.3-2b"
speech_granite_processor = AutoProcessor.from_pretrained(model_name)
tokenizer = speech_granite_processor.tokenizer
speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(model_name).to(device)

# prepare speech and text prompt, using the appropriate prompt template
audio_path = hf_hub_download(repo_id=model_name, filename='10226_10111_000000.wav')
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000  # mono, 16khz

# create text prompt
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: April 28, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

# compute audio embeddings
model_inputs = speech_granite_processor(
    text,
    wav,
    device=device,  # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)

model_outputs = speech_granite.generate(
    **model_inputs,
    max_new_tokens=200,
    num_beams=4,
    do_sample=False,
    min_length=1,
    top_p=1.0,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=1.0,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Transformers includes the input IDs in the response.
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
```
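The example above asserts mono, 16 kHz input. If your audio comes in a different format, a small preprocessing step along these lines (a sketch; `input.wav` is a placeholder path) brings it into the expected shape:

```python
import torchaudio

wav, sr = torchaudio.load("input.wav", normalize=True)  # placeholder path
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)  # downmix multi-channel audio to mono
if sr != 16000:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
    sr = 16000
```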
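This completes the first pass of the two-pass design described in the model summary. Processing the transcript with the underlying Granite language model requires a second, explicit call. Below is a minimal sketch of that second pass, reusing `speech_granite`, `tokenizer`, `device`, and `output_text` from the example above; the summarization prompt is illustrative, and we assume a prompt without the `<|audio|>` token is handled by the language model alone:

```python
# Second pass: send the first-pass transcript back as a text-only prompt.
# The summarization instruction is illustrative; any text task works here.
chat = [
    {
        "role": "user",
        "content": f"Summarize the following transcript in one sentence:\n{output_text[0]}",
    }
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# The chat template already inserts the special tokens.
text_inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(device)
text_outputs = speech_granite.generate(**text_inputs, max_new_tokens=200, num_beams=4)
print(tokenizer.decode(
    text_outputs[0, text_inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```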
**Model Architecture:**

The architecture of granite-speech-3.3-2b consists of the following components:

(1) Speech encoder: 10 conformer blocks trained with Connectionist Temporal Classification (CTC) on character-level targets on the subset containing only ASR corpora (see configuration below). In addition, our CTC encoder uses block attention over 4-second audio blocks and self-conditioned CTC from the middle layer.

| Configuration parameter | Value                |
|-------------------------|----------------------|
| Input dimension          | 160 (80 logmels x 2) |
| Nb. of layers            | 10                   |
| Hidden dimension         | 1024                 |
| Nb. of attention heads   | 8                    |
| Attention head size      | 128                  |
| Convolution kernel size  | 15                   |
| Output dimension         | 42                   |

(2) Speech projector and temporal downsampler (speech-text modality adapter): we use a 2-layer window query transformer (q-former) operating on blocks of 15 1024-dimensional acoustic embeddings coming out of the last conformer block of the speech encoder, which are downsampled by a factor of 5 using 3 trainable queries per block and per layer. The total temporal downsampling factor is 10 (2x from the encoder and 5x from the projector), resulting in a 10Hz acoustic embedding rate for the LLM. The encoder, projector, and LoRA adapters were fine-tuned/trained jointly on all the corpora mentioned under **Training Data**.
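As a quick sanity check of the downsampling arithmetic above (a sketch: the 100 Hz log-mel frame rate assumes the common 10 ms hop, which is not stated on this card):

```python
logmel_rate_hz = 100                   # assumed: standard 10 ms log-mel frame hop
encoder_factor = 2                     # 2 stacked frames (80 logmels x 2 = 160 dims)
block_size, queries_per_block = 15, 3  # q-former block size and trainable queries
projector_factor = block_size // queries_per_block  # 15 // 3 = 5

total_factor = encoder_factor * projector_factor    # 2 * 5 = 10
print(total_factor, logmel_rate_hz / total_factor)  # 10 10.0 -> 10Hz rate for the LLM
```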

(3) Large language model: granite-3.3-2b-instruct with 128k context length (https://huggingface.co/ibm-granite/granite-3.3-2b-instruct).

(4) LoRA adapters: rank=64, applied to the query and value projection matrices.
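For reference, a hypothetical PEFT configuration matching (4) could look as follows; the `q_proj`/`v_proj` module names are our assumption about the attention-layer naming, not something this card specifies:

```python
from peft import LoraConfig

# Hypothetical adapter config mirroring (4): rank 64 on query/value projections.
lora_config = LoraConfig(
    r=64,                                 # adapter rank, as stated above
    target_modules=["q_proj", "v_proj"],  # assumed names of query/value projections
    task_type="CAUSAL_LM",
)
```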

**Training Data:**

Overall, our training data is largely comprised of two key sources: (1) publicly available datasets and (2) synthetic data created from publicly available datasets, specifically targeting the speech translation task. A detailed description of the training datasets can be found in the table below:
| Name | Task | Nb. hours | Source |
|-----------|--------------|----------------|--------------|
| CommonVoice-17 English | ASR | 2600 | https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0 |
| MLS English | ASR | 44000 | https://huggingface.co/datasets/facebook/multilingual_librispeech |
| Librispeech | ASR | 1000 | https://huggingface.co/datasets/openslr/librispeech_asr |
| VoxPopuli English | ASR | 500 | https://huggingface.co/datasets/facebook/voxpopuli |
| AMI | ASR | 100 | https://huggingface.co/datasets/edinburghcstr/ami |
| YODAS English | ASR | 10000 | https://huggingface.co/datasets/espnet/yodas |
| Switchboard English | ASR | 260 | https://catalog.ldc.upenn.edu/LDC97S62 |
| CallHome English | ASR | 18 | https://catalog.ldc.upenn.edu/LDC97T14 |
| Fisher | ASR | 2000 | https://catalog.ldc.upenn.edu/LDC2004S13 |
| Voicemail part I | ASR | 40 | https://catalog.ldc.upenn.edu/LDC98S77 |
| Voicemail part II | ASR | 40 | https://catalog.ldc.upenn.edu/LDC2002S35 |
| CommonVoice-17 En->De,Es,Fr,It,Ja,Pt,Zh | AST | 2600*7 | Translations with Phi-4 and MADLAD |

**Infrastructure:**
We train Granite Speech using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. The training of this particular model was completed in 9 days on 32 H100 GPUs.

**Ethical Considerations and Limitations:**

Users should be aware that the model may produce unreliable outputs when decoding with `num_beams=1` or when processing extremely short audio clips (<0.1s). Until further updates are released, we recommend using beam sizes greater than 1 and avoiding inputs below the 0.1-second threshold to ensure more consistent performance.

The use of Large Speech and Language Models may involve risks and ethical considerations that people should be aware of. These risks may include bias and fairness, misinformation, and autonomous decision-making. We urge the community to use granite-speech-3.3-2b in a manner consistent with IBM's Responsible Use Guide or similar responsible use structures. IBM recommends using this model for automatic speech recognition tasks. The model's modular design improves safety by limiting how audio inputs can influence the system: if an unfamiliar or malformed prompt is received, the model simply echoes it along with its transcription. This minimizes the risk of adversarial inputs, unlike integrated models that directly interpret audio and may be more exposed to such attacks. Note that more general speech tasks may pose higher inherent risks of triggering unwanted outputs.

To enhance safety, we recommend using granite-speech-3.3-2b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.

**Resources**
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources