giantPanda0906 and nielsr (HF Staff) committed · Commit 2aa363d · verified · 1 Parent(s): cbd9f0a

Improve model card: Add `transformers` library, `any-to-any` pipeline tag, and update links (#5)

- Improve model card: Add `transformers` library, `any-to-any` pipeline tag, and update links (7bed6717f1ce2cf58039e45882af3c9394cc356d)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +9 -6
README.md CHANGED
@@ -1,5 +1,7 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: any-to-any
 ---
 
 <div align="center">
@@ -8,7 +10,7 @@ license: apache-2.0
 
 <div align="center" style="line-height: 1;">
 <a href="https://github.com/stepfun-ai/Step-Audio2" target="_blank"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-StepFun-white?logo=github&logoColor=white"/></a> &ensp;
-<a href="https://stepfun.com/" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a> &ensp;
+<a href="https://www.stepfun.com/docs/en/step-audio2" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a> &ensp;
 <a href="https://x.com/StepFun_ai" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-StepFun-white?logo=x&logoColor=white"/></a> &ensp;
 <a href="https://discord.com/invite/XHheP5Fn" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-StepFun-white?logo=discord&logoColor=white"/></a>
 </div>
@@ -24,7 +26,7 @@ license: apache-2.0
 ## Introduction
 
 
-Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
+Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation, presented in the paper [Step-Audio 2 Technical Report](https://huggingface.co/papers/2507.16632).
 
 - **Advanced Speech and Audio Understanding**: Promising performance in ASR and audio understanding by comprehending and reasoning semantic information, para-linguistic and non-vocal information.
 
@@ -32,7 +34,7 @@ Step-Audio 2 is an end-to-end multi-modal large language model designed for indu
 
 - **Tool Calling and Multimodal RAG**: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.
 
-- **State-of-the-Art Performance**: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and [Technical Report](https://arxiv.org/pdf/2507.16632)).
+- **State-of-the-Art Performance**: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and [Technical Report](https://huggingface.co/papers/2507.16632)).
 
 + **Open-source**: [Step-Audio 2 mini](https://huggingface.co/stepfun-ai/Step-Audio-2-mini) and [Step-Audio 2 mini Base](https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base) are released under [Apache 2.0](LICENSE) license.
 
@@ -198,6 +200,7 @@ CER for Chinese, Cantonese and Japanese and WER for Arabian and English. N/A ind
 <td align="center">7.01</td>
 <td align="center">2.68</td>
 <td align="center"><strong>2.53</strong></td>
+<td align="center">2.53</td>
 </tr>
 <tr>
 <td align="left">KeSpeech phase1</td>
@@ -758,7 +761,7 @@ URO-Bench. U. R. O. stands for understanding, reasoning, and oral conversation,
 <td align="center"><strong>83.32</strong></td>
 <td align="center"><strong>91.05</strong></td>
 <td align="center"><strong>75.45</strong></td>
-<td align="center"><strong>86.08</strong></td>
+<align="center"><strong>86.08</strong></td>
 <td align="center">68.25</td>
 <td align="center">74.78</td>
 <td align="center"><strong>63.18</strong></td>
@@ -838,7 +841,7 @@ URO-Bench. U. R. O. stands for understanding, reasoning, and oral conversation,
 <td align="center">60.12</td>
 <td align="center">77.65</td>
 <td align="center">61.25</td>
-<td align="center">58.79</td>
+<align="center">58.79</td>
 <td align="center">61.94</td>
 <td align="center">63.80</td>
 </tr>
@@ -866,4 +869,4 @@ The model and code in the repository is licensed under [Apache 2.0](LICENSE) Lic
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2507.16632},
 }
-```
+```