Improve model card: Add `transformers` library, `any-to-any` pipeline tag, and update links (#5)
- Improve model card: Add `transformers` library, `any-to-any` pipeline tag, and update links (7bed6717f1ce2cf58039e45882af3c9394cc356d)
Co-authored-by: Niels Rogge <[email protected]>
README.md
CHANGED
@@ -1,5 +1,7 @@
|
|
1 |
---
|
2 |
license: apache-2.0
|
|
|
|
|
3 |
---
|
4 |
|
5 |
<div align="center">
|
@@ -8,7 +10,7 @@ license: apache-2.0
|
|
8 |
|
9 |
<div align="center" style="line-height: 1;">
|
10 |
<a href="https://github.com/stepfun-ai/Step-Audio2" target="_blank"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-StepFun-white?logo=github&logoColor=white"/></a>  
|
11 |
-
<a href="https://stepfun.com/" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a>  
|
12 |
<a href="https://x.com/StepFun_ai" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-StepFun-white?logo=x&logoColor=white"/></a>  
|
13 |
<a href="https://discord.com/invite/XHheP5Fn" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-StepFun-white?logo=discord&logoColor=white"/></a>
|
14 |
</div>
|
@@ -24,7 +26,7 @@ license: apache-2.0
|
|
24 |
## Introduction
|
25 |
|
26 |
|
27 |
-
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
|
28 |
|
29 |
- **Advanced Speech and Audio Understanding**: Promising performance in ASR and audio understanding by comprehending and reasoning semantic information, para-linguistic and non-vocal information.
|
30 |
|
@@ -32,7 +34,7 @@ Step-Audio 2 is an end-to-end multi-modal large language model designed for indu
|
|
32 |
|
33 |
- **Tool Calling and Multimodal RAG**: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.
|
34 |
|
35 |
-
- **State-of-the-Art Performance**: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and [Technical Report](https://
|
36 |
|
37 |
+ **Open-source**: [Step-Audio 2 mini](https://huggingface.co/stepfun-ai/Step-Audio-2-mini) and [Step-Audio 2 mini Base](https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base) are released under [Apache 2.0](LICENSE) license.
|
38 |
|
@@ -198,6 +200,7 @@ CER for Chinese, Cantonese and Japanese and WER for Arabian and English. N/A ind
|
|
198 |
<td align="center">7.01</td>
|
199 |
<td align="center">2.68</td>
|
200 |
<td align="center"><strong>2.53</strong></td>
|
|
|
201 |
</tr>
|
202 |
<tr>
|
203 |
<td align="left">KeSpeech phase1</td>
|
@@ -758,7 +761,7 @@ URO-Bench. U. R. O. stands for understanding, reasoning, and oral conversation,
|
|
758 |
<td align="center"><strong>83.32</strong></td>
|
759 |
<td align="center"><strong>91.05</strong></td>
|
760 |
<td align="center"><strong>75.45</strong></td>
|
761 |
-
<
|
762 |
<td align="center">68.25</td>
|
763 |
<td align="center">74.78</td>
|
764 |
<td align="center"><strong>63.18</strong></td>
|
@@ -838,7 +841,7 @@ URO-Bench. U. R. O. stands for understanding, reasoning, and oral conversation,
|
|
838 |
<td align="center">60.12</td>
|
839 |
<td align="center">77.65</td>
|
840 |
<td align="center">61.25</td>
|
841 |
-
<
|
842 |
<td align="center">61.94</td>
|
843 |
<td align="center">63.80</td>
|
844 |
</tr>
|
@@ -866,4 +869,4 @@ The model and code in the repository is licensed under [Apache 2.0](LICENSE) Lic
|
|
866 |
primaryClass={cs.CL},
|
867 |
url={https://arxiv.org/abs/2507.16632},
|
868 |
}
|
869 |
-
```
|
|
|
1 |
---
|
2 |
license: apache-2.0
|
3 |
+
library_name: transformers
|
4 |
+
pipeline_tag: any-to-any
|
5 |
---
|
6 |
|
7 |
<div align="center">
|
|
|
10 |
|
11 |
<div align="center" style="line-height: 1;">
|
12 |
<a href="https://github.com/stepfun-ai/Step-Audio2" target="_blank"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-StepFun-white?logo=github&logoColor=white"/></a>  
|
13 |
+
<a href="https://www.stepfun.com/docs/en/step-audio2" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a>  
|
14 |
<a href="https://x.com/StepFun_ai" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-StepFun-white?logo=x&logoColor=white"/></a>  
|
15 |
<a href="https://discord.com/invite/XHheP5Fn" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-StepFun-white?logo=discord&logoColor=white"/></a>
|
16 |
</div>
|
|
|
26 |
## Introduction
|
27 |
|
28 |
|
29 |
+
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation, presented in the paper [Step-Audio 2 Technical Report](https://huggingface.co/papers/2507.16632).
|
30 |
|
31 |
- **Advanced Speech and Audio Understanding**: Promising performance in ASR and audio understanding by comprehending and reasoning semantic information, para-linguistic and non-vocal information.
|
32 |
|
|
|
34 |
|
35 |
- **Tool Calling and Multimodal RAG**: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.
|
36 |
|
37 |
+
- **State-of-the-Art Performance**: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and [Technical Report](https://huggingface.co/papers/2507.16632)).
|
38 |
|
39 |
+ **Open-source**: [Step-Audio 2 mini](https://huggingface.co/stepfun-ai/Step-Audio-2-mini) and [Step-Audio 2 mini Base](https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base) are released under [Apache 2.0](LICENSE) license.
|
40 |
|
|
|
200 |
<td align="center">7.01</td>
|
201 |
<td align="center">2.68</td>
|
202 |
<td align="center"><strong>2.53</strong></td>
|
203 |
+
<td align="center">2.53</td>
|
204 |
</tr>
|
205 |
<tr>
|
206 |
<td align="left">KeSpeech phase1</td>
|
|
|
761 |
<td align="center"><strong>83.32</strong></td>
|
762 |
<td align="center"><strong>91.05</strong></td>
|
763 |
<td align="center"><strong>75.45</strong></td>
|
764 |
+
<td align="center"><strong>86.08</strong></td>
|
765 |
<td align="center">68.25</td>
|
766 |
<td align="center">74.78</td>
|
767 |
<td align="center"><strong>63.18</strong></td>
|
|
|
841 |
<td align="center">60.12</td>
|
842 |
<td align="center">77.65</td>
|
843 |
<td align="center">61.25</td>
|
844 |
+
<td align="center">58.79</td>
|
845 |
<td align="center">61.94</td>
|
846 |
<td align="center">63.80</td>
|
847 |
</tr>
|
|
|
869 |
primaryClass={cs.CL},
|
870 |
url={https://arxiv.org/abs/2507.16632},
|
871 |
}
|
872 |
+
```
|