Improve model card: Add `transformers` library, `any-to-any` pipeline tag, and update links (#5)

- Improve model card: Add `transformers` library, `any-to-any` pipeline tag, and update links (7bed6717f1ce2cf58039e45882af3c9394cc356d)

Co-authored-by: Niels Rogge

Files changed (1)

README.md +9 -6

@@ -1,5 +1,7 @@
 ---
 license: apache-2.0
 ---
 <div align="center">
@@ -8,7 +10,7 @@ license: apache-2.0
 <div align="center" style="line-height: 1;">
   <a href="https://github.com/stepfun-ai/Step-Audio2" target="_blank"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-StepFun-white?logo=github&logoColor=white"/></a> &ensp;
-  <a href="https://stepfun.com/" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a> &ensp;
   <a href="https://x.com/StepFun_ai" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-StepFun-white?logo=x&logoColor=white"/></a> &ensp;
   <a href="https://discord.com/invite/XHheP5Fn" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-StepFun-white?logo=discord&logoColor=white"/></a>
 </div>
@@ -24,7 +26,7 @@ license: apache-2.0
 ## Introduction
-Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
 - **Advanced Speech and Audio Understanding**: Promising performance in ASR and audio understanding by comprehending and reasoning semantic information, para-linguistic and non-vocal information.
@@ -32,7 +34,7 @@ Step-Audio 2 is an end-to-end multi-modal large language model designed for indu
 - **Tool Calling and Multimodal RAG**: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.
-- **State-of-the-Art Performance**: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and [Technical Report](https://arxiv.org/pdf/2507.16632)).
 + **Open-source**: [Step-Audio 2 mini](https://huggingface.co/stepfun-ai/Step-Audio-2-mini) and [Step-Audio 2 mini Base](https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base) are released under [Apache 2.0](LICENSE) license.
@@ -198,6 +200,7 @@ CER for Chinese, Cantonese and Japanese and WER for Arabian and English. N/A ind
       <td align="center">7.01</td>
       <td align="center">2.68</td>
       <td align="center"><strong>2.53</strong></td>
     </tr>
     <tr>
       <td align="left">KeSpeech phase1</td>
@@ -758,7 +761,7 @@ URO-Bench. U. R. O. stands for understanding, reasoning, and oral conversation,
       <td align="center"><strong>83.32</strong></td>
       <td align="center"><strong>91.05</strong></td>
       <td align="center"><strong>75.45</strong></td>
-      <td align="center"><strong>86.08</strong></td>
       <td align="center">68.25</td>
       <td align="center">74.78</td>
       <td align="center"><strong>63.18</strong></td>
@@ -838,7 +841,7 @@ URO-Bench. U. R. O. stands for understanding, reasoning, and oral conversation,
       <td align="center">60.12</td>
       <td align="center">77.65</td>
       <td align="center">61.25</td>
-      <td align="center">58.79</td>
       <td align="center">61.94</td>
       <td align="center">63.80</td>
     </tr>
@@ -866,4 +869,4 @@ The model and code in the repository is licensed under [Apache 2.0](LICENSE) Lic
       primaryClass={cs.CL},
       url={https://arxiv.org/abs/2507.16632},
 }
-```

 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: any-to-any
 ---
 <div align="center">
 <div align="center" style="line-height: 1;">
   <a href="https://github.com/stepfun-ai/Step-Audio2" target="_blank"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-StepFun-white?logo=github&logoColor=white"/></a> &ensp;
+  <a href="https://www.stepfun.com/docs/en/step-audio2" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a> &ensp;
   <a href="https://x.com/StepFun_ai" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-StepFun-white?logo=x&logoColor=white"/></a> &ensp;
   <a href="https://discord.com/invite/XHheP5Fn" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-StepFun-white?logo=discord&logoColor=white"/></a>
 </div>
 ## Introduction
+Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation, presented in the paper [Step-Audio 2 Technical Report](https://huggingface.co/papers/2507.16632).
 - **Advanced Speech and Audio Understanding**: Promising performance in ASR and audio understanding by comprehending and reasoning semantic information, para-linguistic and non-vocal information.
 - **Tool Calling and Multimodal RAG**: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.
+- **State-of-the-Art Performance**: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and [Technical Report](https://huggingface.co/papers/2507.16632)).
 + **Open-source**: [Step-Audio 2 mini](https://huggingface.co/stepfun-ai/Step-Audio-2-mini) and [Step-Audio 2 mini Base](https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base) are released under [Apache 2.0](LICENSE) license.
       <td align="center">7.01</td>
       <td align="center">2.68</td>
       <td align="center"><strong>2.53</strong></td>
+      <td align="center">2.53</td>
     </tr>
     <tr>
       <td align="left">KeSpeech phase1</td>
       <td align="center"><strong>83.32</strong></td>
       <td align="center"><strong>91.05</strong></td>
       <td align="center"><strong>75.45</strong></td>
+      <align="center"><strong>86.08</strong></td>
       <td align="center">68.25</td>
       <td align="center">74.78</td>
       <td align="center"><strong>63.18</strong></td>
       <td align="center">60.12</td>
       <td align="center">77.65</td>
       <td align="center">61.25</td>
+      <align="center">58.79</td>
       <td align="center">61.94</td>
       <td align="center">63.80</td>
     </tr>
       primaryClass={cs.CL},
       url={https://arxiv.org/abs/2507.16632},
 }
+```