buyun ryanmiao committed on
Commit 7c5e63b · verified · 1 Parent(s): aa82b18

Update README.md (#1)


- Update README.md (b0911100145ef4499e79bc5f92e5842d881c7082)


Co-authored-by: mrh <[email protected]>

Files changed (1)
  1. README.md +166 -7
README.md CHANGED
@@ -1,20 +1,179 @@
  ---
  license: apache-2.0
  ---
- # Step-Audio

- Step-Audio is StepFun's open-source intelligent voice interaction framework. The Step-Audio framework integrates speech recognition, semantic understanding, dialogue management, voice cloning, and speech generation. This unified architecture achieves low end-to-end latency and supports full-duplex interaction, making it suitable for real-time applications.

- Step-Audio is an open-source intelligent voice interaction framework developed by StepFun. The Step-Audio framework integrates speech recognition, semantic understanding, dialogue management, voice cloning, and speech generation. This unified architecture achieves low end-to-end latency and can be used for full-duplex interaction, making it suitable for real-time applications.

- # Step-Audio-Chat

- This repository is the multimodal large language model (LLM) component of Step-Audio. It is a 130-billion-parameter multimodal LLM responsible for understanding and generating human speech. The model is specifically designed to seamlessly integrate functions such as speech recognition, semantic understanding, dialogue management, voice cloning, and speech generation.

- This repository contains the Multimodal Large Language Model (LLM) component of Step-Audio. It is a 130 billion parameter multimodal LLM that is responsible for understanding and generating human speech. The model is specifically designed to seamlessly integrate functions such as speech recognition, semantic understanding, dialogue management, voice cloning, and speech generation.

- For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).
+ # 1. Step-Audio-Chat

+ This repository contains the Multimodal Large Language Model (LLM) component of Step-Audio. It is a 130 billion parameter multimodal LLM that is responsible for understanding and generating human speech. The model is specifically designed to seamlessly integrate functions such as speech recognition, semantic understanding, dialogue management, voice cloning, and speech generation.
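+ As a quick orientation, the snippet below is a minimal, hypothetical sketch of fetching this checkpoint from the Hugging Face Hub; the repo id is assumed from this model card, and the actual speech-to-speech inference pipeline is provided by the Step-Audio GitHub repository, not by this sketch.
+
+ ```python
+ # Hypothetical download sketch; the official inference code lives in the
+ # stepfun-ai/Step-Audio GitHub repository.
+ from huggingface_hub import snapshot_download
+
+ local_dir = snapshot_download(
+     repo_id="stepfun-ai/Step-Audio-Chat",  # assumed repo id for this model card
+     local_dir="./Step-Audio-Chat",         # target directory for the checkpoint
+ )
+ print(f"Checkpoint available at: {local_dir}")
+ ```
+
+ Keep in mind that a 130B-parameter checkpoint is large, so plan for substantial disk space and multi-GPU hardware for inference.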
+ ## 2. Evaluation
+ ### 2.1 LLM judge metrics (GPT-4o) on [**StepEval-Audio-360**](https://huggingface.co/datasets/stepfun-ai/StepEval-Audio-360)
+ <table>
+ <caption>Comparison of fundamental capabilities of voice chat on the StepEval-Audio-360.</caption>
+ <thead>
+ <tr>
+ <th>Model</th>
+ <th style="text-align:center">Factuality (% &uarr;)</th>
+ <th style="text-align:center">Relevance (% &uarr;)</th>
+ <th style="text-align:center">Chat Score &uarr;</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>GLM4-Voice</td>
+ <td style="text-align:center">54.7</td>
+ <td style="text-align:center">66.4</td>
+ <td style="text-align:center">3.49</td>
+ </tr>
+ <tr>
+ <td>Qwen2-Audio</td>
+ <td style="text-align:center">22.6</td>
+ <td style="text-align:center">26.3</td>
+ <td style="text-align:center">2.27</td>
+ </tr>
+ <tr>
+ <td>Moshi<sup>*</sup></td>
+ <td style="text-align:center">1.0</td>
+ <td style="text-align:center">0</td>
+ <td style="text-align:center">1.49</td>
+ </tr>
+ <tr>
+ <td><strong>Step-Audio-Chat</strong></td>
+ <td style="text-align:center"><strong>66.4</strong></td>
+ <td style="text-align:center"><strong>75.2</strong></td>
+ <td style="text-align:center"><strong>4.11</strong></td>
+ </tr>
+ </tbody>
+ </table>

+ *Note: Moshi is marked with "\*" and its results should be considered for reference only.*

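+ For context, these scores come from an LLM-as-judge protocol in which GPT-4o grades each voice-chat turn for factuality, relevance, and overall chat quality. The sketch below is a hypothetical illustration of such a judging call; the exact prompt and rubric used for StepEval-Audio-360 are not reproduced here, and the helper name `judge_turn` and the JSON fields are assumptions.
+
+ ```python
+ # Hypothetical LLM-as-judge sketch using the OpenAI Python client (openai>=1.0).
+ from openai import OpenAI
+
+ client = OpenAI()  # expects OPENAI_API_KEY in the environment
+
+ def judge_turn(question: str, answer: str) -> str:
+     """Ask GPT-4o to grade one voice-chat turn; returns the judge's raw JSON string."""
+     rubric = (
+         "You are grading a voice assistant's reply.\n"
+         f"User question: {question}\n"
+         f"Assistant reply: {answer}\n"
+         "Return JSON with fields: factuality (0 or 1), relevance (0 or 1), chat_score (1-5)."
+     )
+     response = client.chat.completions.create(
+         model="gpt-4o",
+         messages=[{"role": "user", "content": rubric}],
+     )
+     return response.choices[0].message.content
+ ```
+
+ Per-turn factuality and relevance judgments can then be averaged into percentage scores, and chat scores averaged on the 1-5 scale.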
+ ### 2.2 Public Test Set
+
+ <table>
+ <thead>
+ <tr>
+ <th>Model</th>
+ <th style="text-align:center">Llama Question</th>
+ <th style="text-align:center">Web Questions</th>
+ <th style="text-align:center">TriviaQA*</th>
+ <th style="text-align:center">ComplexBench</th>
+ <th style="text-align:center">HSK-6</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>GLM4-Voice</td>
+ <td style="text-align:center">64.7</td>
+ <td style="text-align:center">32.2</td>
+ <td style="text-align:center">39.1</td>
+ <td style="text-align:center">66.0</td>
+ <td style="text-align:center">74.0</td>
+ </tr>
+ <tr>
+ <td>Moshi</td>
+ <td style="text-align:center">62.3</td>
+ <td style="text-align:center">26.6</td>
+ <td style="text-align:center">22.8</td>
+ <td style="text-align:center">-</td>
+ <td style="text-align:center">-</td>
+ </tr>
+ <tr>
+ <td>Freeze-Omni</td>
+ <td style="text-align:center">72.0</td>
+ <td style="text-align:center">44.7</td>
+ <td style="text-align:center">53.9</td>
+ <td style="text-align:center">-</td>
+ <td style="text-align:center">-</td>
+ </tr>
+ <tr>
+ <td>LUCY</td>
+ <td style="text-align:center">59.7</td>
+ <td style="text-align:center">29.3</td>
+ <td style="text-align:center">27.0</td>
+ <td style="text-align:center">-</td>
+ <td style="text-align:center">-</td>
+ </tr>
+ <tr>
+ <td>MinMo</td>
+ <td style="text-align:center">78.9</td>
+ <td style="text-align:center">55.0</td>
+ <td style="text-align:center">48.3</td>
+ <td style="text-align:center">-</td>
+ <td style="text-align:center">-</td>
+ </tr>
+ <tr>
+ <td>Qwen2-Audio</td>
+ <td style="text-align:center">52.0</td>
+ <td style="text-align:center">27.0</td>
+ <td style="text-align:center">37.3</td>
+ <td style="text-align:center">54.0</td>
+ <td style="text-align:center">-</td>
+ </tr>
+ <tr>
+ <td><strong>Step-Audio-Chat</strong></td>
+ <td style="text-align:center"><strong><i>81.0</i></strong></td>
+ <td style="text-align:center"><strong>75.1</strong></td>
+ <td style="text-align:center"><strong>58.0</strong></td>
+ <td style="text-align:center"><strong>74.0</strong></td>
+ <td style="text-align:center"><strong>86.0</strong></td>
+ </tr>
+ </tbody>
+ </table>
+
+ *Note: Results on the TriviaQA dataset (marked with "\*") are for reference only.*
+
+ ### 2.3 Audio instruction following
+ <table>
+ <thead>
+ <tr>
+ <th rowspan="2">Category</th>
+ <th colspan="2" style="text-align:center">Instruction Following</th>
+ <th colspan="2" style="text-align:center">Audio Quality</th>
+ </tr>
+ <tr>
+ <th style="text-align:center">GLM-4-Voice</th>
+ <th style="text-align:center">Step-Audio</th>
+ <th style="text-align:center">GLM-4-Voice</th>
+ <th style="text-align:center">Step-Audio</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Languages</td>
+ <td style="text-align:center">1.9</td>
+ <td style="text-align:center">3.8</td>
+ <td style="text-align:center">2.9</td>
+ <td style="text-align:center">3.3</td>
+ </tr>
+ <tr>
+ <td>Role-playing</td>
+ <td style="text-align:center">3.8</td>
+ <td style="text-align:center">4.2</td>
+ <td style="text-align:center">3.2</td>
+ <td style="text-align:center">3.6</td>
+ </tr>
+ <tr>
+ <td>Singing / RAP</td>
+ <td style="text-align:center">2.1</td>
+ <td style="text-align:center">2.4</td>
+ <td style="text-align:center">2.4</td>
+ <td style="text-align:center">4.0</td>
+ </tr>
+ <tr>
+ <td>Voice Control</td>
+ <td style="text-align:center">3.6</td>
+ <td style="text-align:center">4.4</td>
+ <td style="text-align:center">3.3</td>
+ <td style="text-align:center">4.1</td>
+ </tr>
+ </tbody>
+ </table>

+ ## 3. More information

  For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).