mlinmg commited on
Commit
7f33eb9
·
verified ·
1 Parent(s): 166cc37

Update README.md

Files changed (1)
  1. README.md +208 -3
README.md CHANGED
@@ -1,3 +1,208 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ pipeline_tag: audio-text-to-text
+ tags:
+ - multimodal
+ library_name: transformers
+ base_model:
+ - Qwen/Qwen2-Audio-7B-Instruct
+ ---
+
+ To launch the model with vLLM, run:
+ ```bash
+ vllm serve mlinmg/Qwen-2-Audio-Instruct-dynamic-fp8
+ ```
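Once the server is running, it should expose vLLM's OpenAI-compatible HTTP API. The sketch below builds a chat request for it. The default port 8000, the `audio_url` content-part shape, and the helper names `build_chat_request` and `send` are assumptions rather than anything this repository documents, so check them against your vLLM version.

```python
import json
import urllib.request

MODEL = "mlinmg/Qwen-2-Audio-Instruct-dynamic-fp8"
# Assumption: vLLM's default host and port; adjust if you passed --host/--port.
ENDPOINT = "http://localhost:8000/v1/chat/completions"


def build_chat_request(audio_url, prompt, max_tokens=256):
    """Build an OpenAI-style chat payload with one audio part and one text part."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                # Assumption: the server accepts `audio_url` content parts.
                {"type": "audio_url", "audio_url": {"url": audio_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


def send(payload, endpoint=ENDPOINT):
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3",
    "What's that sound?",
)
# With the server running, the reply text would be read like this:
# print(send(payload)["choices"][0]["message"]["content"])
```

Building the payload separately from sending it makes it possible to inspect the request before a server is available.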
+
+ # Qwen/Qwen2-Audio-7B-Instruct-FP8
+
+ ## Introduction
+
+ Qwen2-Audio is the new series of Qwen large audio-language models. Qwen2-Audio accepts various audio signal inputs and either analyzes the audio or responds directly in text to spoken instructions. We introduce two distinct audio interaction modes:
+
+ * voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
+
+ * audio analysis: users can provide audio and text instructions for analysis during the interaction.
+
+ We release Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct, which are the pretrained model and the chat model, respectively.
+
+ For more details, please refer to our [Blog](https://qwenlm.github.io/blog/qwen2-audio/), [GitHub](https://github.com/QwenLM/Qwen2-Audio), and [Report](https://www.arxiv.org/abs/2407.10759).
+ <br>
+
+ ## Requirements
+ Qwen2-Audio support is included in the latest Hugging Face transformers. We advise you to install from source with `pip install git+https://github.com/huggingface/transformers`; otherwise you might encounter the following error:
+ ```
+ KeyError: 'qwen2-audio'
+ ```
+
+ ## Quickstart
+
+ The following examples show how to use `Qwen2-Audio-7B-Instruct` for inference in both voice chat and audio analysis modes. Dialogs use the ChatML format, and the examples rely on `apply_chat_template` to produce it.
+
+ ### Voice Chat Inference
+ In voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input:
+ ```python
+ from io import BytesIO
+ from urllib.request import urlopen
+ import librosa
+ from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
+
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
+ model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")
+
+ conversation = [
+     {"role": "user", "content": [
+         {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
+     ]},
+     {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
+     {"role": "user", "content": [
+         {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
+     ]},
+ ]
+ text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+ # Download each audio clip and resample it to the rate the feature extractor expects.
+ audios = []
+ for message in conversation:
+     if isinstance(message["content"], list):
+         for ele in message["content"]:
+             if ele["type"] == "audio":
+                 audios.append(librosa.load(
+                     BytesIO(urlopen(ele["audio_url"]).read()),
+                     sr=processor.feature_extractor.sampling_rate)[0]
+                 )
+
+ inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
+ # Move every input tensor (not only input_ids) to the model's device.
+ inputs = inputs.to(model.device)
+
+ # max_new_tokens bounds only the generated tokens; max_length would also count the prompt.
+ generate_ids = model.generate(**inputs, max_new_tokens=256)
+ # Drop the prompt tokens so that only the newly generated tokens are decoded.
+ generate_ids = generate_ids[:, inputs.input_ids.size(1):]
+
+ response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ ```
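The audio-gathering loop in the example above walks every message and picks out the `audio` entries in order. As a rough sketch, that traversal can be isolated into a helper that returns only the URLs, which makes it easy to check what will be downloaded before any audio is fetched. The helper name `collect_audio_urls` is hypothetical and not part of transformers.

```python
def collect_audio_urls(conversation):
    """Return the audio URLs in a conversation, in the order they appear."""
    urls = []
    for message in conversation:
        content = message["content"]
        # Plain-string content (for example assistant turns) carries no audio.
        if not isinstance(content, list):
            continue
        for element in content:
            if element["type"] == "audio":
                urls.append(element["audio_url"])
    return urls


conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]

urls = collect_audio_urls(conversation)
for url in urls:
    print(url)
```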
+
+ ### Audio Analysis Inference
+ In audio analysis mode, users can provide both audio and text instructions for analysis:
+ ```python
+ from io import BytesIO
+ from urllib.request import urlopen
+ import librosa
+ from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
+
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
+ model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")
+
+ conversation = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": [
+         {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
+         {"type": "text", "text": "What's that sound?"},
+     ]},
+     {"role": "assistant", "content": "It is the sound of glass shattering."},
+     {"role": "user", "content": [
+         {"type": "text", "text": "What can you do when you hear that?"},
+     ]},
+     {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
+     {"role": "user", "content": [
+         {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
+         {"type": "text", "text": "What does the person say?"},
+     ]},
+ ]
+ text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+ # Download each audio clip and resample it to the rate the feature extractor expects.
+ audios = []
+ for message in conversation:
+     if isinstance(message["content"], list):
+         for ele in message["content"]:
+             if ele["type"] == "audio":
+                 audios.append(
+                     librosa.load(
+                         BytesIO(urlopen(ele["audio_url"]).read()),
+                         sr=processor.feature_extractor.sampling_rate)[0]
+                 )
+
+ inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
+ # Move every input tensor (not only input_ids) to the model's device.
+ inputs = inputs.to(model.device)
+
+ # max_new_tokens bounds only the generated tokens; max_length would also count the prompt.
+ generate_ids = model.generate(**inputs, max_new_tokens=256)
+ # Drop the prompt tokens so that only the newly generated tokens are decoded.
+ generate_ids = generate_ids[:, inputs.input_ids.size(1):]
+
+ response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ ```
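The examples slice `generate_ids` after calling `generate` because the returned sequences begin with the prompt tokens and continue with the newly generated ones. The sketch below shows the same slicing on plain Python lists. The token IDs are made up for illustration, and `strip_prompt` is a hypothetical helper, not a transformers function.

```python
def strip_prompt(generated_rows, prompt_length):
    """Drop the first `prompt_length` tokens from every generated row."""
    return [row[prompt_length:] for row in generated_rows]


# Made-up token IDs: a 4-token prompt followed by 3 generated tokens.
prompt_ids = [[101, 7, 8, 9]]
generated = [[101, 7, 8, 9, 42, 43, 2]]

new_tokens = strip_prompt(generated, len(prompt_ids[0]))
print(new_tokens)  # [[42, 43, 2]]
```

On tensors, `generate_ids[:, inputs.input_ids.size(1):]` performs the same column slice for every row of the batch at once.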
+
+ ### Batch Inference
+ Batch inference is also supported:
+ ```python
+ from io import BytesIO
+ from urllib.request import urlopen
+ import librosa
+ from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
+
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
+ model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")
+
+ conversation1 = [
+     {"role": "user", "content": [
+         {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
+         {"type": "text", "text": "What's that sound?"},
+     ]},
+     {"role": "assistant", "content": "It is the sound of glass shattering."},
+     {"role": "user", "content": [
+         {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
+         {"type": "text", "text": "What can you hear?"},
+     ]}
+ ]
+
+ conversation2 = [
+     {"role": "user", "content": [
+         {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
+         {"type": "text", "text": "What does the person say?"},
+     ]},
+ ]
+
+ conversations = [conversation1, conversation2]
+
+ # One templated prompt per conversation.
+ text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]
+
+ # Collect the audio clips of all conversations into one flat list, in order of appearance.
+ audios = []
+ for conversation in conversations:
+     for message in conversation:
+         if isinstance(message["content"], list):
+             for ele in message["content"]:
+                 if ele["type"] == "audio":
+                     audios.append(
+                         librosa.load(
+                             BytesIO(urlopen(ele["audio_url"]).read()),
+                             sr=processor.feature_extractor.sampling_rate)[0]
+                     )
+
+ inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
+ # Move every input tensor (not only input_ids) to the model's device.
+ inputs = inputs.to(model.device)
+
+ # max_new_tokens bounds only the generated tokens; max_length would also count the prompt.
+ generate_ids = model.generate(**inputs, max_new_tokens=256)
+ # Drop the prompt tokens so that only the newly generated tokens are decoded.
+ generate_ids = generate_ids[:, inputs.input_ids.size(1):]
+
+ # One decoded response per conversation.
+ response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+ ```
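In the batch example, the clips from all conversations go into one flat `audios` list, so which clip belongs to which conversation is determined by position alone. The sketch below computes the `(start, end)` range each conversation occupies in that flat list. The helper name `audio_slices` and the shortened placeholder URLs are assumptions for illustration.

```python
def audio_slices(conversations):
    """Return the (start, end) range each conversation occupies in the flat audio list."""
    slices = []
    start = 0
    for conversation in conversations:
        count = sum(
            1
            for message in conversation
            if isinstance(message["content"], list)
            for element in message["content"]
            if element["type"] == "audio"
        )
        slices.append((start, start + count))
        start += count
    return slices


# Placeholder URLs; only the structure matters for this sketch.
conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/a.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/b.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]},
]
conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/c.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]

ranges = audio_slices([conversation1, conversation2])
print(ranges)  # [(0, 2), (2, 3)]
```

A check like this can catch a mismatch between the number of audio entries in the prompts and the number of clips actually loaded.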
+
+ ## Citation
+
+ If you find our work helpful, feel free to cite us.
+
+ ```BibTeX
+ @article{Qwen2-Audio,
+   title={Qwen2-Audio Technical Report},
+   author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
+   journal={arXiv preprint arXiv:2407.10759},
+   year={2024}
+ }
+ ```
+
+ ```BibTeX
+ @article{Qwen-Audio,
+   title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
+   author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
+   journal={arXiv preprint arXiv:2311.07919},
+   year={2023}
+ }
+ ```