<div align="center">

## 🎙 [Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model](https://huggingface.co/papers/2309.11000)

[Xinyu Zhou (周欣宇)](https://www.linkedin.com/in/xinyu-zhou2000/), [Delong Chen (陈德龙)](https://chendelong.world/), [Yudong Chen (陈玉东)](https://rwxy.cuc.edu.cn/2019/0730/c5134a133504/pagem.htm)

[ArXiv](https://arxiv.org/abs/2309.11000) | [Poster](doc/YFRSW_Poster.pdf) | [Notebook](prosody_prediction.ipynb) | [Github](https://github.com/XinyuZhou2000/Spoken-Dialogue)

</div>

This project explores building an AI spoken dialogue system that *"thinks how to respond"* and *"thinks how to speak"* simultaneously. This aligns more closely with the human speech production process than the current cascade pipeline of independent chatbot and Text-to-Speech (TTS) modules.

We hypothesize that *Large Language Models (LLMs)* with billions of parameters possess significant speech understanding capabilities and can jointly model dialogue responses and linguistic features. We investigate prosodic structure prediction (PSP), a typical front-end task in TTS, to demonstrate the speech understanding ability of LLMs.
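
To make the PSP task concrete, here is a minimal sketch of what a prosody-annotated response could look like and how it might be parsed. It assumes a `#<level>` marker placed after each prosodic unit, a convention used in some Mandarin TTS corpora; the marker format, the function name `parse_prosody`, and the example sentence are illustrative assumptions, not the format used in this repository or its notebook.

```python
import re

# Assumed annotation format: "#<level>" follows the unit it closes,
# with a higher level meaning a stronger prosodic boundary.
BOUNDARY = re.compile(r"#(\d)")


def parse_prosody(annotated: str):
    """Split an annotated string into plain text and boundary positions.

    Returns (plain_text, boundaries), where each boundary is a pair of
    (character offset in plain_text, boundary level).
    """
    plain_parts = []
    boundaries = []
    offset = 0  # running length of the plain text seen so far
    last = 0    # index in `annotated` just past the previous marker
    for match in BOUNDARY.finditer(annotated):
        segment = annotated[last:match.start()]
        plain_parts.append(segment)
        offset += len(segment)
        boundaries.append((offset, int(match.group(1))))
        last = match.end()
    plain_parts.append(annotated[last:])  # text after the final marker
    return "".join(plain_parts), boundaries


# Made-up example sentence, for illustration only.
text, marks = parse_prosody("the weather#1 is nice#2 today#3")
print(text)
print(marks)
```

Under this framing, a model that jointly produces the response and its prosodic structure would emit the annotated string directly, and a TTS back end could consume the recovered boundaries.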