Create README.md

<div align="center">

## 🎙 [Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model](https://huggingface.co/papers/2309.11000)

[Xinyu Zhou (周欣宇)](https://www.linkedin.com/in/xinyu-zhou2000/), [Delong Chen (陈德龙)](https://chendelong.world/), [Yudong Chen (陈玉东)](https://rwxy.cuc.edu.cn/2019/0730/c5134a133504/pagem.htm)

[arXiv](https://arxiv.org/abs/2309.11000) | [Poster](doc/YFRSW_Poster.pdf) | [Notebook](prosody_prediction.ipynb) | [GitHub](https://github.com/XinyuZhou2000/Spoken-Dialogue)

</div>

This project explores the potential of building an AI spoken dialogue system that *"thinks how to respond"* and *"thinks how to speak"* simultaneously. This joint approach aligns more closely with the human speech production process than the current cascade pipeline, in which a chatbot module and a Text-to-Speech (TTS) module run independently.

We hypothesize that *Large Language Models (LLMs)* with billions of parameters possess significant speech understanding capabilities and can jointly model dialogue responses and linguistic features. To demonstrate this speech understanding ability, we investigate prosodic structure prediction (PSP), a typical front-end task in TTS.
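To make the PSP task concrete, here is a minimal sketch of how prosodic boundaries can be represented as plain text that a language model could read or emit. The `#N` marker format, the level numbering, and the helper names `parse_prosody` and `format_prosody` are assumptions made for illustration (the markers are borrowed from a convention common in Mandarin TTS corpora); they are not taken from this project's code or notebook.

```python
import re
from typing import List, Tuple

# Assumed (hypothetical) format: "#<digit>" directly after a character marks a
# prosodic boundary of that level. Characters without a marker get level 0.
_TOKEN = re.compile(r"(.)(?:#(\d))?", re.DOTALL)


def parse_prosody(annotated: str) -> Tuple[str, List[int]]:
    """Split an annotated string into plain text and per-character boundary levels."""
    chars: List[str] = []
    levels: List[int] = []
    for match in _TOKEN.finditer(annotated):
        chars.append(match.group(1))
        levels.append(int(match.group(2)) if match.group(2) else 0)
    return "".join(chars), levels


def format_prosody(text: str, levels: List[int]) -> str:
    """Inverse of parse_prosody: re-insert "#N" markers after each character."""
    if len(text) != len(levels):
        raise ValueError("text and levels must have the same length")
    return "".join(ch + (f"#{lv}" if lv else "") for ch, lv in zip(text, levels))


if __name__ == "__main__":
    plain, boundary_levels = parse_prosody("今天#1天气#2很好#3。")
    print(plain)
    print(boundary_levels)
```

Under this representation, PSP becomes a text-to-text mapping from the plain sentence to its annotated form, which is the kind of output a dialogue model could in principle produce alongside its response.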


README.md +14 -0


	@@ -0,0 +1,14 @@

+<div align="center">
+🎙 [**Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model**](https://huggingface.co/papers/2309.11000)
+[Xinyu Zhou (周欣宇)](https://www.linkedin.com/in/xinyu-zhou2000/), [Delong Chen (陈德龙)](https://chendelong.world/), [Yudong Chen (陈玉东)](https://rwxy.cuc.edu.cn/2019/0730/c5134a133504/pagem.htm)
+[arXiv](https://arxiv.org/abs/2309.11000) | [Poster](doc/YFRSW_Poster.pdf) | [Notebook](prosody_prediction.ipynb) | [GitHub](https://github.com/XinyuZhou2000/Spoken-Dialogue)
+</div>
+This project explores the potential of building an AI spoken dialogue system that *"thinks how to respond"* and *"thinks how to speak"* simultaneously. This joint approach aligns more closely with the human speech production process than the current cascade pipeline, in which a chatbot module and a Text-to-Speech (TTS) module run independently.
+We hypothesize that *Large Language Models (LLMs)* with billions of parameters possess significant speech understanding capabilities and can jointly model dialogue responses and linguistic features. To demonstrate this speech understanding ability, we investigate prosodic structure prediction (PSP), a typical front-end task in TTS.