EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Paper
•
2509.09174
•
Published
•
61
🐈⬛ Github | 📃 Paper | 🚀 Space (8B) | 📊 EchoX-Dialougues | 📊 EchoX-Dialogues-Plus
EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. This is the 3B version. By introducing Echo Training, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 10k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.
Load the EchoX model and run inference with your audio files as shown in the GitHub repository.
@misc{zhang2025echoxmitigatingacousticsemanticgap,
title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
year={2025},
eprint={2509.09174},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.09174},
}
Totally Free + Zero Barriers + No Login Required