VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
Paper: arXiv:2509.15969
VoXtream is a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.
pip install voxtream
voxtream \
--prompt-audio assets/audio/male.wav \
--prompt-text "The liquor was first created as 'Brandy Milk', produced with milk, brandy and vanilla." \
--text "In general, however, some method is then needed to evaluate each approximation." \
--output "output_stream.wav"
voxtream \
--prompt-audio assets/audio/female.wav \
--prompt-text "Betty Cooper helps Archie with cleaning a store room, when Reggie attacks her." \
--text "Staff do not always do enough to prevent violence." \
--output "full_stream.wav" \
--full-stream
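The two invocations above differ only in their inputs and in the `--full-stream` flag, so they can be driven from a script. The following is a minimal sketch, assuming the `voxtream` command installed by `pip install voxtream` is on the `PATH`. It uses only the flags shown above; the helper names `build_voxtream_command` and `synthesize` are hypothetical and not part of the package.

```python
import shlex
import subprocess
from typing import List


def build_voxtream_command(
    prompt_audio: str,
    prompt_text: str,
    text: str,
    output: str,
    full_stream: bool = False,
) -> List[str]:
    """Build the argument list for the voxtream CLI (hypothetical helper).

    Only the flags that appear in the usage examples are emitted.
    """
    cmd = [
        "voxtream",
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        "--text", text,
        "--output", output,
    ]
    if full_stream:
        # Same flag as in the second usage example.
        cmd.append("--full-stream")
    return cmd


def synthesize(
    prompt_audio: str,
    prompt_text: str,
    text: str,
    output: str,
    full_stream: bool = False,
) -> None:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    cmd = build_voxtream_command(prompt_audio, prompt_text, text, output, full_stream)
    print(shlex.join(cmd))  # show the exact shell command being executed
    subprocess.run(cmd, check=True)
```

Calling `synthesize("assets/audio/female.wav", "Betty Cooper helps Archie with cleaning a store room, when Reggie attacks her.", "Staff do not always do enough to prevent violence.", "full_stream.wav", full_stream=True)` would issue the same command as the second example. Passing the arguments as a list avoids shell quoting problems with prompt text that contains quotes or commas.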
No organization or individual may use any technology described in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this condition, you could be in violation of copyright laws.
The model was trained on a 9k-hour subset of the Emilia and HiFiTTS2 datasets. You can download it here. For more details, please check our paper.
@article{torgashov2025voxtream,
author = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
title = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
journal = {arXiv:2509.15969},
year = {2025}
}