---
language:
- en
- zh
- ja
- fr
- de
- ko
library_name: transformers
license: apache-2.0
pipeline_tag: audio-to-audio
tags:
- Speech-Tokenizer
- Text-to-Speech
---
# 🚀 TaDiCodec
This model was presented in the paper [TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling](https://huggingface.co/papers/2508.16790).
Project page: [https://tadicodec.github.io/](https://tadicodec.github.io/)
GitHub Repository: [https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer](https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer)
## Abstract
Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations, including: 1) dependence on multi-layer residual vector quantization structures or high frame rates, 2) reliance on auxiliary pre-trained models for semantic distillation, and 3) requirements for complex two-stage training processes. In this work, we introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach designed to overcome these challenges. TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS). Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. We also validate the compatibility of TaDiCodec in language-model-based zero-shot text-to-speech with both autoregressive modeling and masked generative modeling, demonstrating its effectiveness and efficiency for speech language modeling, as well as a notably small reconstruction-generation gap. Audio samples are available at https://tadicodec.github.io/. We release code and model checkpoints at https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.
[GitHub](https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer) | [Paper](https://arxiv.org/abs/2508.16790) | [Project Page](https://tadicodec.github.io/) | [Python](https://www.python.org/) | [PyTorch](https://pytorch.org/) | [Hugging Face](https://huggingface.co/amphion/TaDiCodec)
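The bitrate in the abstract follows directly from the frame rate and the bits carried by each code. A quick sanity check of those numbers, assuming one code per frame (the implied 2^14-entry codebook is inferred from the reported figures, not stated above):

```python
# Sanity check of the abstract's numbers: 6.25 Hz frame rate, 0.0875 kbps.
frame_rate_hz = 6.25
bitrate_bps = 87.5  # 0.0875 kbps = 87.5 bits/s

# Each frame must carry 87.5 / 6.25 = 14 bits,
# i.e. a single-layer codebook with 2**14 = 16384 entries (inferred).
bits_per_frame = bitrate_bps / frame_rate_hz
print(bits_per_frame)             # 14.0
print(2 ** int(bits_per_frame))   # 16384 possible codes per frame
```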
# 🤗 Pre-trained Models
## 📦 Model Zoo - Ready to Use!
*Download our pre-trained models for instant inference*
## 🎵 TaDiCodec
| Model | 🤗 Hugging Face | 👷 Status |
|:-----:|:---------------:|:------:|
| **🚀 TaDiCodec** | [amphion/TaDiCodec](https://huggingface.co/amphion/TaDiCodec) | ✅ |
| **🚀 TaDiCodec-old** | [amphion/TaDiCodec-old](https://huggingface.co/amphion/TaDiCodec-old) | 🚧 |
*Note: TaDiCodec-old is an earlier version of TaDiCodec; TaDiCodec-TTS-AR-Phi-3.5-4B is based on TaDiCodec-old.*
## 🎤 TTS Models
| Model | Type | LLM | 🤗 Hugging Face | 👷 Status |
|:-----:|:----:|:---:|:---------------:|:-------------:|
| **🤖 TaDiCodec-TTS-AR-Qwen2.5-0.5B** | AR | Qwen2.5-0.5B-Instruct | [amphion/TaDiCodec-TTS-AR-Qwen2.5-0.5B](https://huggingface.co/amphion/TaDiCodec-TTS-AR-Qwen2.5-0.5B) | ✅ |
| **🤖 TaDiCodec-TTS-AR-Qwen2.5-3B** | AR | Qwen2.5-3B-Instruct | [amphion/TaDiCodec-TTS-AR-Qwen2.5-3B](https://huggingface.co/amphion/TaDiCodec-TTS-AR-Qwen2.5-3B) | ✅ |
| **🤖 TaDiCodec-TTS-AR-Phi-3.5-4B** | AR | Phi-3.5-mini-instruct | [amphion/TaDiCodec-TTS-AR-Phi-3.5-4B](https://huggingface.co/amphion/TaDiCodec-TTS-AR-Phi-3.5-4B) | 🚧 |
| **🌊 TaDiCodec-TTS-MGM** | MGM | - | [amphion/TaDiCodec-TTS-MGM](https://huggingface.co/amphion/TaDiCodec-TTS-MGM) | ✅ |
## 🔧 Quick Model Usage
```python
# 🤗 Load from Hugging Face
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline

# Load the TaDiCodec tokenizer; the checkpoint is downloaded automatically
# from Hugging Face on first use
tokenizer = TaDiCodecPipline.from_pretrained("amphion/TaDiCodec")

# Load the AR TTS model (also auto-downloaded on first use)
ar_tts_model = TTSInferencePipeline.from_pretrained("amphion/TaDiCodec-TTS-AR-Qwen2.5-3B")

# Load the MGM TTS model (also auto-downloaded on first use)
mgm_tts_model = MGMInferencePipeline.from_pretrained("amphion/TaDiCodec-TTS-MGM")
```
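Since the AR and MGM pipelines expose the same `from_pretrained` entry point, switching between them can be a one-line decision. A minimal sketch using only the classes and checkpoint names shown above (`load_tts_pipeline` is a hypothetical convenience helper, not part of the repository):

```python
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline

def load_tts_pipeline(kind: str = "ar"):
    """Hypothetical helper: pick the AR or MGM pipeline by name."""
    if kind == "ar":
        # Autoregressive TTS on a Qwen2.5-3B-Instruct backbone
        return TTSInferencePipeline.from_pretrained("amphion/TaDiCodec-TTS-AR-Qwen2.5-3B")
    if kind == "mgm":
        # Masked generative modeling TTS (no LLM backbone)
        return MGMInferencePipeline.from_pretrained("amphion/TaDiCodec-TTS-MGM")
    raise ValueError(f"unknown pipeline kind: {kind}")

pipeline = load_tts_pipeline("ar")
```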
# 🚀 Quick Start
## Installation
```bash
# Clone the repository
git clone https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer
# Install dependencies
bash env.sh
```
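After running `env.sh`, a quick import check can confirm the packages the examples below rely on. This is an unofficial check based only on the imports used in this card, not a smoke test shipped with the repository:

```python
# Minimal environment check (unofficial; based on the imports the
# examples in this card use).
import torch
import soundfile
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("soundfile:", soundfile.__version__)
print("transformers:", transformers.__version__)
```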
## Basic Usage
**Please refer to the [use_examples](https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer/tree/main/use_examples) folder for more detailed usage examples.**
### Speech Tokenization and Reconstruction
```python
# Example: Using TaDiCodec for speech tokenization
import torch
import soundfile as sf
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = TaDiCodecPipline.from_pretrained(ckpt_dir="./ckpt/TaDiCodec", device=device)
# Text of the prompt audio
prompt_text = "In short, we embarked on a mission to make America great again, for all Americans."
# Text of the target audio
target_text = "But to those who knew her well, it was a symbol of her unwavering determination and spirit."
# Input audio path of the prompt audio
prompt_speech_path = "./use_examples/test_audio/trump_0.wav"
# Input audio path of the target audio
speech_path = "./use_examples/test_audio/trump_1.wav"
rec_audio = pipe(
    text=target_text,
    speech_path=speech_path,
    prompt_text=prompt_text,
    prompt_speech_path=prompt_speech_path,
)
sf.write("./use_examples/test_audio/trump_rec.wav", rec_audio, 24000)
```
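At the 6.25 Hz frame rate, the number of discrete tokens behind a reconstruction like this can be estimated from the clip duration alone. A back-of-the-envelope sketch; the exact count may differ by a frame or two depending on padding, which is an assumption here:

```python
import math
import soundfile as sf

# Inspect the target clip's duration without loading the samples
info = sf.info("./use_examples/test_audio/trump_1.wav")
frame_rate_hz = 6.25  # TaDiCodec frame rate for 24 kHz speech

# One token per frame => tokens ~= duration * frame rate
approx_tokens = math.ceil(info.duration * frame_rate_hz)
print(f"{info.duration:.2f} s of audio -> ~{approx_tokens} tokens")
```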
### Zero-shot TTS with TaDiCodec
```python
import torch
import soundfile as sf
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
# from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Create AR TTS pipeline
pipeline = TTSInferencePipeline.from_pretrained(
    tadicodec_path="./ckpt/TaDiCodec",
    llm_path="./ckpt/TaDiCodec-TTS-AR-Qwen2.5-3B",
    device=device,
)
# Inference on a single sample; the MGM TTS pipeline can be used the same way
audio = pipeline(
    text="但是 to those who 知道 her well, it was a 标志 of her unwavering 决心 and spirit.",  # code-switching inputs are supported
    prompt_text="In short, we embarked on a mission to make America great again, for all Americans.",
    prompt_speech_path="./use_examples/test_audio/trump_0.wav",
)
sf.write("./use_examples/test_audio/lm_tts_output.wav", audio, 24000)
```
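The pipeline call above is per-utterance, so synthesizing several sentences with the same prompt speaker is just a loop over the same call. A minimal sketch reusing only the arguments shown above; the target texts and output file names are illustrative:

```python
import soundfile as sf

# `pipeline` is the TTSInferencePipeline created in the example above.
# Illustrative inputs; any list of target texts works the same way.
texts = [
    "The first sentence to synthesize.",
    "The second sentence, with the same prompt speaker.",
]

for i, text in enumerate(texts):
    audio = pipeline(
        text=text,
        prompt_text="In short, we embarked on a mission to make America great again, for all Americans.",
        prompt_speech_path="./use_examples/test_audio/trump_0.wav",
    )
    sf.write(f"./use_examples/test_audio/lm_tts_output_{i}.wav", audio, 24000)
```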
# 📚 Citation
If you find this repository useful, please cite our paper:
TaDiCodec:
```bibtex
@article{tadicodec2025,
title={TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling},
author={Yuancheng Wang and Dekun Chen and Xueyao Zhang and Junan Zhang and Jiaqi Li and Zhizheng Wu},
journal={arXiv preprint arXiv:2508.16790},
year={2025},
url={https://arxiv.org/abs/2508.16790}
}
```
Amphion:
```bibtex
@inproceedings{amphion,
author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
year={2024}
}
```
MaskGCT:
```bibtex
@inproceedings{wang2024maskgct,
author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
booktitle = {{ICLR}},
publisher = {OpenReview.net},
year = {2025}
}
```
# 🙏 Acknowledgments
- **MGM-based TTS** is built upon [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct).
- **Vocos vocoder** is built upon [Vocos](https://github.com/gemelo-ai/vocos).
- **NAR Llama-style transformers** are built upon [transformers](https://github.com/huggingface/transformers).
- **BSQ (Binary Spherical Quantization)** is built upon [vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch) and [bsq-vit](https://github.com/zhaoyue-zephyrus/bsq-vit).
- **Training codebase** is built upon [Amphion](https://github.com/open-mmlab/Amphion) and [accelerate](https://github.com/huggingface/accelerate). |