|
--- |
|
license: cc-by-nc-4.0 |
|
--- |
|
<div align="center"> |
|
<h1> |
|
DistilCodec |
|
</h1> |
|
<p> |
|
<b><em>DistilCodec: A Single Codebook Audio Codec For Universal Audio</em></b> |
|
</p> |
|
|
<a href="https://arxiv.org/abs/2505.17426" style="color:red">Paper </a> | |
|
<a href="https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0" style="color:#FFD700">HuggingFace Model</a> | |
|
<a href="https://github.com/IDEA-Emdoor-Lab/DistilCodec" style="color:gray">Code</a> |
|
<p> |
|
<img src="./idea_logo.png" alt="Institution 1" style="width: 200px; height: 60px;"> |
|
</p> |
|
<p> |
|
<img src="./yidao_logo.png" alt="Institution 2" style="width: 200px; height: 60px;"> |
|
<img src="./yijiayiban.png" alt="Institution 3" style="width: 200px; height: 60px;"> |
|
</p> |
|
</div> |
|
|
|
|
|
# 🔥 News |
|
- *2025.05.26*: We released the DistilCodec-v1.0 checkpoint on [Hugging Face](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0).

- *2025.05.26*: The paper is available on [arXiv](https://arxiv.org/abs/2505.17426).

- *2025.05.23*: We submitted the paper to arXiv.
|
|
|
## Introduction of DistilCodec |
|
The Joint Laboratory of the International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd. and Shenzhen Yijiayiban Information Technology Co., Ltd., has launched DistilCodec, a single-codebook neural audio codec (NAC) with 32768 codes trained on universal audio. The foundational network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to that proposed in SoundStream. The encoder employs a ConvNeXt-V2 structure, while the vector quantization module implements the GRFVQ scheme. The decoder employs a ConvTranspose1d-based architecture similar to HiFi-GAN. The training methodology of DistilCodec also follows HiFi-GAN, incorporating three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSTFTD). Here is the architecture of DistilCodec:
|
<img src="./figure.jpg" alt="The Architecture of DistilCodec" style="width: 100%; height: auto;" /> |
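To make the single-codebook idea concrete, here is a minimal, hedged sketch (not DistilCodec's actual implementation) of the core vector-quantization step: each encoder output frame is mapped to the index of its nearest entry in a codebook of 32768 codes, and that index is the discrete audio token. The dimensions below are toy values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32768, 8))   # toy codebook: 32768 codes, 8-dim embeddings
frames = rng.normal(size=(5, 8))         # toy batch of 5 encoder output frames

# Squared Euclidean distance from every frame to every codebook entry
d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = d2.argmin(axis=1)               # one integer token per frame
quantized = codebook[tokens]             # quantized embeddings fed to the decoder

print(tokens.shape, quantized.shape)     # (5,) (5, 8)
```

In a real GRFVQ setup the lookup is grouped, residual, and finite-scalar refined, but the token produced per frame is still a single codebook index, which is what makes the codec convenient for LLM-style modeling.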
|
The distribution of DistilCodec's training data is shown in the table below:
|
| **Data Category** | **Data Size (in hours)** |
|-----------------------------|--------------------------|
| Chinese Audiobook | 38000 |
| Chinese Common Audio | 20000 |
| English Audiobook | 10000 |
| English Speech | 30000 |
| Music | 2000 |
| **Total** | **100000** |
|
|
|
## Inference of DistilCodec |
|
The inference code is available on GitHub: [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec).
|
|
|
### Part1: Generating discrete audio tokens from DistilCodec |
|
|
|
```python |
|
|
|
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = '/path/to/audio_file'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    # If True, the LLM's vocabulary size is added as an offset to each audio
    # token; DistilCodec's default offset is the vocabulary size of Qwen2.5-7B.
    plus_llm_offset=True
)
print(audio_tokens)
|
|
|
``` |
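Since DistilCodec emits about 93 tokens per second of audio (see the model table below), the token budget for a clip can be estimated with simple arithmetic. This is an illustrative helper, not part of the `distil_codec` API:

```python
# Back-of-the-envelope token budget for DistilCodec output.
TOKENS_PER_SECOND = 93  # approximate rate reported for DistilCodec-v1.0

def estimated_token_count(duration_s: float) -> int:
    """Estimate how many discrete audio tokens a clip of this length yields."""
    return round(duration_s * TOKENS_PER_SECOND)

print(estimated_token_count(10.0))  # a 10-second clip yields roughly 930 tokens
```

This kind of estimate is useful when packing audio tokens into an LLM context window.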
|
|
|
### Part2: Reconstruct audio from raw audio |
|
```python |
|
|
|
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = '/path/to/audio_file'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    # If True, the LLM's vocabulary size is added as an offset to each audio
    # token; DistilCodec's default offset is the vocabulary size of Qwen2.5-7B.
    plus_llm_offset=True
)
print(audio_tokens)

# The generated audio is saved to f'{gen_audio_save_path}/{audio_name}.wav'
gen_audio_save_path = '/path/to/audio_save_path'
audio_name = 'audio_name'
y_gen = codec.decode_from_codes(
    audio_tokens,
    # If plus_llm_offset was True when generating the tokens, minus_token_offset
    # must also be True so the LLM vocabulary offset is subtracted before decoding.
    minus_token_offset=True
)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)
|
|
|
``` |
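The `plus_llm_offset` / `minus_token_offset` pair implements a simple shift so that audio tokens and LLM text tokens occupy disjoint ID ranges. The sketch below illustrates the round trip; the offset value is a placeholder assumption for illustration, not DistilCodec's actual constant:

```python
# Illustrative round trip of the token offset: codes are shifted up by the
# LLM vocabulary size on encode and shifted back down before decoding.
LLM_VOCAB_SIZE = 151_643  # assumption: a Qwen2.5-style vocabulary size, for illustration only

raw_codes = [17, 4096, 32767]                             # codec-native token IDs
llm_tokens = [c + LLM_VOCAB_SIZE for c in raw_codes]      # effect of plus_llm_offset=True
restored = [t - LLM_VOCAB_SIZE for t in llm_tokens]       # effect of minus_token_offset=True

assert restored == raw_codes
print(llm_tokens)
```

Mixing the two flags (shifting on encode but not on decode) would hand the decoder out-of-range indices, which is why the comment in the example above ties them together.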
|
|
|
## Available DistilCodec models |
|
| Model Version | Huggingface | Corpus | Token/s | Domain |
|-----------------------|---------|---------------|---------------|-----------------------------------|
| DistilCodec-v1.0 | [HuggingFace](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0) | Universal Audio | 93 | Universal Audio |
|
|
|
|
|
## Citation |
|
|
|
If you find our work useful in your research, please cite our work: |
|
|
|
``` |
|
@misc{wang2025unittsendtoendttsdecoupling, |
|
title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information}, |
|
author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang}, |
|
year={2025}, |
|
eprint={2505.17426}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.SD}, |
|
url={https://arxiv.org/abs/2505.17426}, |
|
} |
|
``` |
|
|
|
|
|
## Disclaimer |
|
|
|
DistilCodec provides universal audio discretization capabilities for academic research purposes only. We encourage the community to uphold safety and ethical principles in AI research and applications.
|
|
|
Important Notes: |
|
|
|
- Compliance with the model's open-source license is mandatory. |
|
|
|
- Unauthorized voice replication applications are strictly prohibited. |
|
|
|
- Developers bear no responsibility for any misuse of this model. |
|
|
|
|
|
## License |
|
<a href="https://arxiv.org/abs/2505.17426">UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information</a> © 2025 by <a href="https://creativecommons.org">Rui Wang, Qianguo Sun, Tianrong Chen, Zhiyun Zeng, Junlong Wu, Jiaxing Zhang</a> is licensed under <a href="https://creativecommons.org/licenses/by-nc-nd/4.0/">CC BY-NC-ND 4.0</a><img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/nd.svg" style="max-width: 1em;max-height:1em;margin-left: .2em;"> |