|
--- |
|
license: cc-by-nc-4.0 |
|
--- |
|
# DistilCodec |
|
The Joint Laboratory of the International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd., has launched DistilCodec, a single-codebook Neural Audio Codec (NAC) with 32,768 codes trained on universal audio.
|
|
|
|
|
[Paper](https://arxiv.org/abs/2408.16532)

[Model](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0)
|
|
|
|
|
# 🔥 News |
|
- *2025.05.25*: We release the code of DistilCodec-v1.0, including training and inference. |
|
- *2025.05.23*: We release UniTTS and DistilCodec on [arXiv](https://arxiv.org/abs/2408.16532).
|
|
|
## Introduction of DistilCodec |
|
The foundational network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to that proposed in SoundStream. The encoder employs a ConvNeXt-V2 structure, while the vector-quantization module implements the GRFVQ scheme. The decoder employs a ConvTranspose1d-based architectural configuration similar to HiFiGAN. Detailed network specifications and layer configurations are provided in Appendix A.1 of the paper. The training methodology of DistilCodec follows an approach similar to HiFiGAN, incorporating three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSTFTD). Here is the architecture of DistilCodec:
|
 |
|
The distribution of DistilCodec's training data is shown in the table below:
|
| **Data Category**    | **Data Size (hours)** |
|----------------------|-----------------------|
| Chinese Audiobook    | 38,000                |
| Chinese Common Audio | 20,000                |
| English Audio        | 40,000                |
| Music                | 2,000                 |
| **Total**            | **100,000**           |
|
|
|
## Inference of DistilCodec |
|
The code is available at [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec).
|
|
|
### Part 1: Generating discrete audio codes
|
|
|
```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

# Paths to the model config and checkpoint, plus the checkpoint step to load.
codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

# Encode an audio file (resampled to 24 kHz) into discrete codes.
audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(codec, audio_path, target_sr=24000)
print(audio_tokens)
```
|
|
|
### Part 2: Reconstructing audio from a raw wav
|
```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

# Paths to the model config and checkpoint, plus the checkpoint step to load.
codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

# Encode an audio file (resampled to 24 kHz) into discrete codes.
audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(codec, audio_path, target_sr=24000)
print(audio_tokens)

# Set up the save path for the generated audio; the file is written to
# f'{gen_audio_save_path}/{audio_name}.wav'.
gen_audio_save_path = 'path_to_save_path'
audio_name = 'your_audio_name'

# Decode the discrete codes back to a waveform and save it.
y_gen = codec.decode_from_codes(audio_tokens, minus_token_offset=True)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)
```
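To reconstruct a whole directory of wav files, the two steps above can be looped as-is. The sketch below is a hypothetical convenience wrapper that reuses only the calls shown in Parts 1 and 2; the directory paths and the `.wav` glob pattern are placeholder assumptions.

```python
from pathlib import Path

from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec = DistilCodec.from_pretrained(
    config_path='path_to_model_config',
    model_path='path_to_codec_ckpt_path',
    load_steps=204000,
    use_generator=True,
    is_debug=False).eval()

# Hypothetical placeholder directories.
input_dir = Path('path_to_wav_dir')
gen_audio_save_path = 'path_to_save_path'

for wav_file in sorted(input_dir.glob('*.wav')):
    # wav -> discrete codes -> reconstructed wav, as in Parts 1 and 2.
    audio_tokens = demo_for_generate_audio_codes(codec, str(wav_file), target_sr=24000)
    y_gen = codec.decode_from_codes(audio_tokens, minus_token_offset=True)
    codec.save_wav(
        audio_gen_batch=y_gen,
        nhop_lengths=[y_gen.shape[-1]],
        save_path=gen_audio_save_path,
        name_tag=wav_file.stem
    )
```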
|
|
|
## Available DistilCodec models |
|
The 🤗 icon links to the Hugging Face model hub.
|
| Model Version    | Hugging Face | Corpus          | Tokens/s | Domain                           | Open-Source |
|------------------|--------------|-----------------|----------|----------------------------------|-------------|
| DistilCodec-v1.0 | [🤗](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0) | Universal Audio | 93 | Audiobook, Speech, Audio Effects | ✓ |
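Given 93 tokens per second and a single 32,768-entry codebook, each token carries log2(32768) = 15 bits, so DistilCodec's nominal bitrate works out to roughly 1.4 kbps. A quick sanity check:

```python
import math

tokens_per_second = 93   # from the table above
codebook_size = 32768    # DistilCodec's single codebook

bits_per_token = math.log2(codebook_size)          # 15 bits per token
bitrate_bps = tokens_per_second * bits_per_token   # 1395 bits/s
print(f'{bitrate_bps:.0f} bps ~= {bitrate_bps / 1000:.2f} kbps')
```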
|
|
|
## References |
|
The overall training pipeline of DistilCodec draws inspiration from AcademiCodec, while its encoder and decoder design is adapted from fish-speech. The Vector Quantization (VQ) component implements GRFVQ using the vector-quantize-pytorch framework. These three exceptional works have provided invaluable assistance in our implementation of DistilCodec. Below are links to these reference projects: |
|
|
|
[1][vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch) |
|
|
|
[2][AcademiCodec](https://github.com/moewiee/hificodec) |
|
|
|
[3][fish-speech](https://github.com/fishaudio/fish-speech) |
|
|
|
|
|
## Citation |
|
|
|
If you find this code useful in your research, please cite our work: |
|
|
|
``` |
|
@article{wang2025unitts,
  title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information},
  author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
  journal={arXiv preprint arXiv:2408.16532},
  year={2025}
}
|
``` |