license: cc-by-nc-4.0
DistilCodec
The Joint Laboratory of International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd., has launched DistilCodec, a single-codebook Neural Audio Codec (NAC) with 32768 codes trained on universal audio.
🔥 News
- 2025.05.25: We released the code of DistilCodec-v1.0, including training and inference.
- 2025.05.23: We released UniTTS and DistilCodec on arXiv.
Introduction of DistilCodec
The foundational network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to that proposed in SoundStream. The encoder employs a ConvNeXt-V2 structure, while the vector quantization module implements the GRFVQ scheme. The decoder employs a ConvTranspose1d-based architecture similar to HiFiGAN. Detailed network specifications and layer configurations are provided in Appendix A.1. The training methodology of DistilCodec follows a similar approach to HiFiGAN, incorporating three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSFTFD).
[Figure: architecture of DistilCodec]
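To make the "VQ" step of the Encoder-VQ-Decoder framework more concrete, here is a minimal sketch of a single-codebook lookup in pure Python. It shows plain nearest-neighbour vector quantization, not the GRFVQ scheme that DistilCodec actually uses, and the function names (`quantize`, `dequantize`) and the toy 4-entry codebook are illustrative assumptions rather than part of the DistilCodec code base.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each encoder frame to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look the indices back up; a decoder would consume these vectors."""
    return [list(codebook[i]) for i in indices]


# Toy codebook with 4 entries (DistilCodec's single codebook has 32768).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Stand-in for encoder outputs: one 2-dimensional vector per frame.
frames = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]

indices = quantize(frames, codebook)
reconstructed = dequantize(indices, codebook)
print(indices)        # discrete codes, one per frame
print(reconstructed)  # quantized vectors passed on to the decoder
```

With a single codebook, each frame is represented by exactly one integer, which is why the codec output can be handled as a flat token sequence.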
The distribution of the DistilCodec training data is shown in the table below:
Data Category | Data Size (in hours) |
---|---|
Chinese Audiobook | 38000 |
Chinese Common Audio | 20000 |
English Audio | 40000 |
Music | 2000 |
Total | 100000 |
Inference of DistilCodec
The code is in DistilCodec.
Part 1: Generating discrete codes
from distil_codec import DistilCodec, demo_for_generate_audio_codes

# Model config, checkpoint path, and the training step of the checkpoint to load
codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

# Encode one audio file into discrete codes (target sample rate: 24 kHz)
audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(codec, audio_path, target_sr=24000)
print(audio_tokens)
Part 2: Reconstructing audio from a raw WAV file
from distil_codec import DistilCodec, demo_for_generate_audio_codes

# Model config, checkpoint path, and the training step of the checkpoint to load
codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

# Encode one audio file into discrete codes (target sample rate: 24 kHz)
audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(codec, audio_path, target_sr=24000)
print(audio_tokens)

# Set the save location; the output file is f'{gen_audio_save_path}/{audio_name}.wav'
gen_audio_save_path = 'path_to_save_path'
audio_name = 'your_audio_name'

# Decode the codes back into a waveform and write it to disk
y_gen = codec.decode_from_codes(audio_tokens, minus_token_offset=True)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)
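The `minus_token_offset=True` argument is not explained in this README. One plausible reading, based only on the parameter name, is that the generated tokens carry a constant offset (for example, so that codec indices occupy their own range inside a larger shared vocabulary) and that the offset is subtracted again before decoding. The sketch below illustrates that idea in isolation. The offset value and the helper names `add_offset` and `remove_offset` are hypothetical and do not come from the DistilCodec code.

```python
from typing import List

CODEBOOK_SIZE = 32768          # codebook size stated in this README
HYPOTHETICAL_OFFSET = 100_000  # made-up value; NOT DistilCodec's real offset


def add_offset(codes: List[int], offset: int = HYPOTHETICAL_OFFSET) -> List[int]:
    """Shift raw codebook indices into a reserved token range."""
    return [c + offset for c in codes]


def remove_offset(tokens: List[int], offset: int = HYPOTHETICAL_OFFSET) -> List[int]:
    """Undo the shift and check that every result is a valid codebook index."""
    codes = [t - offset for t in tokens]
    for c in codes:
        if not 0 <= c < CODEBOOK_SIZE:
            raise ValueError(f"token maps to invalid codebook index {c}")
    return codes


raw_codes = [0, 17, 32767]
shifted = add_offset(raw_codes)
restored = remove_offset(shifted)
print(shifted)
print(restored)
```

If this reading is right, tokens that were produced without an offset should be decoded with `minus_token_offset=False`; check the DistilCodec source to confirm the actual behaviour.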
Available DistilCodec models
🤗 links to the Huggingface model hub.
Model Version | Huggingface | Corpus | Token/s | Domain | Open-Source |
---|---|---|---|---|---|
DistilCodec-v1.0 | 🤗 | Universal Audio | 93 | Audiobook, Speech, Audio Effects | √ |
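As a back-of-the-envelope check, the token rate in the table and the codebook size stated in the introduction together imply a bitrate. The sketch below assumes one token per frame from a single codebook and fixed-length coding of each index (no entropy coding). The helper `tokens_for_duration` is illustrative only, and the 93 tokens/s figure is treated as exact even though it may be rounded.

```python
import math

CODEBOOK_SIZE = 32768    # codes in the single codebook
TOKENS_PER_SECOND = 93   # "Token/s" column of the table above

# Each token selects one of 32768 entries, i.e. log2(32768) bits.
bits_per_token = math.log2(CODEBOOK_SIZE)
bitrate_bps = TOKENS_PER_SECOND * bits_per_token


def tokens_for_duration(seconds: float) -> int:
    """Approximate number of tokens produced for a clip of the given length."""
    return round(seconds * TOKENS_PER_SECOND)


print(f"{bits_per_token:.0f} bits per token")
print(f"{bitrate_bps:.0f} bit/s (about {bitrate_bps / 1000:.1f} kbps)")
print(f"{tokens_for_duration(10)} tokens for a 10 s clip")
```

Under these assumptions, the codec operates at roughly 1.4 kbps, and a 10-second clip yields on the order of 930 tokens.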
References
The overall training pipeline of DistilCodec draws inspiration from AcademiCodec, while its encoder and decoder design is adapted from fish-speech. The Vector Quantization (VQ) component implements GRFVQ using the vector-quantize-pytorch framework. These three exceptional works have provided invaluable assistance in our implementation of DistilCodec. Below are links to these reference projects:
[1] vector-quantize-pytorch
[2] AcademiCodec
[3] fish-speech
Citation
If you find this code useful in your research, please cite our work:
@article{wang2025unitts,
  title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information},
  author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
  journal={arXiv preprint arXiv:2408.16532},
  year={2025}
}