---
license: cc-by-nc-4.0
---

# DistilCodec

The Joint Laboratory of the International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd., has launched DistilCodec - a single-codebook Neural Audio Codec (NAC) with 32,768 codes trained on universal audio.

[![arXiv](https://img.shields.io/badge/arXiv-Paper-.svg)](https://arxiv.org/abs/2408.16532)
[![model](https://img.shields.io/badge/%F0%9F%A4%97%20DistilCodec-Models-blue)](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0)

# 🔥 News

- *2025.05.25*: We released the code of DistilCodec-v1.0, including training and inference.
- *2025.05.23*: We released UniTTS and DistilCodec on [arXiv](https://arxiv.org/abs/2408.16532).

## Introduction of DistilCodec

The foundational network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to the one proposed in SoundStream. The encoder employs a ConvNeXt-V2 structure, the vector quantization module implements the GRFVQ scheme, and the decoder employs a ConvTranspose1d-based configuration similar to HiFi-GAN. Detailed network specifications and layer configurations are provided in Appendix A.1. The training methodology of DistilCodec follows a similar approach to HiFi-GAN, incorporating three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSFTFD). Here is the architecture of DistilCodec:

![The Architecture of DistilCodec](./data/distilcodec_architecture.jpg)

The distribution of DistilCodec's training data is shown in the table below:

| **Data Category**    | **Data Size (in hours)** |
|----------------------|--------------------------|
| Chinese Audiobook    | 38,000                   |
| Chinese Common Audio | 20,000                   |
| English Audio        | 40,000                   |
| Music                | 2,000                    |
| **Total**            | **100,000**              |

## Inference of DistilCodec

The code is available in the [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec) repository.
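To make the VQ step concrete, here is a toy NumPy sketch of single-codebook nearest-neighbor quantization, the basic operation behind a 32768-entry codebook. The array shapes, sizes, and function names are illustrative assumptions, not DistilCodec's actual implementation or API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a codebook of K entries of dimension D. DistilCodec uses a
# single codebook with 32768 codes; we use a tiny one for illustration.
K, D = 16, 8
codebook = rng.normal(size=(K, D))

# Stand-in for encoder output: T frames of D-dimensional latents.
T = 5
latents = rng.normal(size=(T, D))

def quantize(latents, codebook):
    """Map each latent frame to the index of its nearest codebook entry."""
    # Pairwise squared distances between frames and codes, shape (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(axis=1)   # discrete audio tokens, shape (T,)
    dequantized = codebook[codes]  # what the decoder would receive
    return codes, dequantized

codes, dequantized = quantize(latents, codebook)
print(codes)  # T integers, each in [0, K)
```

In the real codec these indices are the "audio tokens" the inference examples below print, and the decoder reconstructs the waveform from the corresponding codebook vectors.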
### Part 1: Generating discrete codecs

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000
)
print(audio_tokens)
```

### Part 2: Reconstructing audio from a raw wav

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000
)
print(audio_tokens)

# Set the save path for the generated audio; the file is written to
# f'{gen_audio_save_path}/{audio_name}.wav'
gen_audio_save_path = 'path_to_save_path'
audio_name = 'your_audio_name'

y_gen = codec.decode_from_codes(
    audio_tokens,
    minus_token_offset=True
)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)
```

## Available DistilCodec models

The 🤗 links point to the Hugging Face model hub.

| Model Version    | Hugging Face | Corpus          | Tokens/s | Domain                           | Open-Source |
|------------------|--------------|-----------------|----------|----------------------------------|-------------|
| DistilCodec-v1.0 | [🤗](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0) | Universal Audio | 93 | Audiobook, Speech, Audio Effects | ✓ |

## References

The overall training pipeline of DistilCodec draws inspiration from AcademiCodec, while its encoder and decoder design is adapted from fish-speech.
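A common pattern behind a flag like `minus_token_offset` is that codec tokens are stored shifted by a fixed offset, for example so they can share an id space with a text tokenizer's vocabulary, and must be shifted back before the codec decoder can look them up. The sketch below illustrates only that idea; the offset value, vocabulary size, and helper names are made-up assumptions, not DistilCodec's actual constants.

```python
# Hypothetical values for illustration only.
TEXT_VOCAB_SIZE = 50000   # assumed size of a text tokenizer's vocabulary
CODEBOOK_SIZE = 32768     # DistilCodec's single-codebook size

def to_lm_tokens(codec_codes):
    """Shift raw codec codes into a combined (text + audio) id space."""
    return [c + TEXT_VOCAB_SIZE for c in codec_codes]

def to_codec_codes(lm_tokens):
    """Undo the shift, conceptually what `minus_token_offset=True` does."""
    codes = [t - TEXT_VOCAB_SIZE for t in lm_tokens]
    assert all(0 <= c < CODEBOOK_SIZE for c in codes), "not codec tokens"
    return codes

raw = [0, 123, 32767]
shifted = to_lm_tokens(raw)
print(shifted)                  # [50000, 50123, 82767]
print(to_codec_codes(shifted))  # [0, 123, 32767]
```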
The Vector Quantization (VQ) component implements GRFVQ using the vector-quantize-pytorch framework. These three excellent works provided invaluable assistance in our implementation of DistilCodec. Links to the reference projects:

[1] [vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch)
[2] [AcademiCodec](https://github.com/moewiee/hificodec)
[3] [fish-speech](https://github.com/fishaudio/fish-speech)

## Citation

If you find this code useful in your research, please cite our work:

```
@article{wang2025unitts,
  title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information},
  author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
  journal={arXiv preprint arXiv:2408.16532},
  year={2025}
}
```