---
license: cc-by-nc-4.0
---

# DistilCodec

The Joint Laboratory of the International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd., has launched DistilCodec, a single-codebook neural audio codec (NAC) with 32,768 codes trained on universal audio.

arXiv model

## 🔥 News

- 2025.05.25: We release the code of DistilCodec-v1.0, including training and inference.
- 2025.05.23: We release UniTTS and DistilCodec on arXiv.

## Introduction of DistilCodec

The foundational network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to that proposed in SoundStream. The encoder employs a ConvNeXt-V2 structure, while the vector quantization module implements the GRFVQ scheme. The decoder employs a ConvTranspose1d-based architecture similar to HiFiGAN. The training methodology of DistilCodec also follows HiFiGAN, incorporating three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSFTFD).

(Figure: The Architecture of DistilCodec)

The distribution of DistilCodec's training data is shown in the table below:

| Data Category        | Data Size (hours) |
|----------------------|-------------------|
| Chinese Audiobook    | 38,000            |
| Chinese Common Audio | 20,000            |
| English Speech       | 40,000            |
| Music                | 2,000             |
| **Total**            | **100,000**       |
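The single-codebook quantization step described in the architecture section can be sketched as follows. This is an illustrative toy example, not the actual GRFVQ implementation: the codebook values, the embedding dimension, and the frame count are all made up for the demonstration; only the codebook size (32,768) comes from the model description.

```python
import numpy as np

# Toy sketch of single-codebook vector quantization: every encoder frame
# is mapped to the index of its nearest codebook entry, so an audio clip
# becomes a sequence of integer tokens in [0, 32768).
rng = np.random.default_rng(0)
num_codes, dim = 32768, 64                    # dim is an assumption for illustration
codebook = rng.normal(size=(num_codes, dim))  # hypothetical learned codebook
frames = rng.normal(size=(10, dim))           # 10 hypothetical encoder output frames

# Squared Euclidean distance from every frame to every code, then argmin
d2 = (frames ** 2).sum(1, keepdims=True) - 2 * frames @ codebook.T + (codebook ** 2).sum(1)
tokens = d2.argmin(axis=1)                    # one integer token per frame
print(tokens.shape)                           # (10,)
```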

## Inference of DistilCodec

The inference code is in the DistilCodec repository.

### Part 1: Generating discrete audio tokens from DistilCodec

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = '/path/to/audio_file'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    # If plus_llm_offset is True, the LLM's vocabulary size is added to each
    # audio token; DistilCodec's default offset comes from Qwen2.5-7B.
    plus_llm_offset=True
)
print(audio_tokens)
```
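The effect of `plus_llm_offset` can be illustrated with plain arithmetic: codec token ids are shifted past the LLM's text vocabulary so that audio and text tokens share a single id space. The helper names and the exact offset below are assumptions for illustration; the real offset used by DistilCodec is taken from the Qwen2.5-7B vocabulary.

```python
# Assumed text-vocabulary size; the actual value comes from Qwen2.5-7B.
LLM_VOCAB_SIZE = 151_665

def add_offset(codec_tokens, offset=LLM_VOCAB_SIZE):
    """Shift codec ids into the shared LLM id space (plus_llm_offset=True)."""
    return [t + offset for t in codec_tokens]

def remove_offset(llm_tokens, offset=LLM_VOCAB_SIZE):
    """Shift back before decoding (minus_token_offset=True)."""
    return [t - offset for t in llm_tokens]

raw = [0, 42, 32767]              # codec ids lie in [0, 32768)
shifted = add_offset(raw)
assert remove_offset(shifted) == raw
print(shifted)
```

The round trip is why `minus_token_offset` must mirror `plus_llm_offset`: decoding with mismatched settings would index the codebook with out-of-range ids.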

### Part 2: Reconstructing audio from raw audio

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = '/path/to/audio_file'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    # If plus_llm_offset is True, the LLM's vocabulary size is added to each
    # audio token; DistilCodec's default offset comes from Qwen2.5-7B.
    plus_llm_offset=True
)
print(audio_tokens)

# The generated audio is saved to f'{gen_audio_save_path}/{audio_name}.wav'
gen_audio_save_path = '/path/to/audio_save_path'
audio_name = 'audio_name'
y_gen = codec.decode_from_codes(
    audio_tokens,
    # Must be True if plus_llm_offset was True in demo_for_generate_audio_codes.
    minus_token_offset=True
)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)
```

## Available DistilCodec models

| Model Version    | Huggingface | Corpus          | Token/s | Domain                           |
|------------------|-------------|-----------------|---------|----------------------------------|
| DistilCodec-v1.0 | HF          | Universal Audio | 93      | Audiobook, Speech, Audio Effects |
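The figures in the table imply a very low bitrate, which can be checked with back-of-envelope arithmetic: a single 32,768-entry codebook needs log2(32768) = 15 bits per token, and at 93 tokens per second that comes to roughly 1.4 kbps.

```python
import math

# Bitrate implied by the model table: tokens/s times bits per token.
tokens_per_second = 93
bits_per_token = math.log2(32768)              # 15 bits for a 32768-entry codebook
bitrate_bps = tokens_per_second * bits_per_token
print(bits_per_token, bitrate_bps)             # 15.0 1395.0  (~1.4 kbps)
```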

## Citation

If you find this code useful in your research, please cite our work:

```bibtex
@article{wang2025unitts,
  title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information},
  author={Wang, Rui and Sun, Qianguo and Chen, Tianrong and Zeng, Zhiyun and Wu, Junlong and Zhang, Jiaxing},
  journal={arXiv preprint arXiv:2408.16532},
  year={2025}
}
```