	---
	license: cc-by-nc-4.0
	---
	# DistilCodec
	The Joint Laboratory of International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd., has launched DistilCodec, a single-codebook Neural Audio Codec (NAC) with 32768 codes trained on universal audio.


	[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2408.16532)
	[![model](https://img.shields.io/badge/%F0%9F%A4%97%20DistilCodec-Models-blue)](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0)


	# 🔥 News
	- 2025.05.25: We release the code of DistilCodec-v1.0, including training and inference.
	- 2025.05.23: We release UniTTS and DistilCodec on [arXiv](https://arxiv.org/abs/2408.16532).

	## Introduction of DistilCodec
	The network architecture of DistilCodec follows an Encoder-VQ-Decoder framework
	similar to the one proposed in SoundStream. The encoder uses a ConvNeXt-V2 structure,
	while the vector quantization module implements the GRFVQ scheme. The decoder
	uses a ConvTranspose1d-based architecture similar to HiFiGAN. The training methodology
	of DistilCodec also follows HiFiGAN and incorporates three types of discriminators:
	Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and
	Multi-STFT Discriminator (MSFTFD). The architecture of DistilCodec is shown here:
	<img src="./figure.jpg" alt="The Architecture of DistilCodec" style="width: 100%; height: auto;" />
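To make the Encoder-VQ-Decoder idea more concrete, here is a minimal sketch of the quantization step in a single-codebook codec: each encoder output frame is replaced by the index of its nearest codebook vector, and the decoder later looks the vector up again. This is plain nearest-neighbour vector quantization on a toy codebook, not the GRFVQ scheme or the actual DistilCodec implementation; the function names `quantize` and `dequantize`, the 4-entry codebook, and the 2-dimensional frames are all illustrative assumptions.

```python
def quantize(frames, codebook):
    """Return, for each frame, the index of the nearest codebook vector (squared L2)."""
    indices = []
    for frame in frames:
        best_index = min(
            range(len(codebook)),
            key=lambda i: sum((f - c) ** 2 for f, c in zip(frame, codebook[i])),
        )
        indices.append(best_index)
    return indices


def dequantize(indices, codebook):
    """Look up the codebook vector for each index (the input a decoder would see)."""
    return [codebook[i] for i in indices]


# Toy codebook with 4 codes of dimension 2 (hypothetical values).
# DistilCodec itself uses a single codebook with 32768 codes.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Pretend these are three encoder output frames.
frames = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

tokens = quantize(frames, codebook)
print(tokens)                        # one discrete token per frame
print(dequantize(tokens, codebook))  # quantized frames passed on to the decoder
```

With a single codebook, every frame becomes exactly one integer token, which is what makes the token stream convenient to feed into a language model.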
	The distribution of DistilCodec's training data is shown in the table below:

	| Data Category        | Data Size (hours) |
	|----------------------|-------------------|
	| Chinese Audiobook    | 38000             |
	| Chinese Common Audio | 20000             |
	| English Speech       | 40000             |
	| Music                | 2000              |
	| Total                | 100000            |

	## Inference of DistilCodec
	The inference code is available in the [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec) repository.

	### Part 1: Generate discrete audio tokens with DistilCodec

	```python
	from distil_codec import DistilCodec, demo_for_generate_audio_codes

	# Model config, checkpoint path, and the training step of the checkpoint to load.
	codec_model_config_path = '/path/to/distilcodec/model_config.json'
	codec_ckpt_path = '/path/to/distilcodec_ckpt'
	step = 204000

	# Load the pretrained codec and switch it to evaluation mode.
	codec = DistilCodec.from_pretrained(
	    config_path=codec_model_config_path,
	    model_path=codec_ckpt_path,
	    load_steps=step,
	    use_generator=True,
	    is_debug=False).eval()

	audio_path = '/path/to/audio_file'
	# If plus_llm_offset is True, the LLM's vocabulary size is added to every audio token.
	# DistilCodec's default vocabulary size is taken from Qwen2.5-7B.
	audio_tokens = demo_for_generate_audio_codes(
	    codec,
	    audio_path,
	    target_sr=24000,
	    plus_llm_offset=True
	)
	print(audio_tokens)
	```
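The effect of `plus_llm_offset` can be pictured as a simple shift of the codec indices into an ID range above the text vocabulary, so that audio tokens and text tokens do not collide. The sketch below only illustrates that idea and is not DistilCodec's code: `add_llm_offset`, `remove_llm_offset`, and the value of `LLM_VOCAB_SIZE` are hypothetical placeholders, and the real offset used by the library for Qwen2.5-7B may differ.

```python
CODEBOOK_SIZE = 32768      # number of codes in DistilCodec's codebook (from this README)
LLM_VOCAB_SIZE = 150_000   # hypothetical placeholder, NOT the real Qwen2.5-7B value


def add_llm_offset(codes, vocab_size=LLM_VOCAB_SIZE):
    """Shift raw codec indices so they sit above the text vocabulary."""
    return [code + vocab_size for code in codes]


def remove_llm_offset(tokens, vocab_size=LLM_VOCAB_SIZE):
    """Undo the shift and check that every result is a valid codec index."""
    codes = [token - vocab_size for token in tokens]
    for code in codes:
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"token maps to invalid codec index {code}")
    return codes


raw_codes = [0, 5, 32767]
llm_tokens = add_llm_offset(raw_codes)
print(llm_tokens)
print(remove_llm_offset(llm_tokens) == raw_codes)
```

The range check shows why the two flags have to agree: decoding offset tokens without subtracting the offset (or the reverse) would produce indices outside the codebook.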

	### Part 2: Reconstruct audio from raw audio
	```python
	from distil_codec import DistilCodec, demo_for_generate_audio_codes

	# Model config, checkpoint path, and the training step of the checkpoint to load.
	codec_model_config_path = '/path/to/distilcodec/model_config.json'
	codec_ckpt_path = '/path/to/distilcodec_ckpt'
	step = 204000

	# Load the pretrained codec and switch it to evaluation mode.
	codec = DistilCodec.from_pretrained(
	    config_path=codec_model_config_path,
	    model_path=codec_ckpt_path,
	    load_steps=step,
	    use_generator=True,
	    is_debug=False).eval()

	audio_path = '/path/to/audio_file'
	# If plus_llm_offset is True, the LLM's vocabulary size is added to every audio token.
	# DistilCodec's default vocabulary size is taken from Qwen2.5-7B.
	audio_tokens = demo_for_generate_audio_codes(
	    codec,
	    audio_path,
	    target_sr=24000,
	    plus_llm_offset=True
	)
	print(audio_tokens)

	# The generated audio is saved to f'{gen_audio_save_path}/{audio_name}.wav'.
	gen_audio_save_path = '/path/to/audio_save_path'
	audio_name = 'audio_name'
	# minus_token_offset must be True if plus_llm_offset was True when the tokens were generated.
	y_gen = codec.decode_from_codes(
	    audio_tokens,
	    minus_token_offset=True
	)
	codec.save_wav(
	    audio_gen_batch=y_gen,
	    nhop_lengths=[y_gen.shape[-1]],
	    save_path=gen_audio_save_path,
	    name_tag=audio_name
	)
	```

	## Available DistilCodec models
	| Model Version    | Hugging Face | Corpus          | Token/s | Domain                           |
	|------------------|--------------|-----------------|---------|----------------------------------|
	| DistilCodec-v1.0 | [HF](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0) | Universal Audio | 93 | Audiobook, Speech, Audio Effects |
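As a back-of-the-envelope check, the token rate in the table and the codebook size of 32768 give a nominal bitrate. The sketch below assumes every token is stored with a fixed-length code (log2 of the codebook size, no entropy coding) and treats 93 tokens per second as an exact rate; the helper `tokens_for_duration` is a hypothetical name used only for illustration.

```python
import math

CODEBOOK_SIZE = 32768     # codes in the single codebook (from this README)
TOKENS_PER_SECOND = 93    # token rate listed in the model table

# A fixed-length code needs log2(codebook size) bits per token.
bits_per_token = math.log2(CODEBOOK_SIZE)
bitrate_bps = TOKENS_PER_SECOND * bits_per_token


def tokens_for_duration(seconds):
    """Approximate number of audio tokens produced for a clip of the given length."""
    return round(seconds * TOKENS_PER_SECOND)


print(f"{bits_per_token:.0f} bits per token")
print(f"{bitrate_bps / 1000:.3f} kbps nominal bitrate")
print(f"{tokens_for_duration(10)} tokens for a 10-second clip")
```

Under these assumptions each token carries 15 bits, so the stream is roughly 1.4 kbps, and a 10-second clip corresponds to about 930 tokens of language-model context.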


	## Citation

	If you find this code useful in your research, please cite our work:

	```
	@article{wang2025unitts,
	  title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information},
	  author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
	  journal={arXiv preprint arXiv:2408.16532},
	  year={2025}
	}
	```