---
license: cc-by-nc-4.0
---
# DistilCodec
The Joint Laboratory of International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd., has launched DistilCodec, a single-codebook Neural Audio Codec (NAC) with 32768 codes trained on universal audio.

[![arXiv](https://img.shields.io/badge/arXiv-Paper-red.svg)](https://arxiv.org/abs/2408.16532)
[![model](https://img.shields.io/badge/%F0%9F%A4%97%20DistilCodec-Models-blue)](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0)

## 🔥 News
- *2025.05.25*: We release the code of DistilCodec-v1.0, including training and inference.
- *2025.05.23*: We release UniTTS and DistilCodec on [arXiv](https://arxiv.org/abs/2408.16532).

## Introduction of DistilCodec
The foundational network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to that proposed in SoundStream. The encoder employs a ConvNeXt-V2 structure, while the vector-quantization module implements the GRFVQ scheme. The decoder employs a ConvTranspose1d-based architectural configuration similar to HiFi-GAN. Detailed network specifications and layer configurations are provided in Appendix A.1 of the paper. The training methodology of DistilCodec follows a similar approach to HiFi-GAN, incorporating three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSFTFD). Here is the architecture of DistilCodec:
![The Architecture of DistilCodec](./data/distilcodec_architecture.jpg)
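The heart of any Encoder-VQ-Decoder codec is the quantization step: each encoder output vector is snapped to its nearest codebook entry and represented by that entry's index. The toy sketch below illustrates only that nearest-neighbor lookup; DistilCodec's actual GRFVQ scheme (grouped, residual quantization distilled into a single 32768-entry codebook) is considerably more involved, and the codebook here is an arbitrary 4-entry illustration.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`
    under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

# Toy 4-entry codebook of 2-d codes (a real NAC codebook is e.g. 32768 entries).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
idx, code = quantize((0.9, 0.1), codebook)
print(idx, code)  # -> 1 (1.0, 0.0)
```

The decoder later maps the stored index back to its code vector and reconstructs the waveform from the resulting sequence.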
Distribution of the DistilCodec training data is shown in the table below:

| **Data Category**    | **Data Size (in hours)** |
|----------------------|--------------------------|
| Chinese Audiobook    | 38000                    |
| Chinese Common Audio | 20000                    |
| English Audio        | 40000                    |
| Music                | 2000                     |
| **Total**            | **100000**               |

## Inference of DistilCodec
The code is available at [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec).

### Part 1: Generating discrete codecs

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

# Paths to the model config and checkpoint downloaded from the model hub.
codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

# Encode an audio file (resampled to 24 kHz) into discrete codec tokens.
audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(codec, audio_path, target_sr=24000)
print(audio_tokens)
```
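For downstream modeling (e.g. feeding codec tokens to a language model, as UniTTS does), a long token sequence is typically windowed into fixed-length chunks. A minimal, codec-agnostic sketch, assuming `audio_tokens` is a flat list of codebook indices; the chunk size here is an arbitrary illustration, not a DistilCodec requirement:

```python
def chunk_tokens(tokens, chunk_size):
    """Split a flat token list into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# At ~93 tokens/s, a 930-token sequence is roughly 10 seconds of audio.
tokens = list(range(930))
chunks = chunk_tokens(tokens, 256)
print(len(chunks), len(chunks[-1]))  # -> 4 162 (last chunk holds the remainder)
```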

### Part 2: Reconstructing audio from a raw wav

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(codec, audio_path, target_sr=24000)
print(audio_tokens)

# Decode the tokens back to a waveform and save it.
# The generated audio is saved to f'{gen_audio_save_path}/{audio_name}.wav'.
gen_audio_save_path = 'path_to_save_path'
audio_name = 'your_audio_name'
y_gen = codec.decode_from_codes(audio_tokens, minus_token_offset=True)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)
```

## Available DistilCodec models
🤗 links to the Hugging Face model hub.

| Model Version    | Huggingface | Corpus          | Token/s | Domain                           | Open-Source |
|------------------|-------------|-----------------|---------|----------------------------------|-------------|
| DistilCodec-v1.0 | [🤗](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0) | Universal Audio | 93 | Audiobook, Speech, Audio Effects | ✓ |
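The single-codebook design makes the nominal bitrate easy to work out from the numbers above: each token indexes one of 32768 codes, i.e. carries log2(32768) = 15 bits, so at 93 tokens per second the stream is about 1.4 kbps. A quick check:

```python
import math

codebook_size = 32768    # single codebook, per the model card
tokens_per_second = 93   # from the table above

bits_per_token = math.log2(codebook_size)          # 15.0
bitrate_bps = tokens_per_second * bits_per_token   # 1395.0 bits/s ~= 1.4 kbps
print(bits_per_token, bitrate_bps)  # -> 15.0 1395.0
```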

## References
The overall training pipeline of DistilCodec draws inspiration from AcademiCodec, while its encoder and decoder design is adapted from fish-speech. The Vector Quantization (VQ) component implements GRFVQ using the vector-quantize-pytorch framework. These three excellent projects provided invaluable assistance in our implementation of DistilCodec. Links to the reference projects:

[1] [vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch)

[2] [AcademiCodec](https://github.com/moewiee/hificodec)

[3] [fish-speech](https://github.com/fishaudio/fish-speech)


## Citation

If you find this code useful in your research, please cite our work:

```bibtex
@article{wang2025unitts,
  title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information},
  author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
  journal={arXiv preprint arXiv:2408.16532},
  year={2025}
}
```