---
license: mit
language:
- ar
- da
- de
- el
- en
- es
- fi
- fr
- he
- hi
- it
- ja
- ko
- ms
- nl
- no
- pl
- pt
- ru
- sv
- sw
- tr
- zh
pipeline_tag: text-to-speech
tags:
- text-to-speech
- speech
- speech-generation
- voice-cloning
- multilingual-tts
library_name: chatterbox
---

<img width="800" alt="cb-big2" src="https://github.com/user-attachments/assets/bd8c5f03-e91d-4ee5-b680-57355da204d1" />

<h1 style="font-size: 32px">Chatterbox TTS</h1>

<div style="display: flex; align-items: center; gap: 12px">
  <a href="https://resemble-ai.github.io/chatterbox_demopage/">
    <img src="https://img.shields.io/badge/listen-demo_samples-blue" alt="Listen to Demo Samples" />
  </a>
  <a href="https://huggingface.co/spaces/ResembleAI/Chatterbox">
    <img src="https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg" alt="Open in HF Spaces" />
  </a>
  <a href="https://podonos.com/resembleai/chatterbox">
    <img src="https://static-public.podonos.com/badges/insight-on-pdns-sm-dark.svg" alt="Insight on Podonos" />
  </a>
</div>

<div style="display: flex; align-items: center; gap: 8px;">
  <span style="font-style: italic;white-space: pre-wrap">Made with ❤️ by</span>
  <img width="100" alt="resemble-logo-horizontal" src="https://github.com/user-attachments/assets/35cf756b-3506-4943-9c72-c05ddfa4e525" />
</div>

**Chatterbox** is [Resemble AI's](https://resemble.ai) production-grade open-source TTS model. Chatterbox supports **English** out of the box. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems such as ElevenLabs and is consistently preferred in side-by-side evaluations.

Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open-source TTS model to support **emotion exaggeration control**, a powerful feature that makes your voices stand out.


# Key Details
- SoTA zero-shot English TTS
- 0.5B Llama backbone
- Unique exaggeration/intensity control
- Ultra-stable with alignment-informed inference
- Trained on 0.5M hours of cleaned data
- Watermarked outputs (optional)
- Easy voice conversion script using onnxruntime
- [Outperforms ElevenLabs](https://podonos.com/resembleai/chatterbox)

# Tips
- **General Use (TTS and Voice Agents):**
  - The default settings (`exaggeration=0.5`, `cfg=0.5`) work well for most prompts.

- **Expressive or Dramatic Speech:**
  - Try increasing `exaggeration` to around `0.7` or higher; see the example after the Usage script below.
  - Higher `exaggeration` tends to speed up speech.

# Usage
[ONNX Export and Inference script](https://github.com/VladOS95-cyber/onnx_conversion_scripts/tree/main/chatterbox)

```python
import onnxruntime

from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

import numpy as np
from tqdm import tqdm
import librosa
import soundfile as sf

S3GEN_SR = 24000  # sampling rate of the reference and generated audio

# Special tokens delimiting the generated speech-token sequence
START_SPEECH_TOKEN = 6561
STOP_SPEECH_TOKEN = 6562


class RepetitionPenaltyLogitsProcessor:
    def __init__(self, penalty: float):
        if not isinstance(penalty, float) or not (penalty > 0):
            raise ValueError(f"`penalty` must be a strictly positive float, but is {penalty}")
        self.penalty = penalty

    def __call__(self, input_ids: np.ndarray, scores: np.ndarray) -> np.ndarray:
        # Dampen the scores of already-generated tokens: negative scores are
        # multiplied by the penalty, positive scores are divided by it.
        score = np.take_along_axis(scores, input_ids, axis=1)
        score = np.where(score < 0, score * self.penalty, score / self.penalty)
        scores_processed = scores.copy()
        np.put_along_axis(scores_processed, input_ids, score, axis=1)
        return scores_processed


def run_inference(
    text="The Lord of the Rings is the greatest work of literature.",
    target_voice_path=None,
    max_new_tokens=256,
    exaggeration=0.5,
    output_dir="converted",
    output_file_name="output.wav",
    apply_watermark=True,
):
    model_id = "onnx-community/chatterbox-onnx"
    if not target_voice_path:
        target_voice_path = hf_hub_download(repo_id=model_id, filename="default_voice.wav", local_dir=output_dir)

    ## Load model
    speech_encoder_path = hf_hub_download(repo_id=model_id, filename="speech_encoder.onnx", local_dir=output_dir, subfolder='onnx')
    hf_hub_download(repo_id=model_id, filename="speech_encoder.onnx_data", local_dir=output_dir, subfolder='onnx')
    embed_tokens_path = hf_hub_download(repo_id=model_id, filename="embed_tokens.onnx", local_dir=output_dir, subfolder='onnx')
    hf_hub_download(repo_id=model_id, filename="embed_tokens.onnx_data", local_dir=output_dir, subfolder='onnx')
    conditional_decoder_path = hf_hub_download(repo_id=model_id, filename="conditional_decoder.onnx", local_dir=output_dir, subfolder='onnx')
    hf_hub_download(repo_id=model_id, filename="conditional_decoder.onnx_data", local_dir=output_dir, subfolder='onnx')
    language_model_path = hf_hub_download(repo_id=model_id, filename="language_model.onnx", local_dir=output_dir, subfolder='onnx')
    hf_hub_download(repo_id=model_id, filename="language_model.onnx_data", local_dir=output_dir, subfolder='onnx')

    ## Start inference sessions
    speech_encoder_session = onnxruntime.InferenceSession(speech_encoder_path)
    embed_tokens_session = onnxruntime.InferenceSession(embed_tokens_path)
    llama_with_past_session = onnxruntime.InferenceSession(language_model_path)
    cond_decoder_session = onnxruntime.InferenceSession(conditional_decoder_path)

    def execute_text_to_audio_inference(text):
        print("Start inference script...")

        audio_values, _ = librosa.load(target_voice_path, sr=S3GEN_SR)
        audio_values = audio_values[np.newaxis, :].astype(np.float32)

        ## Prepare input
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        input_ids = tokenizer(text, return_tensors="np")["input_ids"].astype(np.int64)

        position_ids = np.where(
            input_ids >= START_SPEECH_TOKEN,
            0,
            np.arange(input_ids.shape[1])[np.newaxis, :] - 1,
        )

        ort_embed_tokens_inputs = {
            "input_ids": input_ids,
            "position_ids": position_ids,
            "exaggeration": np.array([exaggeration], dtype=np.float32),
        }

        ## Instantiate the logits processors.
        repetition_penalty = 1.2
        repetition_penalty_processor = RepetitionPenaltyLogitsProcessor(penalty=repetition_penalty)

        num_hidden_layers = 30
        num_key_value_heads = 16
        head_dim = 64

        generate_tokens = np.array([[START_SPEECH_TOKEN]], dtype=np.int64)

        # ---- Generation Loop using kv_cache ----
        for i in tqdm(range(max_new_tokens), desc="Sampling", dynamic_ncols=True):
            inputs_embeds = embed_tokens_session.run(None, ort_embed_tokens_inputs)[0]
            if i == 0:
                # First step: prepend the reference-voice conditioning and
                # initialize an empty KV cache.
                ort_speech_encoder_input = {
                    "audio_values": audio_values,
                }
                cond_emb, prompt_token, ref_x_vector, prompt_feat = speech_encoder_session.run(None, ort_speech_encoder_input)
                inputs_embeds = np.concatenate((cond_emb, inputs_embeds), axis=1)

                ## Prepare llm inputs
                batch_size, seq_len, _ = inputs_embeds.shape
                past_key_values = {
                    f"past_key_values.{layer}.{kv}": np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
                    for layer in range(num_hidden_layers)
                    for kv in ("key", "value")
                }
                attention_mask = np.ones((batch_size, seq_len), dtype=np.int64)
                llm_position_ids = np.cumsum(attention_mask, axis=1, dtype=np.int64) - 1

            logits, *present_key_values = llama_with_past_session.run(None, dict(
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                position_ids=llm_position_ids,
                **past_key_values,
            ))

            logits = logits[:, -1, :]
            next_token_logits = repetition_penalty_processor(generate_tokens, logits)

            next_token = np.argmax(next_token_logits, axis=-1, keepdims=True).astype(np.int64)
            generate_tokens = np.concatenate((generate_tokens, next_token), axis=-1)
            if (next_token.flatten() == STOP_SPEECH_TOKEN).all():
                break

            # Get embedding for the new token.
            position_ids = np.full(
                (input_ids.shape[0], 1),
                i + 1,
                dtype=np.int64,
            )
            ort_embed_tokens_inputs["input_ids"] = next_token
            ort_embed_tokens_inputs["position_ids"] = position_ids

            ## Update values for next generation loop
            attention_mask = np.concatenate([attention_mask, np.ones((batch_size, 1), dtype=np.int64)], axis=1)
            llm_position_ids = llm_position_ids[:, -1:] + 1
            for j, key in enumerate(past_key_values):
                past_key_values[key] = present_key_values[j]

        speech_tokens = generate_tokens[:, 1:-1]  # strip the start/stop speech tokens
        speech_tokens = np.concatenate([prompt_token, speech_tokens], axis=1)
        return speech_tokens, ref_x_vector, prompt_feat

    speech_tokens, speaker_embeddings, speaker_features = execute_text_to_audio_inference(text)
    cond_decoder_input = {
        "speech_tokens": speech_tokens,
        "speaker_embeddings": speaker_embeddings,
        "speaker_features": speaker_features,
    }
    wav = cond_decoder_session.run(None, cond_decoder_input)[0]
    wav = np.squeeze(wav, axis=0)

    # Optional: Apply watermark
    if apply_watermark:
        import perth
        watermarker = perth.PerthImplicitWatermarker()
        wav = watermarker.apply_watermark(wav, sample_rate=S3GEN_SR)

    sf.write(output_file_name, wav, S3GEN_SR)
    print(f"{output_file_name} was successfully saved")


if __name__ == "__main__":
    run_inference(
        text="Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill.",
        exaggeration=0.5,
        output_file_name="output.wav",
        apply_watermark=False,
    )
```
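
Per the Tips above, raising `exaggeration` gives more dramatic delivery. A minimal sketch reusing the `run_inference` helper defined in the script above (the prompt text and output filename here are hypothetical):

```python
# Reuses run_inference from the script above.
# Higher exaggeration is more expressive but tends to speed up speech.
run_inference(
    text="You shall not pass!",
    exaggeration=0.7,                 # above the 0.5 default, for dramatic speech
    output_file_name="dramatic.wav",
    apply_watermark=True,             # embed the Perth watermark (see below)
)
```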

# Acknowledgements
- [Xenova](https://huggingface.co/Xenova)
- [Vladislav Bronzov](https://github.com/VladOS95-cyber)
- [Resemble AI](https://github.com/resemble-ai/chatterbox)

# Built-in PerTh Watermarking for Responsible AI

Every audio file generated by Chatterbox includes [Resemble AI's Perth (Perceptual Threshold) Watermarker](https://github.com/resemble-ai/perth) - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy. In this ONNX pipeline, the watermark is embedded when `run_inference` is called with `apply_watermark=True` (the default).
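
To check a generated file for the watermark, the same watermarker can be run in extraction mode. A minimal sketch, assuming the `get_watermark` method of `PerthImplicitWatermarker` (consult the [perth](https://github.com/resemble-ai/perth) repository for the authoritative API):

```python
# Sketch: verifying the Perth watermark in a generated file.
# Assumes PerthImplicitWatermarker exposes get_watermark(wav, sample_rate=...);
# see the resemble-ai/perth README for the exact API.
import librosa
import perth

wav, sr = librosa.load("output.wav", sr=None)
watermarker = perth.PerthImplicitWatermarker()
watermark = watermarker.get_watermark(wav, sample_rate=sr)
print(f"Extracted watermark: {watermark}")
```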

# Disclaimer
Don't use this model to do bad things. Prompts are sourced from freely available data on the internet.