# 🚨 DRAFT: X-Codec2 (Transformers-compatible version)
The X-Codec2 model was proposed in *Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis*.
X-Codec2 is a neural audio codec designed to improve speech synthesis and general audio generation for large language model (LLM) pipelines. It extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized, enabling efficient and high-fidelity audio representation.
Key aspects of its architecture:
- Unified Semantic-Acoustic Tokenization: X-Codec2 fuses outputs from a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding, capturing both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre).
- Single-Stage Finite Scalar Quantization (FSQ): Unlike the multi-layer residual VQ used in most approaches (e.g., DAC, EnCodec, Mimi), X-Codec2 uses a single layer of Finite Scalar Quantization (FSQ) for stability and compatibility with causal, autoregressive LLMs.
- Transformer-Friendly Design: The 1D token structure of X-Codec2 naturally aligns with the autoregressive modeling in LLMs like LLaMA, improving training efficiency and downstream compatibility.
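To make the FSQ bullet above concrete, here is a minimal sketch of finite scalar quantization: each latent dimension is bounded and rounded to a small, fixed set of levels, so no learned codebook is needed. The level counts below are illustrative, not this checkpoint's actual configuration.

```python
import numpy as np

def fsq_quantize(z: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Sketch of finite scalar quantization (illustrative, not the real config)."""
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half      # squash each dimension into [-half, half]
    return np.round(bounded) / half  # snap to the nearest level, rescale to [-1, 1]

z = np.array([[0.3, -1.2, 2.5]])     # toy latent vector
levels = np.array([8, 5, 5])         # quantization levels per dimension (illustrative)
q = fsq_quantize(z, levels)
print(q)
```

Each output dimension now takes one of a handful of values, so the quantized vector maps to a single discrete token index, which is what makes the 1D token stream LLM-friendly.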
This checkpoint was contributed by Eric Bezzam and Steven Zheng. The original code can be found here.
## Usage example
Until X-Codec2 is merged into Transformers, you can install it from the following fork:

```bash
pip install git+https://github.com/ebezzam/transformers.git@add-xcodec2
```
Here is a quick example of how to encode and decode audio with this model:
```python
from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, Xcodec2Model

model_id = "bezzam/xcodec2"
model = Xcodec2Model.from_pretrained(model_id, device_map="auto")
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# Load an audio sample, resampled to the model's expected sampling rate
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio = dataset[0]["audio"]["array"]

inputs = feature_extractor(audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(
    model.device, model.dtype
)
print("Input waveform shape:", inputs["audio"].shape)
# Input waveform shape: torch.Size([1, 1, 94080])

# Encode and decode separately
audio_codes = model.encode(**inputs).audio_codes
print("Audio codes shape:", audio_codes.shape)
# Audio codes shape: torch.Size([1, 1, 294])

audio_values = model.decode(audio_codes).audio_values
print("Audio values shape:", audio_values.shape)
# Audio values shape: torch.Size([1, 1, 94080])

# Equivalently, encode and decode in one step
model_output = model(**inputs)
audio_codes = model_output.audio_codes
audio_values = model_output.audio_values
```
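The printed shapes imply the codec's token rate. A quick sanity check of the arithmetic, assuming the 16 kHz sampling rate used by this checkpoint's feature extractor:

```python
# Token-rate arithmetic implied by the shapes above (a sketch; 16 kHz is
# the sampling rate reported by this checkpoint's feature extractor).
sampling_rate = 16_000
num_samples = 94_080   # input waveform length
num_tokens = 294       # encoded sequence length

samples_per_token = num_samples // num_tokens       # hop size per code
tokens_per_second = sampling_rate / samples_per_token
print(samples_per_token, tokens_per_second)
# 320 50.0
```

So each discrete code covers 320 samples, i.e., the LLM only has to model 50 tokens per second of speech.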
## Batch processing
The original checkpoint and the code distributed via PyPI do not support batch processing, but this version does:
```python
from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, Xcodec2Model

batch_size = 2
model_id = "bezzam/xcodec2"
model = Xcodec2Model.from_pretrained(model_id, device_map="auto")
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# Load a batch of audio samples, resampled to the model's expected sampling rate
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audios = [dataset[i]["audio"]["array"] for i in range(batch_size)]

inputs = feature_extractor(audio=audios, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(
    model.device, model.dtype
)
print("Input waveform shape:", inputs["audio"].shape)
# Input waveform shape: torch.Size([2, 1, 94080])

# Encode and decode separately
encoder_output = model.encode(**inputs)
audio_codes = encoder_output.audio_codes
print("Audio codes shape:", audio_codes.shape)
# Audio codes shape: torch.Size([2, 1, 294])

audio_values = model.decode(audio_codes).audio_values
print("Audio values shape:", audio_values.shape)
# Audio values shape: torch.Size([2, 1, 94080])

# Equivalently, encode and decode in one step
model_output = model(**inputs)
audio_codes = model_output.audio_codes
audio_values = model_output.audio_values
```
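To listen to the reconstructions, the decoded batch can be written to WAV files. Below is a sketch using only the standard library's `wave` module; `audio_values` here is a synthetic placeholder with the same shape as the model's output (a real torch tensor would first need `.cpu().numpy()`).

```python
import numpy as np
import wave

sampling_rate = 16_000
# Placeholder standing in for the decoded output, shape [batch, 1, samples]
audio_values = np.zeros((2, 1, 94_080), dtype=np.float32)

for i, clip in enumerate(audio_values):
    # Convert float waveform in [-1, 1] to 16-bit PCM
    pcm = (np.clip(clip.squeeze(0), -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(f"decoded_{i}.wav", "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 16-bit samples
        f.setframerate(sampling_rate)
        f.writeframes(pcm.tobytes())
```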