# 🧠 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper
This repository hosts SE-DiCoW, the state-of-the-art Target-Speaker ASR model developed by BUT Speech@FIT in collaboration with JHU CLSP/HLTCOE and CMU LTI.
## 🔧 Key Innovations
- 🔍 Self-Enrollment (SE): Automatically selects the most informative segment of the target speaker from the conversation and integrates it via cross-attention. This solves the ambiguity problem in fully overlapped regions.
- ⚡ Improved Conditioning: Introduces FDDT (Frame-Level Diarization Dependent Transformation) layers before the positional embeddings for better signal modulation (see the conceptual sketch after this list).
- 📉 Reduced Error: Achieves a ~75% relative reduction in tcpWER on Libri3Mix compared to DiCoW v1.
- 🛠️ Training Stability: Uses less suppressive initialization and flexible data segmentation (no forced end-timestamps).
- 🔄 Robustness: Trained with STNO noise injection and SpecAugment to handle imperfect diarization.
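To make the conditioning idea concrete, the snippet below is a minimal, hypothetical sketch of FDDT-style frame-level conditioning: each encoder frame is modulated by affine transforms weighted by its STNO (Silence / Target / Non-target / Overlap) probabilities. The class order, layer structure, and dimensions here are illustrative assumptions, not the exact layers shipped with this checkpoint (those come with the model's remote code).

```python
# Conceptual sketch of FDDT-style frame-level conditioning (illustration only,
# NOT the exact implementation used by SE-DiCoW).
import torch
import torch.nn as nn


class FrameLevelDiarizationConditioning(nn.Module):
    """Hypothetical FDDT-like layer: one affine transform per STNO class."""

    def __init__(self, hidden_dim: int, num_classes: int = 4):
        super().__init__()
        # One linear transform per STNO class (Silence, Target, Non-target, Overlap).
        self.transforms = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_classes)]
        )

    def forward(self, hidden: torch.Tensor, stno_probs: torch.Tensor) -> torch.Tensor:
        # hidden:     (batch, frames, hidden_dim) encoder features
        # stno_probs: (batch, frames, num_classes) per-frame class probabilities
        out = torch.zeros_like(hidden)
        for c, transform in enumerate(self.transforms):
            # Weight each class-specific transform by its per-frame probability.
            out = out + stno_probs[..., c:c + 1] * transform(hidden)
        return out


# Toy usage: 1 utterance, 100 frames, 1280-dim features (Whisper large hidden size).
layer = FrameLevelDiarizationConditioning(hidden_dim=1280)
features = torch.randn(1, 100, 1280)
stno = torch.softmax(torch.randn(1, 100, 4), dim=-1)
conditioned = layer(features, stno)
print(conditioned.shape)  # torch.Size([1, 100, 1280])
```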
## ⚡ Quick Usage
### 1. Run the Interactive Demo (Gradio)
The easiest way to use this model is via the DiCoW inference repository. We provide a Gradio app that handles diarization, self-enrollment selection, and mask generation automatically:
```bash
git clone https://github.com/BUTSpeechFIT/DiCoW
cd DiCoW
python app.py
```
### 2. Load in Python
If you want to load the model manually (e.g., for custom scripts):
```python
from transformers import AutoModelForSpeechSeq2Seq

# 1. Load the model (requires remote code for the custom Self-Enrollment layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/SE-DiCoW",
    trust_remote_code=True,
)

# Note: This model requires specific conditioning (STNO masks + enrollment audio).
# It cannot be run with standard Whisper pipelines.
# See the inference code in the GitHub repo for details.
```
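Because the exact input format is defined by the model's remote code, the following is only a hedged sketch of how a frame-level STNO mask could be assembled from diarization output. The 50 Hz frame rate, the class order, and the `stno_mask` helper are assumptions for illustration, not the repository's API; refer to the DiCoW inference code for the real pipeline.

```python
# Hypothetical helper: build a frame-level STNO (Silence/Target/Non-target/Overlap)
# mask from diarization segments. Frame rate and class order are assumptions.
import numpy as np

FRAME_RATE = 50  # frames per second (assumed 20 ms hop)


def stno_mask(segments, target_spk, duration):
    """segments: list of (speaker_id, start_sec, end_sec) from a diarizer."""
    n_frames = int(round(duration * FRAME_RATE))
    target = np.zeros(n_frames, dtype=bool)
    others = np.zeros(n_frames, dtype=bool)
    for spk, start, end in segments:
        frames = slice(int(start * FRAME_RATE), int(end * FRAME_RATE))
        if spk == target_spk:
            target[frames] = True
        else:
            others[frames] = True
    # One-hot STNO classes per frame.
    mask = np.zeros((n_frames, 4), dtype=np.float32)
    mask[:, 0] = ~target & ~others   # Silence
    mask[:, 1] = target & ~others    # Target only
    mask[:, 2] = ~target & others    # Non-target only
    mask[:, 3] = target & others     # Overlap
    return mask


# Toy example: two partially overlapping speakers in a 6-second recording.
segments = [("spk0", 0.0, 3.2), ("spk1", 2.5, 6.0)]
mask = stno_mask(segments, target_spk="spk0", duration=6.0)
print(mask.shape)  # (300, 4)
```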
## 🧬 Reproducibility & Training
This model is fully open-source and can be easily reproduced using our toolkit.
### 1. Data Preparation
Clone the mt-asr-data-prep repository and run the setup script:
```bash
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```
### 2. Training
Clone the training repository TS-ASR-Whisper and launch the experiment using the se_dicow recipe:
```bash
# Run this from the root of the TS-ASR-Whisper repository
sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=se_dicow
```
## 🏆 Performance Snapshot (tcpWER)
Metric: time-constrained minimum-permutation WER (tcpWER, 5 s collar), using DiariZen diarization.
| Dataset | DiCoW v1 (Baseline) | SE-DiCoW (This Model) |
|---|---|---|
| Libri2Mix (Both) | 21.6% | 9.7% |
| LibriSpeechMix (2) | 17.9% | 3.1% |
| AMI (SDM) | 21.4% | 18.5% |
| NOTSOFAR-1 (Small-SC) | 29.8% | 26.2% |
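For reference, the relative improvements implied by the table work out as follows; this is plain arithmetic over the numbers above, not additional benchmark results:

```python
# Relative tcpWER reduction of SE-DiCoW over the DiCoW v1 baseline,
# computed from the table above.
results = {
    "Libri2Mix (Both)": (21.6, 9.7),
    "LibriSpeechMix (2)": (17.9, 3.1),
    "AMI (SDM)": (21.4, 18.5),
    "NOTSOFAR-1 (Small-SC)": (29.8, 26.2),
}
for dataset, (baseline, se_dicow) in results.items():
    reduction = 100 * (baseline - se_dicow) / baseline
    print(f"{dataset}: {reduction:.1f}% relative reduction")
```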
## 📦 Model Details
- Base Model: Whisper large-v3-turbo
- Training Datasets: NOTSOFAR-1, AMI, LibriMix (2/3 spk), Synthetic LibriSpeech.
- Mechanism: Diarization-Conditioned + Self-Enrollment Cross-Attention.
## 📚 Citations
If you use this model, please cite our ICASSP 2026 and CS&L 2026 papers:
```bibtex
@INPROCEEDINGS{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  year={2026},
}

@article{POLOK2026101841,
  author={Polok, Alexander and others},
  title={DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal={Computer Speech & Language},
  volume={95},
  year={2026},
  doi={10.1016/j.csl.2025.101841},
}
```
## 📬 Contact
- Issues: GitHub Issues
- Email: ipoloka@fit.vut.cz