---
license: mit
datasets:
- Bingsu/KSS_Dataset
- Bingsu/zeroth-korean
- neoALI/Handicap-Speech
language:
- ko
metrics:
- cer
base_model:
- openai/whisper-large-v3-turbo
pipeline_tag: automatic-speech-recognition
---

# Phoneme-based Speech Recognition Experiments

This section covers experiments with phoneme-level approaches to recognizing both common and handicapped speech using Whisper encoder representations.

---

## 4.4 Phoneme Classification Head

### Method

Simple classification heads were added on top of frozen Whisper encoder outputs for direct phoneme prediction, with architectures ranging from plain linear layers to BiLSTM networks. Phoneme targets were generated with the Montreal Forced Aligner (MFA).

### Results

| Architecture | Parameters | Test CER (%) | Training State |
|-------------|------------|--------------|----------------|
| 4 FC layers (frozen encoder) | 1.2M | 104.11 | Underfitting |
| **3 BiLSTM + 4 FC (frozen)** | **21M** | **39.77** | **Best** |
| 3 BiLSTM + 4 FC (trainable) | 656M | 99.93 | Overfitting |

**Key Finding**: The frozen encoder consistently outperforms the trainable encoder, reaching ~40% CER with the BiLSTM architecture.

---

## 4.5 Phoneme Decoder

### Method

A phoneme-aware decoder was trained with a custom tokenizer over subword phoneme tokens, keeping the pretrained decoder weights as initialization for stability.

### Results

| Version | Configuration | Test CER (%) | Notes |
|---------|--------------|--------------|-------|
| v10 | Baseline (frozen encoder) | 11.66 | Strong baseline |
| v13 | Encoder trainable | 78.95 | Catastrophic failure |
| **v17** | **Complex vowels + regularization** | **11.78** | **Optimal** |

**Key Finding**: Training the encoder causes severe degradation; a frozen encoder with proper regularization achieves ~12% CER.

---

## 4.6 Dual Decoder Network

### Method

A novel architecture with separate P-GPT (phoneme) and S-GPT (syllable) decoders attached to a frozen Whisper encoder. Various training strategies and architectural modifications were tested; a minimal sketch of the setup is shown below.
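The P-GPT/S-GPT internals are not spelled out in this card, so the following is only a minimal PyTorch sketch of the overall structure: a frozen `openai/whisper-large-v3-turbo` encoder feeding two independent lightweight decoders through cross-attention. The class name, layer counts, and vocabulary sizes are illustrative assumptions, and generic `nn.TransformerDecoder` stacks stand in for the GPT-style decoders.

```python
# Minimal sketch of a dual-decoder setup on a frozen Whisper encoder.
# Module names, layer counts, and vocabulary sizes are illustrative only.
import torch
import torch.nn as nn
from transformers import WhisperModel


class DualDecoderASR(nn.Module):
    def __init__(self, phoneme_vocab=80, syllable_vocab=2000, d_model=1280):
        super().__init__()
        # Frozen Whisper encoder provides the shared speech representation.
        self.encoder = WhisperModel.from_pretrained(
            "openai/whisper-large-v3-turbo"
        ).get_encoder()
        for p in self.encoder.parameters():
            p.requires_grad = False

        def make_decoder(vocab):
            layer = nn.TransformerDecoderLayer(
                d_model=d_model, nhead=8, batch_first=True
            )
            return nn.ModuleDict({
                "embed": nn.Embedding(vocab, d_model),
                "decoder": nn.TransformerDecoder(layer, num_layers=4),
                "head": nn.Linear(d_model, vocab),
            })

        # Two independent lightweight decoders (stand-ins for P-GPT / S-GPT).
        self.p_gpt = make_decoder(phoneme_vocab)   # phoneme stream
        self.s_gpt = make_decoder(syllable_vocab)  # syllable stream

    def _run(self, dec, tokens, memory):
        x = dec["embed"](tokens)
        length = tokens.size(1)
        # Causal mask so each position only attends to earlier tokens.
        causal = torch.triu(
            torch.full((length, length), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        x = dec["decoder"](x, memory, tgt_mask=causal)
        return dec["head"](x)

    def forward(self, input_features, phoneme_tokens, syllable_tokens):
        with torch.no_grad():  # encoder stays frozen
            memory = self.encoder(input_features).last_hidden_state
        return (
            self._run(self.p_gpt, phoneme_tokens, memory),
            self._run(self.s_gpt, syllable_tokens, memory),
        )
```

The two decoders share only the frozen encoder output, so phoneme and syllable losses can be weighted or staged independently, which is what the training-strategy comparisons below vary.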
### Results

#### Training Strategy Comparison

| Version | Configuration | Phoneme CER (%) | Syllable CER (%) |
|---------|--------------|-----------------|------------------|
| v4 | λ-weighted + alignment loss | 11.84 | 13.04 |
| v8 | Ground-truth text training | 3.82 | 4.36 |
| **v17** | **Optimized tokenization** | **2.52** | **2.79** |
| v23 | Top-4 encoder layers trainable | 6.04 | 70.37 |

#### Architecture Ablations

| Component | Phoneme CER (%) | Impact |
|-----------|-----------------|--------|
| **Baseline (32 layers)** | **2.52** | **Reference** |
| Simple embedding replacement | 80.13 | Catastrophic |
| 24 layers + pretrained | 93.20 | Severe degradation |
| 24 layers from scratch | 8.81 | Acceptable |

#### Multi-stage Training

| Stage | Strategy | Phoneme CER (%) | Syllable CER (%) |
|-------|----------|-----------------|------------------|
| End-to-end | Baseline | 2.52 | 2.79 |
| **Multi-stage** | **Sequential training** | **1.96** | **2.02** |

**Improvement**: 22% relative phoneme and 28% relative syllable error reduction.

#### Time Normalization Algorithm

| Version | Amplification Factor | Normal Spec CER (%) |
|--------------------------|----------------------|---------------------|
| Original + noise augment | - | 9.86 |
| Original | - | 11.51 |
| 1 | 2 | 9.93 |
| Random | Random | 9.97 |
| 8 (updated algorithm) | 2 | 12.77 |
| 9 (v8 + random noise) | 2 | 15.33 |
| 10 | Random | 14.07 |

**Improvement**: For the handicapped speaker, CER drops from 11.51% (original) to 9.93% with time normalization.

---

## Key Insights

### 🔒 **Frozen Encoder Principle**

All experiments confirm that a **frozen Whisper encoder** dramatically outperforms a trainable encoder across all architectures:

- Classification head: 39.77% vs 99.93% CER
- Phoneme decoder: 11.78% vs 78.95% CER
- Dual decoder: 2.52% vs 70.37% CER

### 🏆 **Best Performance**

The **dual decoder with multi-stage training** achieves:

- **1.96% phoneme CER**
- **2.02% syllable CER**
- Represents state-of-the-art for phoneme-level handicapped speech recognition

### ⚠️ **Critical Dependencies**

- **Full architecture required**: removing encoder layers causes severe degradation
- **Pretrained weights essential**: simple embeddings cannot replace the transformer encoder
- **Text-based training**: ground-truth text targets outperform phoneme conversions

### 📊 **Performance Hierarchy**

1. **Dual Decoder** (1.96% CER) - best overall
2. **Phoneme Decoder** (11.78% CER) - good balance
3. **Classification Head** (39.77% CER) - simplest approach

---

## Technical Notes

- Cross-fold validation shows high variance (39-67% CER), indicating strong speaker dependency
- Attention mechanisms cause training instability in the classification tasks
- Residual connections are crucial for expert-based architectures
- Proper tokenization and label consistency are critical for CTC training (see the sketch below)
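To make the last note concrete, here is a minimal PyTorch sketch of a consistent CTC setup: the blank index given to the loss must match the index the tokenizer reserves, and target sequences must never contain it. The blank index, vocabulary size, and tensor shapes below are illustrative assumptions, not this project's actual configuration.

```python
# Illustrative CTC setup: the blank index used by the loss must match the
# index the tokenizer reserves, and targets must never contain the blank.
import torch
import torch.nn as nn

BLANK_ID = 0        # assumed blank index (must match the tokenizer)
VOCAB_SIZE = 80     # assumed phoneme vocabulary size

ctc_loss = nn.CTCLoss(blank=BLANK_ID, zero_infinity=True)

batch, frames, target_len = 4, 1500, 40
log_probs = torch.randn(frames, batch, VOCAB_SIZE).log_softmax(dim=-1)

# Phoneme targets use IDs 1..VOCAB_SIZE-1 only, so they never collide with blank.
targets = torch.randint(1, VOCAB_SIZE, (batch, target_len))
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), target_len, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```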
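All scores in this card are character error rates (CER). For reference, CER can be computed with the Hugging Face `evaluate` library as in the sketch below; the example strings are placeholders, not model outputs.

```python
# Character error rate (CER), the metric reported throughout this card.
import evaluate

cer_metric = evaluate.load("cer")
predictions = ["안녕하세요 반갑습니다"]   # placeholder hypothesis
references = ["안녕하세요, 반갑습니다"]   # placeholder reference
print(cer_metric.compute(predictions=predictions, references=references))
```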