mahmoudmamdouh13/ast-mlcommons-speech-commands

@@ -1,7 +1,7 @@
 ---
 library_name: transformers
 license: bsd-3-clause
-base_model: MIT/ast-finetuned-speech-commands-v2
 tags:
 - generated_from_trainer
 datasets:
@@ -11,7 +11,7 @@ metrics:
 - recall
 - f1
 model-index:
-- name: ast-finetuned-speech-commands-v2-finetuned-keyword-spotting-finetuned-keyword-spotting
   results:
   - task:
       name: Audio Classification
@@ -25,79 +25,68 @@ model-index:
     metrics:
     - name: Precision
       type: precision
-      value: 0.9861935383961439
     - name: Recall
       type: recall
-      value: 0.9861649413727126
     - name: F1
       type: f1
-      value: 0.9861100898918743
 ---
-# Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands
-## Model Details
-- **Model name:** `ast-mlcommons-speech-commands`
-- **Architecture:** Audio Spectrogram Transformer (AST)
-- **Base pre-trained checkpoint:** MIT AST fine-tuned on Google Speech Commands v0.02
-- **Fine-tuning dataset:** Custom dataset drawn from MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02
-- **License:** bsd-3-clause
-## Model Inputs and Outputs
-- **Input:** 16 kHz mono audio, 1-second clips (or padded/truncated to 1 sec), converted to log-mel spectrograms with 128 mel bins and 10 ms hop length
-- **Output:** Softmax over 80 classes (indices 0–79). Classes mapping:
-  ```json
-  {
-    "0": "_silence_",
-    "1": "_unknown_",
-    "2": "air",
-    // ... 3–8 omitted for brevity ...
-    "9": "cake",
-    "10": "car",
-    // ... up to 79: "zoo"
-  }
-## Training Data
-* Total samples: \~145,005 utterances
-* **Sources:**
-  * MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
-  * Google Speech Commands v0.02 for silence and unknown categories
-* **Preprocessing:**
-  * Resampling to 16 kHz
-  * Fixed-length one-second windows with zero-padding or cropping
-## Evaluation Results
-| Metric    | Value  |
-| --------- | ------ |
-| Loss      | 0.0685 |
-| Precision | 0.9862 |
-| Recall    | 0.9862 |
-| F1-score  | 0.9861 |
-## Intended Uses and Limitations
-* **Suitable for:**
-  * Real-time keyword spotting on-device
-  * Low-latency voice command detection in noisy environments
-* **Limitations:**
-  * May misclassify under unseen noise conditions or heavy accents
-  * `_unknown_` class may not cover all out-of-vocabulary words; false positives possible
-  * Performance may degrade on dialects or languages underrepresented in training
-## Citation
-```bibtex
-@inproceedings{gong2021ast,
-  title={AST: Audio Spectrogram Transformer},
-  author={Gong, Yuan and Chung, Yu-An and Glass, James},
-  booktitle={Interspeech},
-  year={2021}
-}

 ---
 library_name: transformers
 license: bsd-3-clause
+base_model: MIT/ast-finetuned-audioset-12-12-0.447
 tags:
 - generated_from_trainer
 datasets:
 - recall
 - f1
 model-index:
+- name: ast-mlcommons-speech-commands
   results:
   - task:
       name: Audio Classification
     metrics:
     - name: Precision
       type: precision
+      value: 0.9661601051155746
     - name: Recall
       type: recall
+      value: 0.9662664379645511
     - name: F1
       type: f1
+      value: 0.9661541075893276
 ---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# ast-mlcommons-speech-commands
+This model is a fine-tuned version of [MIT/ast-finetuned-audioset-12-12-0.447](https://huggingface.co/MIT/ast-finetuned-audioset-12-12-0.447) on the audiofolder dataset.
+It achieves the following results on the evaluation set:
+- Loss: 0.1790
+- Precision: 0.9662
+- Recall: 0.9663
+- F1: 0.9662
+## Model description
+More information needed
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 5e-05
+- train_batch_size: 32
+- eval_batch_size: 32
+- seed: 42
+- optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 5
+- mixed_precision_training: Native AMP
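The learning-rate schedule implied by these hyperparameters can be sketched in plain Python. This is a minimal illustration, not the Trainer's own code. It assumes 17480 total optimizer steps (the final step in the training results), a linear warmup from 0 to the peak rate followed by a linear decay to 0, and `round` for the warmup step count, which may differ from the Trainer's rounding by one step. The names `lr_at_step`, `PEAK_LR`, `WARMUP_STEPS`, and `TOTAL_STEPS` are hypothetical.

```python
# Hypothetical sketch of a linear warmup + linear decay schedule.
PEAK_LR = 5e-05        # learning_rate from the card
WARMUP_RATIO = 0.1     # lr_scheduler_warmup_ratio from the card
TOTAL_STEPS = 17480    # assumption: last step listed in the training results

# Assumption: warmup length is the rounded fraction of total steps.
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)


def lr_at_step(step: int) -> float:
    """Learning rate after `step` optimizer steps under the assumed schedule."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak rate.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Linear decay from the peak rate down to 0 at the final step.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    for step in (0, WARMUP_STEPS, 3496, TOTAL_STEPS):
        print(step, lr_at_step(step))
```

Under these assumptions the rate peaks at the end of warmup, partway through the first epoch, and reaches zero at the last step.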
+### Training results
+| Training Loss | Epoch | Step  | F1     | Validation Loss | Precision | Recall |
+|:-------------:|:-----:|:-----:|:------:|:---------------:|:---------:|:------:|
+| 0.0795        | 1.0   | 3496  | 0.9342 | 0.2169          | 0.9357    | 0.9347 |
+| 0.1295        | 2.0   | 6992  | 0.9467 | 0.1728          | 0.9486    | 0.9473 |
+| 0.0279        | 3.0   | 10488 | 0.9551 | 0.1717          | 0.9558    | 0.9556 |
+| 0.0029        | 4.0   | 13984 | 0.9621 | 0.1733          | 0.9624    | 0.9621 |
+| 0.0023        | 5.0   | 17480 | 0.9662 | 0.1790          | 0.9662    | 0.9663 |
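One way to read these results is to rank the epochs by each metric. The sketch below hard-codes the per-epoch F1 and validation loss. For epoch 5 it uses the final evaluation figures reported at the top of the card (loss 0.1790, F1 0.9662). `EPOCHS` and `best_epoch` are hypothetical names used only for illustration.

```python
# Each row is (epoch, f1, validation_loss), copied from the card.
EPOCHS = [
    (1, 0.9342, 0.2169),
    (2, 0.9467, 0.1728),
    (3, 0.9551, 0.1717),
    (4, 0.9621, 0.1733),
    (5, 0.9662, 0.1790),  # final evaluation summary values
]


def best_epoch(rows, column, higher_is_better):
    """Return the epoch number whose value in `column` is best."""
    pick = max if higher_is_better else min
    return pick(rows, key=lambda row: row[column])[0]


best_f1 = best_epoch(EPOCHS, column=1, higher_is_better=True)
best_loss = best_epoch(EPOCHS, column=2, higher_is_better=False)

if __name__ == "__main__":
    print("best F1 epoch:", best_f1)
    print("lowest validation loss epoch:", best_loss)
```

With these numbers F1 keeps improving through the last epoch while validation loss is lowest earlier, so which checkpoint counts as best depends on the selection metric.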
+### Framework versions
+- Transformers 4.51.3
+- Pytorch 2.7.0+cu128
+- Datasets 3.6.0
+- Tokenizers 0.21.1