Commit da36d7e by mahmoudmamdouh13 · verified · 1 Parent(s): b94e3e8

End of training

  1. README.md +45 -56
README.md CHANGED
@@ -1,7 +1,7 @@
  ---
  library_name: transformers
  license: bsd-3-clause
- base_model: MIT/ast-finetuned-speech-commands-v2
  tags:
  - generated_from_trainer
  datasets:
@@ -11,7 +11,7 @@ metrics:
  - recall
  - f1
  model-index:
- - name: ast-finetuned-speech-commands-v2-finetuned-keyword-spotting-finetuned-keyword-spotting
  results:
  - task:
  name: Audio Classification
@@ -25,79 +25,68 @@ model-index:
  metrics:
  - name: Precision
  type: precision
- value: 0.9861935383961439
  - name: Recall
  type: recall
- value: 0.9861649413727126
  - name: F1
  type: f1
- value: 0.9861100898918743
  ---

- # Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands

- ## Model Details
- - **Model name:** `ast-mlcommons-speech-commands`
- - **Architecture:** Audio Spectrogram Transformer (AST)
- - **Base pre-trained checkpoint:** MIT AST fine-tuned on Google Speech Commands v0.02
- - **Fine-tuning dataset:** Custom dataset drawn from MLCommons Multilingual Spoken Words corpus, augmented with `_silence_` and `_unknown_` categories sampled from Google Speech Commands v0.02
- - **License:** bsd-3-clause

- ## Model Inputs and Outputs
- - **Input:** 16 kHz mono audio, 1-second clips (or padded/truncated to 1 sec), converted to log-mel spectrograms with 128 mel bins and 10 ms hop length
- - **Output:** Softmax over 80 classes (indices 0–79). Classes mapping:
- ```json
- {
- "0": "_silence_",
- "1": "_unknown_",
- "2": "air",
- // ... 3–9 omitted for brevity ...
- "9": "cake",
- "10": "car",
- // ... up to 79: "zoo"
- }

- ## Training Data

- * Total samples: \~145,005 utterances
- * **Sources:**

- * MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
- * Google Speech Commands v0.02 for silence and unknown categories
- * **Preprocessing:**

- * Resampling to 16 kHz
- * Fixed-length one-second windows with zero-padding or cropping

- ## Evaluation Results

- | Metric | Value |
- | --------- | ------ |
- | Loss | 0.0685 |
- | Precision | 0.9862 |
- | Recall | 0.9862 |
- | F1-score | 0.9861 |

- ## Intended Uses and Limitations

- * **Suitable for:**

- * Real-time keyword spotting on-device
- * Low-latency voice command detection in noisy environments
- * **Limitations:**

- * May misclassify under unseen noise conditions or heavy accents
- * `_unknown_` class may not cover all out-of-vocabulary words; false positives possible
- * Performance may degrade on dialects or languages underrepresented in training

- ## Citation

- ```bibtex
- @inproceedings{gong2021ast,
- title={AST: Audio Spectrogram Transformer},
- author={Gong, Yuan and Chung, Yu-An and Glass, James},
- booktitle={Interspeech},
- year={2021}
- }

  ---
  library_name: transformers
  license: bsd-3-clause
+ base_model: MIT/ast-finetuned-audioset-12-12-0.447
  tags:
  - generated_from_trainer
  datasets:

  - recall
  - f1
  model-index:
+ - name: ast-mlcommons-speech-commands
  results:
  - task:
  name: Audio Classification

  metrics:
  - name: Precision
  type: precision
+ value: 0.9661601051155746
  - name: Recall
  type: recall
+ value: 0.9662664379645511
  - name: F1
  type: f1
+ value: 0.9661541075893276
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # ast-mlcommons-speech-commands

+ This model is a fine-tuned version of [MIT/ast-finetuned-audioset-12-12-0.447](https://huggingface.co/MIT/ast-finetuned-audioset-12-12-0.447) on the audiofolder dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.1790
+ - Precision: 0.9662
+ - Recall: 0.9663
+ - F1: 0.9662

+ ## Model description

+ More information needed

+ ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training:
+ - learning_rate: 5e-05
+ - train_batch_size: 32
+ - eval_batch_size: 32
+ - seed: 42
+ - optimizer: adamw_torch with betas=(0.9,0.999), epsilon=1e-08, and no additional optimizer arguments
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 5
+ - mixed_precision_training: Native AMP
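
The `lr_scheduler_type: linear` and `lr_scheduler_warmup_ratio: 0.1` settings describe a learning rate that ramps up linearly to `learning_rate` and then decays linearly to zero. The sketch below illustrates that shape only. It assumes the 17480 final step from the training results table is the total step count and that warmup covers 10% of it; `linear_lr` is a hypothetical helper that follows the usual linear-with-warmup formula, not necessarily the exact Trainer internals.

```python
PEAK_LR = 5e-05       # learning_rate from the hyperparameter list
TOTAL_STEPS = 17480   # assumption: final Step value in the training results table
WARMUP_RATIO = 0.1    # lr_scheduler_warmup_ratio

# Assumption: the warmup length is the ratio applied to the total step count.
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        factor = step / WARMUP_STEPS
    else:
        # Linear decay from the peak down to 0 at the last step.
        remaining = max(0, TOTAL_STEPS - step)
        factor = remaining / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * factor


if __name__ == "__main__":
    for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS):
        print(f"step {step:>5}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the peak rate is reached at step 1748, about halfway through the first epoch.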

+ ### Training results

+ | Training Loss | Epoch | Step | F1 | Validation Loss | Precision | Recall |
+ |:-------------:|:-----:|:-----:|:------:|:---------------:|:---------:|:------:|
+ | 0.0795 | 1.0 | 3496 | 0.9342 | 0.2169 | 0.9357 | 0.9347 |
+ | 0.1295 | 2.0 | 6992 | 0.9467 | 0.1728 | 0.9486 | 0.9473 |
+ | 0.0279 | 3.0 | 10488 | 0.9551 | 0.1717 | 0.9558 | 0.9556 |
+ | 0.0029 | 4.0 | 13984 | 0.9621 | 0.1733 | 0.9624 | 0.9621 |
+ | 0.0023 | 5.0 | 17480 | 0.9662 | 0.1790 | 0.9662 | 0.9663 |
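
The Step column also bounds the size of the training split. The sketch below works through that arithmetic. It assumes a single device, no gradient accumulation, and a possibly partial last batch; the card states none of these, so the result is an estimate.

```python
TRAIN_BATCH_SIZE = 32  # train_batch_size from the hyperparameter list
STEPS_AT_EPOCH_END = [3496, 6992, 10488, 13984, 17480]  # Step column of the table

steps_per_epoch = STEPS_AT_EPOCH_END[0]

# Every epoch should advance the step counter by the same amount.
uniform = all(
    step == steps_per_epoch * (epoch + 1)
    for epoch, step in enumerate(STEPS_AT_EPOCH_END)
)

# Assumption: one optimizer step consumes one batch, and only the last
# batch of an epoch may be smaller than TRAIN_BATCH_SIZE.
max_train_samples = steps_per_epoch * TRAIN_BATCH_SIZE
min_train_samples = (steps_per_epoch - 1) * TRAIN_BATCH_SIZE + 1

if __name__ == "__main__":
    print("uniform epochs:", uniform)
    print("training samples between", min_train_samples, "and", max_train_samples)
```

With multiple devices or gradient accumulation, the per-step sample count (and so the estimate) would scale accordingly.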

+ ### Framework versions

+ - Transformers 4.51.3
+ - PyTorch 2.7.0+cu128
+ - Datasets 3.6.0
+ - Tokenizers 0.21.1