---
license: apache-2.0
datasets:
- stapesai/ssi-speech-emotion-recognition
language:
- en
base_model:
- facebook/wav2vec2-base-960h
pipeline_tag: audio-classification
library_name: transformers
tags:
- emotion
- audio
- classification
- music
- facebook
---

# Speech-Emotion-Classification
> **Speech-Emotion-Classification** is a fine-tuned version of `facebook/wav2vec2-base-960h` for **multi-class audio classification**, specifically trained to detect **emotions** in speech. This model utilizes the `Wav2Vec2ForSequenceClassification` architecture to accurately classify speaker emotions from audio signals.
> [!note]
> wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
> [https://arxiv.org/pdf/2006.11477](https://arxiv.org/pdf/2006.11477)
```
Classification Report:

              precision    recall  f1-score   support

       Anger     0.8314    0.9346    0.8800       306
        Calm     0.7949    0.8857    0.8378        35
     Disgust     0.8261    0.8287    0.8274       321
        Fear     0.8303    0.7377    0.7812       305
       Happy     0.8929    0.7764    0.8306       322
     Neutral     0.8423    0.9303    0.8841       287
         Sad     0.7749    0.7825    0.7787       308
   Surprised     0.9478    0.9478    0.9478       115

    accuracy                         0.8379      1999
   macro avg     0.8426    0.8530    0.8460      1999
weighted avg     0.8392    0.8379    0.8367      1999
```
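A report in this format can be reproduced with scikit-learn's `classification_report`. A minimal sketch, where `y_true` and `y_pred` are placeholder integer labels standing in for your own evaluation outputs:

```python
from sklearn.metrics import classification_report

# Dummy stand-ins: replace with labels collected from your own evaluation run.
y_true = [0, 1, 2, 3, 4, 5, 6, 7]
y_pred = [0, 1, 2, 3, 4, 5, 6, 7]

label_names = ["Anger", "Calm", "Disgust", "Fear",
               "Happy", "Neutral", "Sad", "Surprised"]
print(classification_report(y_true, y_pred, target_names=label_names, digits=4))
```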


---
## Label Space: 8 Classes
```
Class 0: Anger
Class 1: Calm
Class 2: Disgust
Class 3: Fear
Class 4: Happy
Class 5: Neutral
Class 6: Sad
Class 7: Surprised
```
---
## Install Dependencies
```bash
pip install gradio transformers torch librosa hf_xet
```
---
## Inference Code
```python
import gradio as gr
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
import torch
import librosa

# Load model and feature extractor
model_name = "prithivMLmods/Speech-Emotion-Classification"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)

# Label mapping
id2label = {
    "0": "Anger",
    "1": "Calm",
    "2": "Disgust",
    "3": "Fear",
    "4": "Happy",
    "5": "Neutral",
    "6": "Sad",
    "7": "Surprised"
}

def classify_audio(audio_path):
    # Load and resample audio to 16 kHz, as expected by wav2vec2
    speech, sample_rate = librosa.load(audio_path, sr=16000)

    # Extract input features
    inputs = processor(
        speech,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding=True
    )

    # Forward pass without gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()

    # Map class probabilities to readable label names
    prediction = {
        id2label[str(i)]: round(probs[i], 3) for i in range(len(probs))
    }
    return prediction

# Gradio Interface
iface = gr.Interface(
    fn=classify_audio,
    inputs=gr.Audio(type="filepath", label="Upload Audio (WAV, MP3, etc.)"),
    outputs=gr.Label(num_top_classes=8, label="Emotion Classification"),
    title="Speech Emotion Classification",
    description="Upload an audio clip to classify the speaker's emotion from voice signals."
)

if __name__ == "__main__":
    iface.launch()
```
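For scripted use without the Gradio UI, a minimal single-file sketch (the `sample.wav` path is a placeholder):

```python
import torch
import librosa
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

model_name = "prithivMLmods/Speech-Emotion-Classification"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model.eval()

# "sample.wav" is a placeholder; any audio file librosa can read works.
speech, _ = librosa.load("sample.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

pred_id = int(logits.argmax(dim=-1))
print(model.config.id2label[pred_id])  # prints the stored short code, e.g. "ANG"
```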
---
## Original Label Mapping
The checkpoint's `config.json` stores abbreviated label codes:
```json
"id2label": {
"0": "ANG",
"1": "CAL",
"2": "DIS",
"3": "FEA",
"4": "HAP",
"5": "NEU",
"6": "SAD",
"7": "SUR"
},
```
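To expand these short codes at load time, a minimal sketch (the `code2name` mapping below is inferred from the label space above):

```python
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "prithivMLmods/Speech-Emotion-Classification"
)

# Map the stored short codes to the readable names used in this card.
code2name = {
    "ANG": "Anger", "CAL": "Calm", "DIS": "Disgust", "FEA": "Fear",
    "HAP": "Happy", "NEU": "Neutral", "SAD": "Sad", "SUR": "Surprised",
}

# model.config.id2label has integer keys after loading.
id2label = {i: code2name[code] for i, code in model.config.id2label.items()}
print(id2label)  # {0: 'Anger', 1: 'Calm', ...}
```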
---
## Intended Use
`Speech-Emotion-Classification` is designed for:
* **Speech Emotion Analytics** – Analyze speaker emotions in call centers, interviews, or therapeutic sessions.
* **Conversational AI Personalization** – Adjust voice assistant responses based on detected emotion.
* **Mental Health Monitoring** – Support emotion recognition in voice-based wellness or teletherapy apps.
* **Voice Dataset Curation** – Tag or filter speech datasets by emotion for research or model training (see the batch-tagging sketch after this list).
* **Media Annotation** – Automatically annotate podcasts, audiobooks, or videos with speaker emotion metadata.
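
A minimal batch-tagging sketch for the dataset-curation use case, assuming a local `clips/` directory of audio files (the directory name and `tag_clip` helper are illustrative):

```python
import os
import torch
import librosa
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

model_name = "prithivMLmods/Speech-Emotion-Classification"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model.eval()

def tag_clip(path):
    """Return the predicted emotion short code for one audio file."""
    speech, _ = librosa.load(path, sr=16000)
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))]

# "clips/" is a placeholder directory of WAV/MP3 files.
for fname in sorted(os.listdir("clips")):
    if fname.lower().endswith((".wav", ".mp3")):
        print(fname, tag_clip(os.path.join("clips", fname)))
```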