---
license: apache-2.0
datasets:
- stapesai/ssi-speech-emotion-recognition
language:
- en
base_model:
- facebook/wav2vec2-base-960h
pipeline_tag: audio-classification
library_name: transformers
tags:
- emotion
- audio
- classification
- music
- facebook
---

![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/J4EOXVmr-bBtxeykZTWbc.png)

# Speech-Emotion-Classification

> **Speech-Emotion-Classification** is a fine-tuned version of `facebook/wav2vec2-base-960h` for **multi-class audio classification**, trained to detect **emotions** in speech. The model uses the `Wav2Vec2ForSequenceClassification` architecture to classify speaker emotion directly from raw audio signals.

> [!note]
> wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
> [https://arxiv.org/pdf/2006.11477](https://arxiv.org/pdf/2006.11477)
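
For a quick test, the model can also be run through the `transformers` audio-classification pipeline. This is a minimal sketch: `sample.wav` is a placeholder path, and the returned labels are the short codes from the checkpoint config (see the "Original Label Mapping" section below).

```python
from transformers import pipeline

# Load the fine-tuned checkpoint via the audio-classification pipeline
classifier = pipeline(
    "audio-classification",
    model="prithivMLmods/Speech-Emotion-Classification"
)

# "sample.wav" is a placeholder for any speech clip
print(classifier("sample.wav"))
# e.g. [{'label': 'HAP', 'score': 0.91}, {'label': 'NEU', 'score': 0.05}, ...]
```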


```py
Classification Report:

              precision    recall  f1-score   test_support

       Anger       0.8314    0.9346    0.8800       306
        Calm       0.7949    0.8857    0.8378        35
     Disgust       0.8261    0.8287    0.8274       321
        Fear       0.8303    0.7377    0.7812       305
       Happy       0.8929    0.7764    0.8306       322
     Neutral       0.8423    0.9303    0.8841       287
         Sad       0.7749    0.7825    0.7787       308
  Surprised       0.9478    0.9478    0.9478       115

    accuracy                           0.8379      1999
   macro avg       0.8426    0.8530    0.8460      1999
weighted avg       0.8392    0.8379    0.8367      1999
```

![download.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/oW8Qa6MO2koMOhRQgVd6a.png)

![download (1).png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/w_wC5gmrWhNlPYS_ftYSC.png)

---

## Label Space: 8 Classes

```
Class 0: Anger  
Class 1: Calm  
Class 2: Disgust  
Class 3: Fear  
Class 4: Happy  
Class 5: Neutral  
Class 6: Sad  
Class 7: Surprised
```

---

## Install Dependencies

```bash
pip install gradio transformers torch librosa hf_xet
```

---

## Inference Code

```python
import gradio as gr
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
import torch
import librosa

# Load model and processor
model_name = "prithivMLmods/Speech-Emotion-Classification"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)

# Label mapping
id2label = {
    "0": "Anger",
    "1": "Calm",
    "2": "Disgust",
    "3": "Fear",
    "4": "Happy",
    "5": "Neutral",
    "6": "Sad",
    "7": "Surprised"
}

def classify_audio(audio_path):
    # Load and resample audio to 16kHz
    speech, sample_rate = librosa.load(audio_path, sr=16000)

    # Process audio
    inputs = processor(
        speech,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()

    prediction = {
        id2label[str(i)]: round(probs[i], 3) for i in range(len(probs))
    }

    return prediction

# Gradio Interface
iface = gr.Interface(
    fn=classify_audio,
    inputs=gr.Audio(type="filepath", label="Upload Audio (WAV, MP3, etc.)"),
    outputs=gr.Label(num_top_classes=8, label="Emotion Classification"),
    title="Speech Emotion Classification",
    description="Upload an audio clip to classify the speaker's emotion from voice signals."
)

if __name__ == "__main__":
    iface.launch()
```
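
The `classify_audio` helper can also be called directly, without launching the Gradio UI (the path below is a placeholder):

```python
# Placeholder path to any speech clip; librosa resamples it to 16 kHz
scores = classify_audio("samples/angry_01.wav")
print(scores)
# e.g. {'Anger': 0.912, 'Calm': 0.004, ..., 'Surprised': 0.01}
```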

---

## Original Label Mapping

The checkpoint's `id2label` mapping uses short emotion codes rather than the readable names above:
```py
  "id2label": {
    "0": "ANG",
    "1": "CAL",
    "2": "DIS",
    "3": "FEA",
    "4": "HAP",
    "5": "NEU",
    "6": "SAD",
    "7": "SUR"
  },
```
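
When the model is loaded with `from_pretrained` (as in the inference code above), this mapping is available as `model.config.id2label`. A small remap to the readable names used in this card might look like the sketch below (the readable names are this card's convention, not part of the checkpoint config):

```python
# Map the checkpoint's short codes to the readable names used in this card
CODE_TO_NAME = {
    "ANG": "Anger",
    "CAL": "Calm",
    "DIS": "Disgust",
    "FEA": "Fear",
    "HAP": "Happy",
    "NEU": "Neutral",
    "SAD": "Sad",
    "SUR": "Surprised",
}

# model.config.id2label maps class indices to the short codes above
readable = {int(i): CODE_TO_NAME[code] for i, code in model.config.id2label.items()}
print(readable)  # {0: 'Anger', 1: 'Calm', ..., 7: 'Surprised'}
```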

--- 

## Intended Use

`Speech-Emotion-Classification` is designed for:

* **Speech Emotion Analytics** – Analyze speaker emotions in call centers, interviews, or therapeutic sessions.
* **Conversational AI Personalization** – Adjust voice assistant responses based on detected emotion.
* **Mental Health Monitoring** – Support emotion recognition in voice-based wellness or teletherapy apps.
* **Voice Dataset Curation** – Tag or filter speech datasets by emotion for research or model training (see the batch-tagging sketch after this list).
* **Media Annotation** – Automatically annotate podcasts, audiobooks, or videos with speaker emotion metadata.
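
As a concrete example of the dataset-curation use case, the `classify_audio` helper from the inference code can tag a folder of clips with their top predicted emotion (a minimal sketch; the folder and output file names are placeholders):

```python
import json
from pathlib import Path

# Tag every .wav file in a (placeholder) folder with its top predicted emotion
tags = {}
for wav_path in sorted(Path("speech_clips").glob("*.wav")):
    scores = classify_audio(str(wav_path))
    tags[wav_path.name] = max(scores, key=scores.get)

# Persist the tags for later filtering or analysis
with open("emotion_tags.json", "w") as f:
    json.dump(tags, f, indent=2)
```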