Update README.md
README.md CHANGED
@@ -13,9 +13,12 @@ library_name: espnet
pipeline_tag: automatic-speech-recognition
---

-
+ 🏆 **News:** Our [OWSM v4 paper](https://www.isca-archive.org/interspeech_2025/peng25c_interspeech.html) won the [Best Student Paper Award](https://isca-speech.org/ISCA-Awards) at INTERSPEECH 2025!

-
+
+ [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) is the first **fully open** Whisper-style speech foundation model.
+ It reproduces and advances OpenAI's Whisper-style training using publicly available data and open-source toolkits.
+ The code, pre-trained model weights, and training logs are publicly released to promote open science in speech foundation models.

Inference examples can be found on our [project page](https://www.wavlab.org/activities/2024/owsm/).
The Gradio demo is [here](https://huggingface.co/spaces/pyf98/OWSM_v3_demo).
@@ -24,9 +27,9 @@ The Gradio demo is [here](https://huggingface.co/spaces/pyf98/OWSM_v3_demo).
Additionally, OWSM v4 applies 8 times subsampling (instead of 4 times in OWSM v3.1) to the log Mel features, leading to a final resolution of 80 ms in the encoder.
When running inference, we recommend setting `maxlenratio=1.0` (default) instead of smaller values.

- This repo contains a
+ This repo contains a medium-sized model with 1B parameters, developed by [Yifan Peng](https://pyf98.github.io/) (CMU).
It is trained on 320k hours of public speech data.
- The newly curated data
+ The newly curated data are publicly released: https://huggingface.co/datasets/espnet/yodas_owsmv4

It supports the following speech-to-text tasks:
- Language identification
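
For reference, here is a minimal inference sketch matching the `maxlenratio=1.0` recommendation above. It uses the `Speech2Text` interface from ESPnet's `espnet2.bin.s2t_inference`, following the usage pattern of earlier OWSM model cards; the repo ID, device, and audio path below are illustrative assumptions, not taken from this commit.

```python
# Minimal OWSM v4 inference sketch.
# Assumptions: the repo ID, device, and audio file name are placeholders.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

model = Speech2Text.from_pretrained(
    "espnet/owsm_v4_medium_1B",  # hypothetical repo ID; substitute this model's actual ID
    device="cuda",               # or "cpu"
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=1.0,             # recommended default; avoid smaller values
    lang_sym="<eng>",            # language token
    task_sym="<asr>",            # task token (speech recognition)
)

# Decode a 16 kHz mono waveform; "speech.wav" is a placeholder file.
speech, rate = sf.read("speech.wav")
text, *_ = model(speech)[0]  # first tuple element is the best hypothesis text
print(text)
```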