---
license: mit
language:
- et
base_model:
- openai/whisper-large-v3-turbo
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

## Introduction

This model is OpenAI Whisper large-v3-turbo, finetuned on 1400 hours of audio with manually created verbatim transcriptions from the [TalTech Estonian Speech Dataset 1.0](https://cs.taltech.ee/staff/tanel.alumae/data/est-pub-asr-data/).

## Usage

It is a finetuned version of Whisper large-v3-turbo and can therefore be used via Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Accelerate to reduce the model loading time:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate
```

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio of arbitrary length:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "TalTechNLP/whisper-large-v3-turbo-et-verbatim"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a local audio file, forcing Estonian transcription
audio = "demo/etteütlus2024.wav"
result = pipe(audio, generate_kwargs={"task": "transcribe", "language": "et"})
print(result)
```

There is also a `ct2` (CTranslate2) version of the model that can be used with tools based on `faster-whisper`, for example the `whisper-ctranslate2` command-line program:

```
$ whisper-ctranslate2 --model_directory ct2 --language et --vad_filter True --threads 8 --output_dir demo demo/etteütlus2024.wav
Detected language 'Estonian' with probability 1.000000
[00:00.620 --> 00:08.820] Kas pole teps mitte kihvt, et Haridus- ja Teadusministeerium paikneb Tartus Munga tänaval?
[00:08.820 --> 00:23.420] Seal ülikooli peahoonest mõne kukesammu kaugusel tuleb pedagoogikaalased otsused langetada kevisse raiutud imposantsete kultuuriheeroste märksa pilgu all.
[00:23.420 --> 00:32.680] Peeter Põllu esimese haridusministri rühikas selg tuletab meelde koolmeistrite määravat osatähtsust ühiskonnas.
[00:32.680 --> 00:45.140] Ning üksi silmi teineteist jälgivad Kreutzwald ja Kalevipoeg kõrvu Oskar Lutsuliku kaine literaadi pilguga ei lase unustada Eesti vaimuilma alusväärtusi.
[00:45.140 --> 00:52.640] Vahest peaks valitsusegi Stenbocki majast rahvusülikooli akadeemilisse mõju välja kupattama.
[00:52.640 --> 01:05.860] Nii oleks võimukandjatel ehk mahti ilmavaate turgutamiseks linnaraamatukogust kübekene tarkust nõutada või Tartu Kunstimuuseumis kultustaieseid nautida.
[01:05.860 --> 01:17.500] Too piisatorni sarnane majamürakas võib tekitada muidugi äraspidise tunde, et Emajõe ja Ateenas on alalõpmata midagi viltu.
Transcription results written to 'demo' directory
```
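The `ct2` weights can also be loaded directly from Python with the `faster-whisper` library. The snippet below is a minimal sketch, assuming the CTranslate2 model files have been downloaded into a local `ct2/` directory (as in the command above) and that a CUDA GPU is available; adjust `device` and `compute_type` for CPU-only use:

```python
from faster_whisper import WhisperModel

# Assumes the CTranslate2 model files are in a local "ct2/" directory,
# as in the whisper-ctranslate2 example above.
model = WhisperModel("ct2", device="cuda", compute_type="float16")

# Same settings as the CLI example: Estonian language, VAD filtering
segments, info = model.transcribe(
    "demo/etteütlus2024.wav",
    language="et",
    vad_filter=True,
)

for segment in segments:
    print(f"[{segment.start:.2f} --> {segment.end:.2f}] {segment.text}")
```

Note that `segments` is a generator, so transcription only runs as the segments are iterated over.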
## Citation

```
@inproceedings{alumae-etal-2023-automatic,
    title = "Automatic Closed Captioning for {E}stonian Live Broadcasts",
    author = {Alum{\"a}e, Tanel and Kalda, Joonas and Bode, K{\"u}lliki and Kaitsa, Martin},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.49",
    pages = "492--499"
}
```