AudioToText

Transcribe audio using Whisper from OpenAI.

Translate audio using Whisper and DeepL translator.

Generate captions using VTT or SRT file formats.

Introducing Whisper (OpenAI Blog)

🇪🇸 Video about Whisper (Dot CSV, in Spanish)

How to use

Open AudioToText in Google Colab and follow the step-by-step instructions.

Colab assigns you a cloud GPU to run the notebook code that transcribes and translates your audio files.

If you want to run the code on your own computer, see Local installation.

Features

English transcription
Non-English transcription
Any-to-English translation
Any-to-Any* translation

Translate the transcriptions using DeepL translator.

* See supported languages by DeepL
Save transcriptions and captions in different formats: TXT, VTT, SRT, TSV and JSON.
Choose between open-source models and the OpenAI API.
AudioToText CLI for local usage.

There are several examples in the examples folder.

Whisper Features

Audio transcription from English using Whisper

task: Transcribe

language: English

Audio transcription from almost any language using Whisper

task: Transcribe

language: Auto-Detect or select the source language of your audio file

Supported source languages by Whisper

```
Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Burmese, Castilian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, Flemish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Letzeburgesch, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Moldavian, Moldovan, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Panjabi, Pashto, Persian, Polish, Portuguese, Punjabi, Pushto, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Sinhalese, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Valencian, Vietnamese, Welsh, Yiddish, Yoruba
```

Audio translation to English using Whisper

task: Translate to English

language: Auto-Detect or select the source language of your audio file

Audio translation using DeepL translator

Whisper does not support translation into languages other than English.

As an alternative, you can use the DeepL API to translate the transcription into another language.

task: Transcribe

language: Auto-Detect or select the source language of your audio file *

Supported source languages by DeepL

source_lang

```
Bulgarian, Chinese, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian
```

* If the source language of your audio file is supported by Whisper but not by DeepL, you can use the Translate to English task to generate an English transcription first and then translate it to your desired target language using DeepL.
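The fallback described in the note above can be sketched as a small decision function. This is only an illustration: `whisper_task_for_deepl` is a hypothetical helper, not part of AudioToText, and it assumes you already know the source language of the audio (it rejects Auto-Detect).

```python
# Source languages DeepL accepts, copied from the source_lang list above.
DEEPL_SOURCE_LANGUAGES = frozenset({
    "Bulgarian", "Chinese", "Czech", "Danish", "Dutch", "English",
    "Estonian", "Finnish", "French", "German", "Greek", "Hungarian",
    "Indonesian", "Italian", "Japanese", "Korean", "Latvian", "Lithuanian",
    "Norwegian", "Polish", "Portuguese", "Romanian", "Russian", "Slovak",
    "Slovenian", "Spanish", "Swedish", "Turkish", "Ukrainian",
})


def whisper_task_for_deepl(source_language: str) -> str:
    """Pick the Whisper task to run before sending the text to DeepL.

    Hypothetical helper for illustration only.
    """
    if source_language == "Auto-Detect":
        # The decision needs a concrete language name.
        raise ValueError("select the source language explicitly")
    if source_language in DEEPL_SOURCE_LANGUAGES:
        # DeepL can read the transcription directly.
        return "transcribe"
    # Whisper understands the language but DeepL does not: let Whisper
    # produce English text, which DeepL accepts as a source language.
    return "translate"
```

For example, French audio would be transcribed and then translated by DeepL, while Hindi audio would first go through the Translate to English task.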

deepl_api_key: Your DeepL API key generated after registering for a DeepL Developer Account.

deepl_target_language: Select your desired language

Available target languages by DeepL

target_lang

```
Bulgarian, Chinese (simplified), Czech, Danish, Dutch, English (American), English (British), Estonian, Finnish, French, German, Greek, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese (Brazilian), Portuguese (European), Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian
```

The DeepL API has a free quota of 500,000 characters per month.

If you exceed your free quota, you can upgrade to DeepL API Pro or try the Free Translator Files web feature by uploading the generated transcripts.
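To get a rough idea of whether a batch of transcripts fits in the free quota, you can count characters before translating. The sketch below assumes that every character of the text sent for translation counts once against the quota; `remaining_free_quota` is a hypothetical helper, not part of AudioToText or the DeepL client.

```python
from typing import Iterable

# Free monthly quota in characters, as stated above.
DEEPL_FREE_MONTHLY_QUOTA = 500_000


def remaining_free_quota(transcripts: Iterable[str], already_used: int = 0) -> int:
    """Return the characters left after translating the given texts.

    A negative result means the batch would exceed the free quota.
    """
    needed = sum(len(text) for text in transcripts)
    return DEEPL_FREE_MONTHLY_QUOTA - already_used - needed
```

Translating the same transcript into several target languages would use its character count once per language under this assumption.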

See this example with audio transcriptions in different languages using Whisper and translation to Spanish using DeepL.

Save transcripts to different formats

output_formats: Select the desired transcript formats (comma-separated)

Available formats: txt, vtt, srt, tsv, json
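As an illustration of how a comma-separated value such as `txt,vtt,srt` can be interpreted, here is a minimal sketch. `parse_output_formats` is a hypothetical function and not necessarily how the notebook or CLI parses the option.

```python
from typing import List

# Formats listed above.
AVAILABLE_FORMATS = ("txt", "vtt", "srt", "tsv", "json")


def parse_output_formats(value: str) -> List[str]:
    """Split a comma-separated string and reject unknown formats."""
    # Ignore surrounding spaces, letter case and empty entries.
    formats = [item.strip().lower() for item in value.split(",") if item.strip()]
    unknown = [fmt for fmt in formats if fmt not in AVAILABLE_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {', '.join(unknown)}")
    return formats
```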

txt is recommended to read a transcription.

vtt or srt are recommended to add captions to an audio or video.

Transcript files will be located in the audio_transcription folder.
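The vtt and srt formats recommended above for captions are very similar. The sketch below shows the main differences in a single cue: SRT numbers each cue and uses a comma before the milliseconds, while WebVTT uses a period (and a VTT file also starts with a `WEBVTT` header line). The function names are hypothetical and the code is not taken from AudioToText.

```python
def format_timestamp(seconds: float, decimal_marker: str) -> str:
    """Format seconds as HH:MM:SS<marker>mmm."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{milliseconds:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered SRT cue (comma as decimal marker)."""
    return f"{index}\n{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}\n{text}\n"


def vtt_cue(start: float, end: float, text: str) -> str:
    """Build one WebVTT cue (period as decimal marker)."""
    return f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}\n{text}\n"
```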

Add captions to VLC media player

If you use VLC to play video or audio files, you can add your vtt or srt transcripts as captions by dragging and dropping the transcript file onto the media player or by going to Subtitles -> Add Subtitle File.

With audio-only files you will need to enable a visualization in Audio -> Visualizations.

Local installation

If you have a powerful computer with GPU hardware acceleration, you can run the notebook or CLI on your local machine.

You can also use them locally without a powerful GPU by using the API, since it always runs in the cloud.

CPU execution is also available, but it is much slower, so the Colab version or the API is recommended if you do not have a decent GPU. You can, however, try the smaller models (tiny, base, small) on your CPU.
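As a rough illustration of that advice, the sketch below checks whether PyTorch can see a CUDA GPU and suggests a model accordingly. Both functions are hypothetical helpers, and the specific choices (large-v2 with a GPU, small on CPU) are assumptions for the example rather than a rule from AudioToText.

```python
def cuda_gpu_available() -> bool:
    """Report whether PyTorch can see a CUDA GPU (False if PyTorch is missing)."""
    try:
        import torch  # normally installed together with Whisper
    except ImportError:
        return False
    return torch.cuda.is_available()


def suggest_model(gpu_available: bool) -> str:
    """Suggest a value for the model option.

    Assumption for this sketch: the largest model with a GPU,
    one of the smaller models (tiny, base, small) on CPU.
    """
    return "large-v2" if gpu_available else "small"
```

Calling `suggest_model(cuda_gpu_available())` gives a starting point that you can adjust to your hardware.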

Using AudioToText CLI

A plain Python script is available to use on your system without Jupyter.

Install AudioToText CLI

Clone this repository or download the audiototext.py script (right-click -> Save as…).
Install Python (3.8 - 3.10)
Install ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

AudioToText CLI usage

# Transcribe english.wav using large-v2 model to TXT, VTT, SRT, TSV and JSON formats
python audiototext.py examples/english/english.wav --model large-v2 --output_dir audio_transcription

# Translate french.wav from French to English using small model to TXT format
python audiototext.py examples/french-to-english/french.wav --task translate --language French --output_format txt

# Transcribe english_japanese.mp3 using API to TXT, VTT and SRT formats
python audiototext.py examples/multi-language/english_japanese.mp3 --output_formats txt,vtt,srt --api_key sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Transcribe multiple files using Whisper large-v2 model and then translate the generated transcripts to Spanish using DeepL API to TXT, VTT and SRT formats
python audiototext.py chinese.wav bruce.mp3 english_japanese.mp3 french.wav --model large-v2 --output_formats txt,vtt,srt --deepl_target_language Spanish --deepl_api_key xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:xx

# See all available options
python audiototext.py -h

positional arguments:
  audio_file            source file to transcribe

optional arguments:
  -h, --help            show this help message and exit
  --task {transcribe,translate}
                        transcribe (default) or translate (to English)
  --model {tiny,base,small,medium,large-v1,large-v2}
                        model to use (default: small)
  --language {Auto-Detect,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
                        source file language (default: Auto-Detect)
  --prompt PROMPT       provide context about the audio or encourage a specific writing style, see https://platform.openai.com/docs/guides/speech-to-text/prompting
  --coherence_preference {True,False}
                        True (default): More coherence, but may repeat text. False: Less repetitions, but may have less coherence
  --api_key API_KEY     if set with your OpenAI API Key (https://platform.openai.com/account/api-keys), the OpenAI API is used, which can improve the inference speed substantially, but it has an associated cost, see API pricing: https://openai.com/pricing#audio-models.
                        API model is large-v2 (ignores --model)
  --output_formats OUTPUT_FORMATS, --output_format OUTPUT_FORMATS
                        desired result formats (default: txt,vtt,srt,tsv,json)
  --output_dir OUTPUT_DIR
                        folder to save results (default: audio_transcription)
  --deepl_api_key DEEPL_API_KEY
                        DeepL API key, if you want to translate results using DeepL. Get a DeepL Developer Account API Key: https://www.deepl.com/pro-api
  --deepl_target_language {Bulgarian,Chinese,Chinese (simplified),Czech,Danish,Dutch,English,English (American),English (British),Estonian,Finnish,French,German,Greek,Hungarian,Indonesian,Italian,Japanese,Korean,Latvian,Lithuanian,Norwegian,Polish,Portuguese,Portuguese (Brazilian),Portuguese (European),Romanian,Russian,Slovak,Slovenian,Spanish,Swedish,Turkish,Ukrainian}
                        results target language if you want to translate results using DeepL (--deepl_api_key required)
  --deepl_coherence_preference {True,False}
                        True (default): Share context between lines while translating. False: Translate each line independently
  --deepl_formality {default,formal,informal}
                        whether the translated text should lean towards formal or informal language (languages with formality supported: German,French,Italian,Spanish,Dutch,Polish,Portuguese,Russian)
  --skip-install        skip pip dependencies installation
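If you prefer to drive the CLI from another Python program, for example to process files in batches, one option is to build the command line and run it with `subprocess`. The sketch below uses only flags shown in the help output above; `build_command` and `run_audiototext` are hypothetical names, and it assumes `audiototext.py` is in the current working directory.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def build_command(
    audio_files: Sequence[str],
    task: Optional[str] = None,
    model: Optional[str] = None,
    language: Optional[str] = None,
    output_formats: Optional[Sequence[str]] = None,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Build the argument list for audiototext.py, omitting unset options."""
    command = [sys.executable, "audiototext.py", *audio_files]
    if task:
        command += ["--task", task]
    if model:
        command += ["--model", model]
    if language:
        command += ["--language", language]
    if output_formats:
        # The CLI expects a single comma-separated value.
        command += ["--output_formats", ",".join(output_formats)]
    if output_dir:
        command += ["--output_dir", output_dir]
    return command


def run_audiototext(audio_files: Sequence[str], **options) -> None:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(audio_files, **options), check=True)
```

For instance, `run_audiototext(["french.wav"], task="translate", language="French", output_formats=["txt"])` would mirror the second usage example above.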

Using Google Colab with your local environment

Google Colab lets you connect to a local runtime using Jupyter. This lets you run the notebook on your local hardware with access to your local file system.

How to set up and connect to a local runtime in Google Colab

Using Jupyter Notebook

If you do not want to rely on Google Colab or use the AudioToText CLI, you can use the Jupyter Notebook interface.

How to install Jupyter Notebook

Clone or download this repository and run inside this repository folder:

jupyter notebook AudioToText.ipynb

Or just run jupyter notebook without cloning this repository and upload the AudioToText.ipynb file (right-click -> Save as…).

Using Jupyter Lab

An alternative to the Jupyter Notebook interface is the Jupyter Lab interface.

How to install Jupyter Lab

jupyter lab

Open the notebook using a URL:

File -> Open from URL…

https://raw.githubusercontent.com/Carleslc/AudioToText/master/AudioToText.ipynb

Using Whisper CLI

If you do not need a cloud GPU and you do not want to translate using DeepL, you can just use the Whisper CLI in your console as follows:

Install Whisper CLI locally

Install Python (3.8 - 3.10)
Install ffmpeg
Install Whisper CLI

pip install -U openai-whisper

Whisper CLI usage

# Transcribe english.wav using large-v2 model to TXT, VTT, SRT, TSV and JSON formats
whisper english.wav --model large-v2 --output_dir audio_transcription --output_format all

# Translate french.wav from French to English using small model to TXT format
whisper french.wav --task translate --language French --output_dir audio_transcription --output_format txt

# Transcribe multiple files using large-v2 model to TXT, VTT, SRT, TSV and JSON formats
whisper chinese.wav bruce.mp3 english_japanese.mp3 french.wav --model large-v2 --output_dir audio_transcription

# See all available options
whisper --help