How to use Whisper in Python

Whisper is an open-source speech recognition model from OpenAI that transcribes audio to text with high accuracy. In this article I will show you how to run this model with Python and get transcriptions from an audio file.

1. Creating your environment

I recommend that you use Conda to create the Python environment and Poetry to handle the dependencies:

mkdir whisper_project
cd whisper_project
conda create --name whisper_project python=3.10
conda activate whisper_project
conda install -c conda-forge ffmpeg 
conda install -c conda-forge poetry
poetry init

Note that I installed ffmpeg in the environment because Whisper uses it to decode audio files; without it, transcription will fail.
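If you want to double-check that ffmpeg is visible from Python before moving on, you can run a small check (this helper is my own addition, not part of Whisper):

# check that the ffmpeg binary is on the PATH of the active environment
import shutil

ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    raise SystemExit("ffmpeg not found -- activate the conda environment first")
print("ffmpeg found at", ffmpeg_path)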

2. Installing Whisper

When you have your environment ready, you can install Whisper using the following command:

poetry add openai-whisper
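To confirm the installation worked, you can import the package and list the model names it knows about (available_models is part of the openai-whisper API):

# quick sanity check that the package imports correctly
import whisper

# prints the model names, e.g. ['tiny.en', 'tiny', ..., 'large']
print(whisper.available_models())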

3. Using Whisper

Now that you have Whisper installed, you can create a main.py file, import Whisper as a Python package, and load the model you want to use. There are five model sizes offering a tradeoff between speed and accuracy, as shown in the table below.

| Size   | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|--------------------|--------------------|---------------|----------------|
| tiny   | tiny.en            | tiny               | ~1 GB         | ~32x           |
| base   | base.en            | base               | ~1 GB         | ~16x           |
| small  | small.en           | small              | ~2 GB         | ~6x            |
| medium | medium.en          | medium             | ~5 GB         | ~2x            |
| large  | N/A                | large              | ~10 GB        | 1x             |
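If you do not have a GPU with around 10 GB of VRAM, a smaller model is a reasonable alternative. Here is a sketch that picks the device explicitly (the device selection is my own suggestion; Whisper also detects CUDA on its own):

import torch
import whisper

# fall back to CPU when no CUDA device is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)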

Here is the code in main.py to use Whisper with Python:

# main.py
import whisper

# load the model once at import time; 'large' needs ~10 GB of VRAM
model = whisper.load_model('large')

def get_transcribe(audio: str, language: str = 'en'):
    # verbose=True prints each segment with timestamps as it is decoded
    return model.transcribe(audio=audio, language=language, verbose=True)

if __name__ == "__main__":
    result = get_transcribe(audio='./input/audio.wav')
    print('-' * 50)
    # transcribe() returns a dict; 'text' holds the full transcription
    print(result.get('text', ''))

The get_transcribe function returns the transcription of an audio file. It takes two arguments: audio, the path to the audio file in your environment, and language, the language spoken in the audio. Whisper can detect the language automatically, but it works better if you specify it from the start. In this case I will use the following audio file and get the transcription.
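For example, to transcribe Spanish audio you would pass the corresponding language code (the file name here is a placeholder for your own audio):

# 'es' is the ISO 639-1 code for Spanish
result = get_transcribe(audio='./input/audio_es.wav', language='es')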

4. Running the script

Now, in your terminal, you can run the script with the following command:

python main.py

🎉🎉🎉

You can also use Whisper from a Jupyter notebook; in this case I created a file called demo.ipynb.
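A notebook cell looks essentially the same as the script; a minimal sketch, assuming the same ./input/audio.wav file:

# demo.ipynb -- single cell
import whisper

model = whisper.load_model('large')
result = model.transcribe('./input/audio.wav', language='en', verbose=True)
result['text']  # the full transcription as a string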

5. Save results in files

Whisper ships with a set of utilities for saving transcription results in different formats. You can use the get_writer function to get a writer and save the results to a file in a specified format.

from whisper.utils import get_writer

# get_writer(format, output_dir) returns a callable writer object
writer = get_writer("tsv", './')
writer(result, 'transcribe.tsv')
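Besides "tsv", get_writer accepts "txt", "vtt", "srt", "json", and "all"; the last one writes every format at once. For example:

# write txt, vtt, srt, tsv and json in a single call
# note: the output directory must already exist
writer = get_writer("all", './output/')
writer(result, 'transcribe')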

Implementing this in the main.py file, you can save the results in a file with the following code:

# main.py
import os

import whisper
from whisper.utils import get_writer

# load the model once at import time; 'large' needs ~10 GB of VRAM
model = whisper.load_model('large')


def get_transcribe(audio: str, language: str = 'en'):
    # verbose=True prints each segment with timestamps as it is decoded
    return model.transcribe(audio=audio, language=language, verbose=True)


def save_file(results, format='tsv'):
    # make sure the output directory exists before writing
    os.makedirs('./output/', exist_ok=True)
    writer = get_writer(format, './output/')
    writer(results, f'transcribe.{format}')


if __name__ == "__main__":
    result = get_transcribe(audio='./input/audio.wav')
    print('-' * 50)
    print(result.get('text', ''))
    save_file(result)          # tsv (default)
    save_file(result, 'txt')
    save_file(result, 'srt')
As a result, you now have the transcription in tsv, txt, and srt formats.


And the project structure will look like this:

.
├── input
│   └── audio.wav
├── main.py
├── output
│   ├── transcribe.srt
│   ├── transcribe.tsv
│   └── transcribe.txt
├── poetry.lock
└── pyproject.toml