How to transcribe YouTube videos with OpenAI Whisper

Picture by Seej Nguyen, Pexels.



OpenAI Whisper is a new open source automatic speech recognition (ASR) model from OpenAI, the company that also brought us the GPT-3 language models. Like GPT-3, it’s been trained on a very large and diverse dataset, in Whisper’s case around 680,000 hours of multilingual audio collected from the web, and it can perform a range of tasks.

As well as transcription - the process of converting audio to text - it can also perform language detection and translation into English. It comes in several model sizes, most of which also have an English-only variant, so you can trade speed against accuracy.

You could, of course, use Whisper for a wide range of applications, including transcribing lectures, podcasts, interviews, or call recordings to text. But in this post, we’re going to focus on how to transcribe YouTube videos with OpenAI Whisper. I’ll show you how easy it is to download a video from YouTube, transcode the video file to an MP3 audio file, and then use OpenAI Whisper to transcribe it and extract the text from within.

Install the packages

For this project I’m working inside a Jupyter notebook, but you can do this in a regular Python script if you prefer. Whisper uses PyTorch, so a machine with a CUDA-capable NVIDIA GPU will make it run much faster, although the smaller models will also run on a CPU. We’ll be using youtube_dl to download the YouTube videos, FFmpeg to transcode them to MP3, and OpenAI Whisper to transcribe them.

To use OpenAI Whisper you will need to install the latest version from the OpenAI GitHub repository using a Pip install. You’ll also need to install the setuptools-rust package and ensure that the ffmpeg transcoding package is installed on your machine.

You can install ffmpeg by entering sudo apt -y install ffmpeg if you’re working locally. If you’re running inside a Docker container, such as the NVIDIA Data Science Stack, you’ll need to use !apt -y install ffmpeg.

!pip3 install youtube_dl
!pip install git+https://github.com/openai/whisper.git
!pip3 install setuptools-rust
!apt -y install ffmpeg
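
Before going any further, it can be worth confirming that ffmpeg is actually visible to Python, since both youtube_dl and Whisper shell out to it. The following is a minimal sketch using only the standard library; the helper names ffmpeg_available and ffmpeg_version are my own and are not part of either package.

```python
import shutil
import subprocess


def ffmpeg_available(binary="ffmpeg"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(binary) is not None


def ffmpeg_version(binary="ffmpeg"):
    """Return the first line printed by `<binary> -version`, or None if it is missing."""
    if not ffmpeg_available(binary):
        return None
    completed = subprocess.run(
        [binary, "-version"], capture_output=True, text=True, check=False
    )
    lines = completed.stdout.splitlines()
    return lines[0] if lines else None


# Report what was found, or a hint if nothing is installed
print(ffmpeg_version() or "ffmpeg not found on PATH - install it before continuing")
```

If this prints the "not found" message, rerun the install step above before trying to download anything.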

Import the packages

Once you’ve installed the software, the next step is to import the packages we’re using. We need to import youtube_dl to download the YouTube videos and whisper to transcribe the audio. The unicode_literals import from __future__ only matters on Python 2. On Python 3 it has no effect, so you can safely drop it.

from __future__ import unicode_literals
import youtube_dl
import whisper

OpenAI Whisper currently throws various warnings. To suppress these, we’ll import the warnings package and use the filterwarnings function to ignore them. Blanket suppression like this can hide genuine problems, so you’ll benefit from leaving it out in your own code.

import warnings
warnings.filterwarnings("ignore")
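
If you’d rather not silence everything, one alternative is to suppress a single warning category only while a particular call runs. This is a rough sketch using the standard library’s warnings.catch_warnings context manager. The noisy function is a made-up stand-in for a library call, and I’m assuming the warnings you want to hide are UserWarning instances, which you’d need to check against what you actually see.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits warnings."""
    warnings.warn("example warning from a library", UserWarning)
    warnings.warn("example deprecation notice", DeprecationWarning)
    return "done"


def run_quietly(func, category=UserWarning):
    """Call func with warnings of `category` ignored, only for the duration of the call."""
    with warnings.catch_warnings():
        # The filter added here is discarded again when the with-block exits
        warnings.simplefilter("ignore", category=category)
        return func()


result = run_quietly(noisy)
```

Other warning categories still follow whatever filters are active outside the call, and the global warning configuration is left untouched afterwards.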

Download and transcode a YouTube video

First, we’ll create a function that uses youtube_dl to download a video from YouTube and transcode it to an MP3 audio file using FFmpeg. The function downloads the best available audio stream, converts it to MP3, deletes the original download, and returns the filename of the local MP3.

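import os

def save_to_mp3(url):
    """Save a YouTube video URL to mp3.

    Args:
        url (str): A YouTube video URL.

    Returns:
        str: The filename of the mp3 file.
    """

    options = {
        # Prefer the best audio-only stream, falling back to the best combined stream
        'format': 'bestaudio/best',
        'postprocessors': [{
            # Use FFmpeg to extract the audio track and encode it as 192 kbps MP3
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
    }

    with youtube_dl.YoutubeDL(options) as downloader:
        downloader.download([url])

        # Look up the metadata again (without downloading) to work out the
        # name of the file that was originally downloaded
        info = downloader.extract_info(url, download=False)
        source_filename = downloader.prepare_filename(info)

    # The post-processor keeps the base name and swaps the extension for .mp3,
    # whatever the source container was (.m4a, .webm, ...)
    return os.path.splitext(source_filename)[0] + ".mp3"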

Now we can use the function to download a YouTube video and save it as an MP3 file.

youtube_url = "https://www.youtube.com/watch?v=UFNRUuBARM4"
filename = save_to_mp3(youtube_url)

[youtube] UFNRUuBARM4: Downloading webpage
[download] Destination: Liz Truss in bizarre speech about cheese, pork and apples at Conservative conference in 2014-UFNRUuBARM4.m4a
[download] 100% of 2.58MiB
[ffmpeg] Correcting container in "Liz Truss in bizarre speech about cheese, pork and apples at Conservative conference in 2014-UFNRUuBARM4.m4a"
[ffmpeg] Destination: Liz Truss in bizarre speech about cheese, pork and apples at Conservative conference in 2014-UFNRUuBARM4.mp3
Deleting original file Liz Truss in bizarre speech about cheese, pork and apples at Conservative conference in 2014-UFNRUuBARM4.m4a (pass -k to keep)
[youtube] UFNRUuBARM4: Downloading webpage

filename

'Liz Truss in bizarre speech about cheese, pork and apples at Conservative conference in 2014-UFNRUuBARM4.mp3'
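
The filename above is built from the full video title, which can be awkward to work with. If you want something more predictable, youtube_dl accepts an outtmpl option that controls the output filename. The sketch below only builds the options dictionary and the filename you’d expect afterwards; build_options and expected_mp3_path are names I’ve made up for illustration, and the audio directory in the usage comment is an assumption.

```python
import os


def build_options(output_dir="."):
    """Build youtube_dl options that name the download after the video ID."""
    return {
        "format": "bestaudio/best",
        # %(id)s and %(ext)s are filled in by youtube_dl at download time
        "outtmpl": os.path.join(output_dir, "%(id)s.%(ext)s"),
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
            "preferredquality": "192",
        }],
    }


def expected_mp3_path(video_id, output_dir="."):
    """Return the path the MP3 should end up at, given the template above."""
    return os.path.join(output_dir, video_id + ".mp3")


# Usage (assumes youtube_dl is installed and an "audio" directory is wanted):
#   with youtube_dl.YoutubeDL(build_options("audio")) as downloader:
#       downloader.download([youtube_url])
#   filename = expected_mp3_path("UFNRUuBARM4", "audio")
```

Naming files by video ID also sidesteps problems with titles that contain characters your filesystem doesn’t like.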

Transcribe the audio file using Whisper

Next you need to use the load_model() function to load the Whisper model you wish to use. Models come in several sizes (tiny, base, small, medium, and large), and every size except large also has an English-only variant such as base.en, so you may wish to experiment to see which model gives the best results for your audio.

The issue I encountered was that all the models larger than the standard base model were too large for my 6GB NVIDIA GPU to handle, since PyTorch itself is using about 4.2GB of the GPU memory. If you want to run one of the larger, more accurate models, you’re going to need a more powerful GPU.

model = whisper.load_model("base")
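
To make the choice of model a little less trial and error, you could pick the largest one that fits in the GPU memory you have free. This is only a rough sketch: the memory figures are the approximate requirements quoted in the Whisper README at the time of writing, real usage can be higher, and largest_model_that_fits is a hypothetical helper rather than anything Whisper provides.

```python
# Approximate VRAM needed per model in GB, smallest first (assumed from the Whisper README)
APPROX_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def largest_model_that_fits(free_vram_gb, default="tiny"):
    """Return the name of the largest model whose approximate need fits in free_vram_gb."""
    chosen = default
    for name, required_gb in APPROX_VRAM_GB:
        if required_gb <= free_vram_gb:
            chosen = name
    return chosen


# A 6GB card with about 4.2GB already in use leaves roughly 1.8GB free
model_name = largest_model_that_fits(6 - 4.2)
# model = whisper.load_model(model_name)
```

With the numbers from my machine this selects the base model, which matches what I found by experiment.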

To transcribe the audio file we need to use the transcribe() function and pass in the file path of the audio file stored in filename. I found that I had to set the fp16 argument to False in order to prevent an error being thrown which said: “RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same.”

result = model.transcribe(filename, fp16=False)
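
The transcribe() function also accepts decoding options such as language and task, so the same call can skip language detection or translate speech into English. Here’s a small sketch of a helper that assembles those keyword arguments. The name build_transcribe_kwargs is my own invention, and it defaults fp16 to False only because that’s what worked on my machine.

```python
def build_transcribe_kwargs(language=None, translate=False, use_fp16=False):
    """Assemble keyword arguments for model.transcribe()."""
    kwargs = {
        "fp16": use_fp16,
        # "translate" asks Whisper to output English text; "transcribe" keeps the source language
        "task": "translate" if translate else "transcribe",
    }
    if language is not None:
        # Supplying the language skips Whisper's automatic language detection
        kwargs["language"] = language
    return kwargs


# Usage (assumes `model` and `filename` from the steps above):
#   result = model.transcribe(filename, **build_transcribe_kwargs(language="en"))
```

Leaving language as None keeps the default behaviour, where Whisper detects the language from the start of the audio.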

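Once this has run through the model, you can extract the text from the result returned, which is a Python dictionary containing the full text, the individual timestamped segments, and the detected language. The results are very impressive, given the speed and the fact we’re using one of the smaller models in its general multilingual form rather than an English-only variant.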

result['text']

"I want to see us eating more British food here in Britain. At the moment we import two thirds of all of our apples. We import nine tenths of all of our pears. We import two thirds of our cheese. That is a disgrace. From the apples that dropped on Isaac Newton's head to the orchards of nursery rhymes, this fruit has always been part of Britain. It has been part of our country. I want our children to grow up knowing the taste of a British apple, of Cornish sardines, of harifred chapairs, of Norfolk turkey, of melting mobri pork pies, and of course of black pudding. Under a Conservative Government, Britain will lead the world in food, farming, and the environment. In a fortnight, I'm going to Paris for the world's largest food trade fair, and I will be bigging up British products. In December, I'll be in Beijing, opening up new pork markets. I am determined that our producers will have access to more markets, both home and abroad, generating jobs and security for millions. I'm determined to press ahead, restoring habitats, cleaning rivers, and improving our atmosphere so that future generations can enjoy clean air and enjoy the countryside. I'm determined that our flood defences will always be strong enough to protect us from the ravages of a changing climate. And I will not rest until the British apple is back at the top of the tree."

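The dictionary returned by transcribe() also contains a segments list, where each segment has start and end times in seconds alongside its text. If you want subtitles rather than a block of text, you could turn those segments into SRT format. This is a minimal sketch: the two sample segments use real sentences from the transcript above but invented timestamps, and format_timestamp and segments_to_srt are hypothetical helper names.

```python
def format_timestamp(seconds):
    """Convert seconds to the HH:MM:SS,mmm format used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render a list of Whisper-style segments as SRT subtitle text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Sample data with made-up timestamps; in practice use result["segments"]
sample_segments = [
    {"start": 0.0, "end": 3.5, "text": " I want to see us eating more British food here in Britain."},
    {"start": 3.5, "end": 7.0, "text": " At the moment we import two thirds of all of our apples."},
]
srt_text = segments_to_srt(sample_segments)
```

You could then write srt_text to a .srt file and load it alongside the video in most media players.
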
Matt Clarke, Saturday, October 01, 2022
