AI Transcription from R using Whisper: Part 2

teaching
audio
AI
Speedup on Windows using WSL2 and CUDA
Author

Jeffrey Girard

Published

August 16, 2024

Introduction

In a previous blog post, I discussed using the audio.whisper R package to do local, AI-based audio transcription. It worked well but was prohibitively slow (e.g., ~1 minute to process each second of audio). In this blog post, I will discuss how to achieve considerable speed improvements on Windows through a combination of hardware and software. Parts of this post will be more technical than the last, but hang in there and I’ll do my best to make it achievable.

Before we dive into things, I’ll provide a brief overview of all the steps.

  1. Check that our computer’s hardware supports CUDA
  2. Install/update the NVIDIA graphics driver on Windows
  3. Install and update the Windows Subsystem for Linux (WSL2) on Windows
  4. Install and set up the Ubuntu operating system via WSL2
  5. Install the CUDA Toolkit for WSL on Ubuntu
  6. Install R and dependency packages on Ubuntu
  7. Install audio.whisper R package with CUDA support on Ubuntu
  8. Test and time the model

Check for CUDA Support

This post assumes that you are using the Windows operating system and that your computer’s graphics card supports CUDA. To check that this is the case, first look up your graphics card’s model number. An easy way to do this on Windows 10/11 is to click on the desktop search bar (bottom-left of the screen next to the Windows icon) and type in “Device Manager.” Then click the arrow next to “Display adapters” and find your graphics card’s model name. On my computer, it says “NVIDIA GeForce RTX 2060.” Then go to NVIDIA’s CUDA GPUs page (https://developer.nvidia.com/cuda-gpus) and click the “CUDA-Enabled NVIDIA Quadro and NVIDIA RTX” and “CUDA-Enabled GeForce and TITAN Products” blocks to open their accordions. Then search for your graphics card’s model number (the left tables are for desktop cards and the right tables are for notebook cards). I found “GeForce RTX 2060” on the list under GeForce and TITAN Products with a compute capability of 7.5. Thus, my card is supported!

Install the Newest NVIDIA Graphics Driver

Download and install the newest graphics driver for your card from NVIDIA. You should choose the Game Ready version. Note that you should not install the CUDA toolkit on Windows as doing so may confuse things and lead to issues later on (as we will be installing the CUDA toolkit for WSL in a later step).

Install and Update the Windows Subsystem for Linux

Open the Microsoft Store app (e.g., using the desktop search bar) and search for the “Windows Subsystem for Linux.” If it comes up in the search results, click its Install button. If it doesn’t come up, you may already have it installed; you can check this by clicking the “Library” button on the left sidebar in the app and searching for it there. If you still can’t find it, open the Command Prompt app (e.g., using the desktop search bar) and type or paste the following command: wsl --install. After it installs (using either method), it will ask you to restart your computer. Once restarted, open the Command Prompt app again and type or paste the following command: wsl --update. This will ensure that you have the most recent version of WSL2 installed.

Install and Set Up Ubuntu

In the Command Prompt app, type or paste the following command: wsl --install Ubuntu. This will install the Ubuntu Linux operating system over the course of several minutes. After installation, it will prompt you to create a UNIX username and password. Use whatever you want, but don’t lose this information as you will need it again later. Once you reach the Ubuntu prompt, you can confirm that Ubuntu can see your graphics card by running the nvidia-smi command, which should print a small table listing your card’s model name and driver version.

Install the CUDA Toolkit for WSL

In the Ubuntu console (which is opened automatically after Ubuntu is installed), enter or paste the following commands to install the CUDA Toolkit for WSL-Ubuntu. It will ask you to enter your password (created in the previous step) and may take several minutes to complete.

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit
Important

Do not install any NVIDIA graphics drivers on Ubuntu directly (i.e., install cuda-toolkit and not cuda or cuda-drivers). Ubuntu will inherit the Windows drivers you installed in a previous step via WSL.

If you get timeout errors when trying to install things on WSL, check to make sure that you are not connected to a VPN on Windows as this can mess things up.

Install R on Ubuntu

In the Ubuntu console, enter or paste the following commands to install R as well as some system libraries and R packages that are commonly needed. You may have to hit ENTER and type Y several times when prompted to do so.

wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
sudo apt install -y --no-install-recommends r-base
sudo apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev libudunits2-dev libgdal-dev cargo libfontconfig1-dev libcairo2-dev
sudo add-apt-repository ppa:c2d4u.team/c2d4u4.0+
sudo apt upgrade
sudo apt install -y --no-install-recommends r-cran-devtools r-cran-av r-cran-tidyverse

Install audio.whisper with CUDA Support on Ubuntu

In the Ubuntu console, type or paste the following command: sudo R to open the R console. You will then need to set several environment variables before installing the audio.whisper package from GitHub; these tell the build process where to find the CUDA Toolkit and instruct it to compile with CUDA (cuBLAS) support. Do so by entering or pasting the following commands into the R console:

# Make the CUDA compiler available on the PATH
Sys.setenv(PATH = sprintf("%s:/usr/local/cuda/bin", Sys.getenv("PATH")))

# Tell the build process where the CUDA Toolkit lives
Sys.setenv(CUDA_PATH = "/usr/local/cuda")

# Ask audio.whisper to compile whisper.cpp with CUDA (cuBLAS) support
Sys.setenv(WHISPER_CUBLAS = "1")

# Install the package from GitHub (compilation may take several minutes)
remotes::install_github("bnosac/audio.whisper")

Test and Time the Model

Base Model and Short Audio Clip

In the R console, load the audio.whisper package and try it out on the JFK clip that took so long to process in the previous blog post. Note that there will be one important change to the commands from before. This time, when we load the model using the whisper() function, we will add the use_gpu = TRUE argument.

# Load the package from library
library(audio.whisper)

# Download or load from file the desired model (with GPU support)
model <- whisper("base", use_gpu = TRUE)

# Construct file path to example audio file in package data
jfk <- system.file(package = "audio.whisper", "samples", "jfk.wav")

# Run English transcription using the downloaded whisper model
out <- predict(model, newdata = jfk, language = "en")

# Print transcript
out$data
 segment segment_offset         from           to  text
       1              0 00:00:00.000 00:00:07.600  And so my fellow Americans, ask not what your country can do for you,
       2              0 00:00:07.600 00:00:10.600  ask what you can do for your country.

The results look the same as before (i.e., still good), but check out the timing!!!

out$timing
## $transcription_start
## [1] "2024-08-26 11:37:06 CDT"
## 
## $transcription_end
## [1] "2024-08-26 11:37:07 CDT"
## 
## $transcription_duration
## Time difference of 0.004918555 mins
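
As an aside, if you just want the transcript as plain text without the timestamps, you can paste the segments together. Here is a minimal sketch using the out object from above (transcript is just a variable name I chose for illustration):

# Collapse the per-segment text into a single transcript string
transcript <- paste(trimws(out$data$text), collapse = " ")
cat(transcript)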

Large Model and Long Audio Clip

With such a boost in speed (the 11-second JFK clip transcribed in roughly 0.3 seconds), we can afford to try a larger model (e.g., "large-v3") on a longer audio clip (e.g., a 1.35 min poetry reading by Gerard Malanga that is rather noisy and therefore a good test of the model’s accuracy in real-world settings). This is also a chance to show how to process a file downloaded from the internet, in case that is of interest to any readers. We’ll use the following commands in the R console in Ubuntu:

# Load package from library (it was installed earlier via apt)
library(av)

# Download audio file from ubu.com
download.file(
  url = "https://ubu.com/media/sound/malanga_gerard/archives/Malanga-Gerard_Archives_01-Gerard-Malanga_To-The-Young-Model-Name-Unknown.mp3", 
  destfile = "malanga.mp3", 
  mode = "wb"
)

# Convert audio from mp3 to 16 kHz wav
av_audio_convert(
  "malanga.mp3", 
  output = "malanga.wav", 
  format = "wav", 
  sample_rate = 16000
)
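
# Optional sanity check: inspect the converted file's duration and
# sample rate (av_media_info() comes from the same av package loaded above)
av_media_info("malanga.wav")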

# Download or load from file the large model with GPU support
model <- whisper("large-v3", use_gpu = TRUE)

# Run English transcription using the downloaded whisper model
out <- predict(model, newdata = "malanga.wav", language = "en")

# Print the transcript
out$data
 segment segment_offset         from           to  text
       1              0 00:00:00.000 00:00:06.000  Gary Malanga will read me some poems which are shortly to be illustrated by Andy Warhol.
       2              0 00:00:06.000 00:00:10.000  The poems, as far as I can tell, do not relate particularly to this exhibit,
       3              0 00:00:10.000 00:00:13.000  but Mr. Malanga thought it would be an appropriate setting for his poetry.
       4              0 00:00:13.000 00:00:17.000  The length of the reading will be about 45 minutes.
       5              0 00:00:17.000 00:00:19.000  Thank you so much for your patience, Mr. Malanga.
       6              0 00:00:19.000 00:00:26.000  Can everyone hear me?
       7              0 00:00:26.000 00:00:33.000  This poem is actually the first poem I ever wrote in this series of fashion poems,
       8              0 00:00:33.000 00:00:38.000  entitled "To a Young Model Name Unknown," photographed by Francesco Scuvullo.
       9              0 00:00:38.000 00:00:47.000  The Peckin Peck girl applauds the strategy of Hadley Kashmir,
      10              0 00:00:47.000 00:00:52.000  now gentle country air left, to go calling in the afternoon,
      11              0 00:00:52.000 00:00:56.000  pale gray, flannel-dressed, gracefully princessed,
      12              0 00:00:56.000 00:01:03.000  its gray collar deeply cut, filled with a fluff of gray rabbit fur.
      13              0 00:01:03.000 00:01:07.000  The new country look of the jumpsuit opposite.
      14              0 00:01:07.000 00:01:14.000  Here, fresh, bright and Irish in white, stitched sheer navy blue wool,
      15              0 00:01:14.000 00:01:20.000  Irish country airs, the changing outline of Irish fashion.
      16              0 00:01:20.000 00:01:21.780  Thank you.

The results look really good despite the background noise. The only errors I noticed were in segment 11, where I think he says “princess shaped” rather than “princessed” (though I could be wrong), and in segment 16, where I can’t hear him say “Thank you,” so that may have been hallucinated (or perhaps said in the background). Not bad at all. And check out the timing: about 12 seconds for the large model on an 81-second clip.

out$timing
## $transcription_start
## [1] "2024-08-16 13:33:07 CDT"
## 
## $transcription_end
## [1] "2024-08-16 13:33:19 CDT"
## 
## $transcription_duration
## Time difference of 0.1988164 mins
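
If you want to quantify the speedup, you can compute a rough real-time factor from the timing information. Here is a minimal sketch, where proc_secs and audio_secs are just variable names I chose and 81 seconds is the clip length mentioned above:

# Convert the transcription duration (a difftime object) to seconds
proc_secs <- as.numeric(out$timing$transcription_duration, units = "secs")

# Real-time factor: seconds of audio transcribed per second of computation
audio_secs <- 81
audio_secs / proc_secs

On my machine, that works out to roughly 7 seconds of audio per second of computation, even with the largest model.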

Wrap-up

If you want to save the transcript, you can enter the following command in the R console: saveRDS(out, "malanga.rds"). This will create a serialized R data file containing all the transcript data (e.g., text and timestamps). By default, this file will be saved in the working directory, which is the folder on your Windows file system that you ran the Command Prompt app from (e.g., “C:/Users/jeffg”). However, you can save it anywhere using WSL’s /mnt/ drive mounts. For example, if you wanted to save it to “C:/Users/jeffg/Desktop”, then you would use "/mnt/c/users/jeffg/Desktop/malanga.rds" as the second argument to saveRDS(). Or, if you wanted to save it to a mapped network drive like “Z:/affcomlab/transcription”, then you would use "/mnt/z/affcomlab/transcription/malanga.rds".
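
Putting that together, here is a small sketch using my own folders from the examples above (substitute your own paths):

# Save the full transcript object to the Windows Desktop via /mnt/
saveRDS(out, "/mnt/c/users/jeffg/Desktop/malanga.rds")

# Read it back into R later
out <- readRDS("/mnt/c/users/jeffg/Desktop/malanga.rds")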

That wraps up this blog post. In the next part, I will discuss more practical aspects of using this technology. For example, I’ll talk about how to generate a list of audio/video files on your hard drive (or elsewhere) and then iterate over them to create transcripts from many files all at once.
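
As a tiny preview, the basic pattern will look something like the following sketch (the folder path is hypothetical, and Part 3 will cover the details):

# List all wav files in a (hypothetical) folder of prepared audio
files <- list.files("/mnt/c/users/jeffg/audio", pattern = "\\.wav$", full.names = TRUE)

# Transcribe each file with the model loaded earlier
transcripts <- lapply(files, function(f) predict(model, newdata = f, language = "en"))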

Part 3 coming soon…