segment | segment_offset | from | to | text |
---|---|---|---|---|
1 | 0 | 00:00:00.000 | 00:00:07.600 | And so my fellow Americans, ask not what your country can do for you, |
2 | 0 | 00:00:07.600 | 00:00:10.600 | ask what you can do for your country. |
AI Transcription from R using Whisper: Part 2
Introduction
In a previous blog post, I discussed using the audio.whisper R package to do local, AI-based audio transcription. It worked well but was prohibitively slow (e.g., ~1 minute to process each second of audio). In this blog post, I will discuss how to achieve considerable speed improvements on Windows through a combination of hardware and software. Parts will be more technical but hang in there and I’ll do my best to make it achievable.
Before we dive into things, I’ll provide a brief overview of all the steps.
- Check that our computer’s hardware supports CUDA
- Install/update the NVIDIA graphics driver on Windows
- Install and update the Windows Subsystem for Linux (WSL2) on Windows
- Install and setup the Ubuntu operating system via WSL2
- Install the CUDA Toolkit for WSL on Ubuntu
- Install R and dependency packages on Ubuntu
- Install audio.whisper R package with CUDA support on Ubuntu
- Test and time the model
Check for CUDA Support
This post assumes that you are using the Windows operating system and that your computer’s graphics card supports CUDA. To check that this is the case, first look up your graphics card’s model name. An easy way to do this on Windows 10/11 is to click on the desktop search bar (bottom-left of the screen next to the Windows icon) and type in “Device Manager.” Then click the arrow next to “Display adapters” and find your graphics card’s model name. On my computer, it says “NVIDIA GeForce RTX 2060.” Then go to this link and click the “CUDA-Enabled NVIDIA Quadro and NVIDIA RTX” and “CUDA-Enabled GeForce and TITAN Products” blocks to open their accordions. Then search for your graphics card’s model name (the left tables are for desktop cards and the right tables are for notebook cards). I found “GeForce RTX 2060” on the list under GeForce and TITAN Products with a compute capability of 7.5. Thus, my card is supported!
Install the Newest NVIDIA Graphics Driver
Download and install the newest graphics driver for your card from NVIDIA. You should choose the Game Ready version. Note that you should not install the CUDA toolkit on Windows as doing so may confuse things and lead to issues later on (as we will be installing the CUDA toolkit for WSL in a later step).
Install and Update the Windows Subsystem for Linux
Open the Microsoft Store app (e.g., using the desktop search bar) and search for “Windows Subsystem for Linux.” If it comes up in the search results, click on the Install button. If it doesn’t come up, you may already have it installed; you can check this by clicking the “Library” button on the left sidebar in the app and searching for it there. If you can’t find it in either place, then open the Command Prompt app (e.g., using the desktop search bar) and type or paste the following command:
wsl --install
After it installs (using either method), it will ask you to restart your computer. Once restarted, open the Command Prompt app again and type or paste the following command:
wsl --update
This will ensure that you have the most recent version of WSL2 installed.
Install and Setup Ubuntu
In the Command Prompt app, type or paste the following command:
wsl --install Ubuntu
This will install the Ubuntu Linux operating system over the course of several minutes. After installation, it will prompt you to create a UNIX username and password. Use whatever you want but don’t lose this information as you will need it again later.
Install the CUDA Toolkit for WSL
In the Ubuntu console (which is opened automatically after Ubuntu is installed), enter or paste the following commands to install the CUDA Toolkit for WSL-Ubuntu. It will ask you to enter your password (created in the previous step) and may take several minutes to complete.
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit
Do not install any NVIDIA graphics drivers on Ubuntu directly (i.e., install cuda-toolkit and not cuda or cuda-drivers). Ubuntu will inherit the Windows drivers you installed in a previous step via WSL.
If you get timeout errors when trying to install things on WSL, check to make sure that you are not connected to a VPN on Windows as this can mess things up.
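Before moving on, it may be worth confirming that both the driver and the toolkit are visible from Ubuntu. The following is a minimal sketch, not part of the official instructions. It assumes that the toolkit was installed to /usr/local/cuda (the location used later in this post) and that WSL exposes the Windows driver's nvidia-smi on the PATH; check_tool is a hypothetical helper name.

```sh
# Report whether a command can be found (by name on the PATH, or by full path).
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

check_tool nvidia-smi                # driver utility inherited from Windows via WSL
check_tool /usr/local/cuda/bin/nvcc  # CUDA compiler (assumed default location)

# If the driver utility is available, print the card name and driver version.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader ||
    echo "nvidia-smi is installed but could not query the GPU"
fi
```

If either line reports "missing", revisiting the driver or toolkit step is probably a better use of time than continuing to the R installation.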
Install R on Ubuntu
In the Ubuntu console, enter or paste the following commands to install R and other packages commonly used by R. You may have to hit ENTER and type Y several times when prompted to do so.
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
sudo apt install -y --no-install-recommends r-base
sudo apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev libudunits2-dev libgdal-dev cargo libfontconfig1-dev libcairo2-dev
sudo add-apt-repository ppa:c2d4u.team/c2d4u4.0+
sudo apt upgrade
sudo apt install -y --no-install-recommends r-cran-devtools r-cran-av r-cran-tidyverse
Install audio.whisper with CUDA Support on Ubuntu
In the Ubuntu console, type or paste the following command to open the R console:
sudo R
You will then need to set several environment variables before installing the audio.whisper package from GitHub. Do so by entering or pasting the following commands into the R console:
Sys.setenv(PATH = sprintf("%s:/usr/local/cuda/bin", Sys.getenv("PATH")))
Sys.setenv(CUDA_PATH = "/usr/local/cuda")
Sys.setenv(WHISPER_CUBLAS = "1")
remotes::install_github("bnosac/audio.whisper")
Test and Time the Model
Base Model and Short Audio Clip
In the R console, load the audio.whisper package and try it out on the JFK clip that took so long to process in the previous blog post. Note that there is one important change to the commands from before: this time, when we load the model using the whisper() function, we add the use_gpu = TRUE argument.
# Load the package from library
library(audio.whisper)
# Download or load from file the desired model (with GPU support)
model <- whisper("base", use_gpu = TRUE)
# Construct file path to example audio file in package data
jfk <- system.file(package = "audio.whisper", "samples", "jfk.wav")
# Run English transcription using the downloaded whisper model
out <- predict(model, newdata = jfk, language = "en")
# Print transcript
out$data
The results look good (the same as before), but check out the timing!
out$timing
## $transcription_start
## [1] "2024-08-26 11:37:06 CDT"
##
## $transcription_end
## [1] "2024-08-26 11:37:07 CDT"
##
## $transcription_duration
## Time difference of 0.004918555 mins
Large Model and Long Audio Clip
With such a boost in speed, we can afford to try a larger model (e.g., "large-v3") on a longer audio clip (e.g., a 1.35 min poetry reading by Gerard Malanga that is rather noisy and therefore a good test of the model’s accuracy in real-world settings). This is also a chance to show how to process a file downloaded from the internet, in case that is of interest to any readers. We’ll use the following commands in the R console in Ubuntu:
# Load package from library (it was installed earlier via apt)
library(av)
# Download audio file from ubu.com
download.file(
url = "https://ubu.com/media/sound/malanga_gerard/archives/Malanga-Gerard_Archives_01-Gerard-Malanga_To-The-Young-Model-Name-Unknown.mp3",
destfile = "malanga.mp3",
mode = "wb"
)
# Convert audio from mp3 to 16 kHz wav
av_audio_convert(
"malanga.mp3",
output = "malanga.wav",
format = "wav",
sample_rate = 16000
)
# Download or load from file the large model with GPU support
model <- whisper("large-v3", use_gpu = TRUE)
# Run English transcription on the converted wav file created above
out <- predict(model, newdata = "malanga.wav", language = "en")
# Print the transcript
out$data
segment | segment_offset | from | to | text |
---|---|---|---|---|
1 | 0 | 00:00:00.000 | 00:00:06.000 | Gary Malanga will read me some poems which are shortly to be illustrated by Andy Warhol. |
2 | 0 | 00:00:06.000 | 00:00:10.000 | The poems, as far as I can tell, do not relate particularly to this exhibit, |
3 | 0 | 00:00:10.000 | 00:00:13.000 | but Mr. Malanga thought it would be an appropriate setting for his poetry. |
4 | 0 | 00:00:13.000 | 00:00:17.000 | The length of the reading will be about 45 minutes. |
5 | 0 | 00:00:17.000 | 00:00:19.000 | Thank you so much for your patience, Mr. Malanga. |
6 | 0 | 00:00:19.000 | 00:00:26.000 | Can everyone hear me? |
7 | 0 | 00:00:26.000 | 00:00:33.000 | This poem is actually the first poem I ever wrote in this series of fashion poems, |
8 | 0 | 00:00:33.000 | 00:00:38.000 | entitled "To a Young Model Name Unknown," photographed by Francesco Scuvullo. |
9 | 0 | 00:00:38.000 | 00:00:47.000 | The Peckin Peck girl applauds the strategy of Hadley Kashmir, |
10 | 0 | 00:00:47.000 | 00:00:52.000 | now gentle country air left, to go calling in the afternoon, |
11 | 0 | 00:00:52.000 | 00:00:56.000 | pale gray, flannel-dressed, gracefully princessed, |
12 | 0 | 00:00:56.000 | 00:01:03.000 | its gray collar deeply cut, filled with a fluff of gray rabbit fur. |
13 | 0 | 00:01:03.000 | 00:01:07.000 | The new country look of the jumpsuit opposite. |
14 | 0 | 00:01:07.000 | 00:01:14.000 | Here, fresh, bright and Irish in white, stitched sheer navy blue wool, |
15 | 0 | 00:01:14.000 | 00:01:20.000 | Irish country airs, the changing outline of Irish fashion. |
16 | 0 | 00:01:20.000 | 00:01:21.780 | Thank you. |
The results look really good despite the background noise. I noticed only two possible errors: in line 11, I think he says “princess shaped” rather than “princessed” (though I could be wrong), and in line 16, I can’t hear him say “Thank you,” so that may have been hallucinated (or perhaps it was said in the background). Not bad at all. And check out the timing.
out$timing
## $transcription_start
## [1] "2024-08-16 13:33:07 CDT"
##
## $transcription_end
## [1] "2024-08-16 13:33:19 CDT"
##
## $transcription_duration
## Time difference of 0.1988164 mins
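To put the two timings in perspective, here is a small sketch that converts them into a real-time factor (seconds of audio transcribed per second of processing). It uses only numbers reported in this post: the final `to` timestamp of each transcript as the clip length and the transcription_duration values in minutes. The helper names hms_to_secs and realtime_factor are hypothetical, and the commands can be run in any Ubuntu console outside of R.

```sh
# Convert an "HH:MM:SS.mmm" timestamp (as in the transcript tables) to seconds.
hms_to_secs() {
  echo "$1" | LC_ALL=C awk -F: '{ printf "%.2f\n", $1 * 3600 + $2 * 60 + $3 }'
}

# Real-time factor: seconds of audio transcribed per second of processing.
# $1 = audio length in seconds, $2 = transcription_duration in minutes.
realtime_factor() {
  LC_ALL=C awk -v audio="$1" -v mins="$2" \
    'BEGIN { printf "%.1f\n", audio / (mins * 60) }'
}

jfk_secs=$(hms_to_secs "00:00:10.600")       # end of the JFK transcript
malanga_secs=$(hms_to_secs "00:01:21.780")   # end of the Malanga transcript

echo "base model, JFK clip: $(realtime_factor "$jfk_secs" 0.004918555)x real time"
echo "large-v3, Malanga clip: $(realtime_factor "$malanga_secs" 0.1988164)x real time"
```

With the timings above, this works out to roughly 36 times faster than real time for the base model and about 7 times faster than real time for the large-v3 model on my RTX 2060.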
Wrap-up
If you want to save the transcript, you can enter the following command in the R console:
saveRDS(out, "malanga.rds")
This will create a serialized R data file containing all the transcript data (e.g., text and time stamps). By default, this file will be saved in the same folder on your Windows file system that you ran the Command Prompt app from (e.g., “C:/Users/jeffg”). However, you can save anywhere using WSL’s /mnt/ system. For example, if you wanted to save it to “C:/Users/jeffg/Desktop”, then you would use "/mnt/c/users/jeffg/Desktop/malanga.rds" as the second argument to saveRDS(). Or if you wanted to save it to a mapped network drive like “Z:/affcomlab/transcription”, then you would use "/mnt/z/affcomlab/transcription/malanga.rds" (depending on your setup, a network drive may first need to be mounted inside WSL).
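The /mnt/ mapping follows a simple pattern: the drive letter becomes a lowercase folder under /mnt/ and the rest of the path stays the same. The sketch below illustrates that pattern with a hypothetical helper, win_to_wsl_path, written in plain shell so that it can be run in any Ubuntu console. It is only an illustration of the rule; WSL also ships its own wslpath utility for this purpose.

```sh
# Translate a Windows-style path (e.g., C:/Users/...) into its WSL /mnt/ form.
win_to_wsl_path() {
  # Drive letter, lowercased (e.g., "C" becomes "c")
  drive=$(printf '%s' "$1" | cut -c1 | tr '[:upper:]' '[:lower:]')
  # Everything after "C:", with any backslashes turned into forward slashes
  rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
  printf '/mnt/%s%s\n' "$drive" "$rest"
}

win_to_wsl_path 'C:/Users/jeffg/Desktop/malanga.rds'
win_to_wsl_path 'Z:\affcomlab\transcription\malanga.rds'
```

The first call prints /mnt/c/Users/jeffg/Desktop/malanga.rds and the second prints /mnt/z/affcomlab/transcription/malanga.rds, either of which could then be passed as the second argument to saveRDS().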
That wraps up this blog post. In the next part, I will discuss more practical aspects of using this technology. For example, I’ll talk about how to generate a list of audio/video files on your hard drive (or elsewhere) and then iterate over them to create transcripts from many files all at once.
Part 3 coming soon…