AI Transcription from R using Whisper: Part 3

teaching

audio

Easing the WSL2 Setup with Docker

Author

Jeffrey Girard

Published

November 10, 2024

Note

You can view this blog post as a replacement for the previous one (unless you are interested in all the nitty-gritty details).

Introduction

In a previous blog post on AI Transcription, I discussed the performance benefits of moving to WSL2 for CUDA support on Windows. However, the process of setting up and using the virtual machine was very technical, which may be a barrier for many users. In this blog post, I will discuss the use of a technology called Docker to ease the setup and use of WSL2.

Docker is a tool that makes it easy to package and run applications in a “container,” which is a lightweight, standalone environment that includes everything an application needs to run. This way, the application will work the same way on any computer without worrying about setup or compatibility issues.

There are three main components to Docker: a Dockerfile, a Docker image, and a Docker container. To explain these, I will use a pizza analogy. Just like a chef might create a recipe that contains step-by-step instructions for preparing a pizza, a power user might create a Dockerfile that contains detailed and reproducible instructions for building and configuring a software environment. This recipe might then be followed in a food processing plant to produce many identical pizzas, which are then frozen for re-sale. Similarly, the Dockerfile instructions might be followed to “build” a Docker image, which contains all the resulting files and settings. Finally, a customer might purchase a frozen pizza, bring it home, and reheat it in their own oven for consumption. In much the same way, a user might download or “pull” a Docker image from the internet and “run” it as a Docker container on their own computer to use the desired application. The benefit of this approach is that the customer/user doesn’t need to understand or even know about the recipe/Dockerfile instructions to enjoy the pizza/container!

I have played the role of chef or power user here and condensed the instructions from my previous blog post into a Dockerfile and built it into a Docker image that you can easily pull and run.

Before we dive into things, I’ll provide a brief overview of all the steps.

Check that our computer’s hardware supports CUDA
Install/update the NVIDIA graphics driver on Windows
Install Docker Desktop on Windows
Pull the preconfigured Docker image
Run the Docker image with GPU support
Log into RStudio Server and use audio.whisper

Check for CUDA Support

This post assumes that you are using the Windows operating system and that your computer’s graphics card supports CUDA. To check that this is the case, first look up your graphics card’s model number. An easy way to do this on Windows 10/11 is to click on the desktop search bar (bottom-left of the screen next to the windows icon) and type in “Device Manager.” Then click the arrow next to “Display adapters” and find your graphics card’s model name. On my computer, it says “NVIDIA GeForce RTX 2060.” Then go to this link and click the “CUDA-Enabled NVIDIA Quadro and NVIDIA RTX” and “CUDA-Enabled GeForce and TITAN Products” blocks to open their accordions. Then search for your graphics card’s model number (the left tables are for desktop cards and the right tables are for notebook cards). I found “GeForce RTX 2060” on the list under GeForce and TITAN Products with a compute capability of 7.5. Thus, my card is supported!

Install the Latest NVIDIA Graphics Driver

Download and install the latest graphics driver for your card from NVIDIA. The previous step described how to find your graphics card model name, which you’ll need to navigate to. You should choose the Game Ready version. Note that you should not install the CUDA toolkit on Windows as doing so may confuse things and lead to issues later on (as Docker will install the CUDA toolkit for WSL).

Install the Latest Version of Docker Desktop

Download and install the latest version of Docker Desktop for Windows. If you are unsure of whether you have AMD64 or ARM64, open a Command Prompt window and enter echo %PROCESSOR_ARCHITECTURE%. If the installer asks you whether you want to install or use WSL2 integration, select Yes.

Pull the wsl-cuda-whisper Docker image

Open the Docker Desktop application and click the Terminal button on the bottom. If it asks you to confirm/enable this, click yes.

In the Docker terminal, enter docker pull jmgirard/wsl-whisper.

This will take some time and disk space to download. If you are concerned at all about security, you can see the Dockerfile instructions here; it just installs rocker/tidyverse (i.e., Ubuntu, R, RStudio Server, and the tidyverse R packages), and then ffmpeg, CUDA Toolkit, and the audio.* R packages.

Run the wsl-cuda-whisper Docker image

Once the download is complete, you can run the image to access RStudio with audio.whisper and CUDA support. To run an image, we can use a command like docker run [options] image-name. We will use jmgirard/wsl-whisper as our image name, but we need to learn several options to get the most out of this.

--gpus all tells Docker to grant the container access to our NVIDIA graphics card, which is necessary for us to make use of CUDA.
-p 8787:8787 tells Docker to host the container’s RStudio Server on network port 8787, which will let us access it from a browser on Windows.
-e PASSWORD=[password] tells Docker to set the password for the RStudio Server to whatever we replace [password] with (e.g., abc). The username will be rstudio.
-v "[winpath]:/win" will make whatever folder we replace [winpath] with (e.g., C:\Users\jeffg) accessible to the container as /win (or whatever we put at the end).
--rm tells Docker to delete the container after it is closed, which can help save space in the long run
-it tells Docker to run the container “interactively” so that messages from the container will be shown in the Docker terminal, which can be helpful to tell when the RStudio server is ready for use

Putting this all together, we can enter this into the Docker terminal:

docker run --gpus all -p 8787:8787 -e PASSWORD=abc -v "C:\Users\jeffg:/win" --rm -it jmgirard/wsl-whisper

The Docker terminal will show the progress of R installing the audio.whisper package (unfortunately this can’t be done ahead of time). You will know the server is ready for use when the terminal says “TTY detected.”

Open the Container’s RStudio Server from Windows

In Docker Desktop, click the Containers tab on the left and click the “8787:8787” link next to your new container under the Ports column. This will open a web browser on Windows and direct it to http://localhost:8787 (or you could also just paste this into your favorite browser manually or even make a desktop shortcut for it). Enter rstudio as the Username and whatever you set as the Password above (e.g., abc in the example code).

Use RStudio Server to Run Whisper

In the R console, load the audio.whisper package and try it out on the JFK clip that took so long to process in the previous blog post. Note that there will be one important change to the commands from before. This time, when we load the model using the whisper() function, we will add the use_gpu = TRUE argument.

# Load the package from library
library(audio.whisper)

# Download or load from file the desired model (with GPU support)
model <- whisper("base", use_gpu = TRUE)

# Construct file path to example audio file in package data
jfk <- system.file(package = "audio.whisper", "samples", "jfk.wav")

# Run English transcription using the downloaded whisper model
out <- predict(model, newdata = jfk, language = "en")

# Print transcript
out$data

segment	segment_offset	from	to	text
1	0	00:00:00.000	00:00:07.600	And so my fellow Americans, ask not what your country can do for you,
2	0	00:00:07.600	00:00:10.600	ask what you can do for your country.

The results look good/the same as before, but check out the timing!

Instead of 20.988 minutes, it took 0.007 minutes…

out$timing
## $transcription_start
## [1] "2024-11-10 12:34:19 CST"
## 
## $transcription_end
## [1] "2024-11-10 12:34:20 CST"
## 
## $transcription_duration
## Time difference of 0.006911568 mins

If we wanted to save the transcript back to Windows, we could use the following approach. To save the entire list object generated by predict() for later use in R, we could use saveRDS(out, file = "/win/jfk.rds"). Or, to save just the transcript for human consumption, we could use write.csv(out$data, file = "jfk.csv").

Wrap-up

That wraps up this blog post. In the next part, I will discuss more practical aspects of using this technology. For example, I’ll talk about how to generate a list of audio/videos files on your hard drive (or elsewhere) and then iterate over them to create transcripts from many files all at once.

Part 4 coming soon…