AI Transcription from R using Whisper: Part 1

teaching
audio
AI
Tutorial on Using AI Transcription
Author

Jeffrey Girard

Published

August 14, 2024

Introduction

In much of my work, I study how people communicate through verbal and nonverbal behavior. To study verbal behavior, it is often necessary to generate transcripts, which are written records of the words that were spoken. Transcription can be done manually (i.e., by a person), often assisted by behavioral annotation software like ELAN or ANVIL or by subtitle generation and editing software like Aegisub or Subtitld. However, new tools based on artificial intelligence (AI) can be much more efficient and scalable, albeit at some cost to accuracy.

In this blog post, I will provide a tutorial on how to set up and use OpenAI’s free Whisper model to generate automatic transcriptions of audio files (either recorded originally as audio or extracted from video files). I will first show you how to quickly install the audio.whisper R package and transcribe an example file. However, the processing will be very slow and we can do much, much better if we offload some of the work to a dedicated graphics card, such as an Nvidia card with CUDA. Enabling this takes some technical work, especially on Windows, but is worth the investment if you plan to process a lot of files. This technical work will be described in Part 2.

Note

Although the Whisper model comes from OpenAI, the approach described here will actually run it locally, which means your audio files will not need to be sent to any third parties. This makes it usable for private and sensitive (e.g., patient) data!

Quickstart (easy setup, slow processing)

Install dependencies

I assume you already have R (and probably an IDE like RStudio) installed. Open R or RStudio and install the development version of the audio.whisper package from GitHub.

# Install remotes if you don't have it already
# install.packages("remotes") 

# Install audio.whisper from github
remotes::install_github("bnosac/audio.whisper")
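
If you want to confirm the installation worked before moving on, you can check that the package is available. The version check below is just an optional convenience step, not something the package requires.

# Optional: confirm audio.whisper installed and check its version
packageVersion("audio.whisper")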

Download whisper model

Load this new package and download one of the whisper models: "tiny", "base", "small", "medium", or "large-v3". Earlier entries on that list are smaller (to download and hold in RAM), faster, and less accurate whereas later entries are larger, slower, and more accurate. There are also English-only versions of all but the large model, which end in ".en" as in "base.en", and these may be more efficient if you know that all speech will be in English. You can learn more about these models via ?whisper_download_model. For this tutorial, we will go with the "base" model.

# Load package from library
library(audio.whisper)
# Download or load from file the desired whisper model
model <- whisper("base")
## whisper_init_from_file_with_params_no_state: loading model from 'C:/GitHub/affcomlab/posts/whisper2024/ggml-base.bin'
## whisper_model_load: loading model
## whisper_model_load: n_vocab       = 51865
## whisper_model_load: n_audio_ctx   = 1500
## whisper_model_load: n_audio_state = 512
## whisper_model_load: n_audio_head  = 8
## whisper_model_load: n_audio_layer = 6
## whisper_model_load: n_text_ctx    = 448
## whisper_model_load: n_text_state  = 512
## whisper_model_load: n_text_head   = 8
## whisper_model_load: n_text_layer  = 6
## whisper_model_load: n_mels        = 80
## whisper_model_load: ftype         = 1
## whisper_model_load: qntvr         = 0
## whisper_model_load: type          = 2 (base)
## whisper_model_load: adding 1608 extra tokens
## whisper_model_load: n_langs       = 99
## whisper_model_load:      CPU buffer size =   147.46 MB
## whisper_model_load: model size    =  147.37 MB
## whisper_init_state: kv self size  =   16.52 MB
## whisper_init_state: kv cross size =   18.43 MB
## whisper_init_state: compute buffer (conv)   =   14.86 MB
## whisper_init_state: compute buffer (encode) =   85.99 MB
## whisper_init_state: compute buffer (cross)  =    4.78 MB
## whisper_init_state: compute buffer (decode) =   96.48 MB
Note

Note that the larger models may take a while to download, so if you get an error that the download took longer than permitted, you can temporarily allow more time via: options(timeout = 300).
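
As an aside, if you know all of your recordings are in English, you could download one of the English-only models mentioned above instead. A minimal sketch (an optional alternative to the "base" model used in this tutorial; the model_en name is just for illustration):

# Download or load the English-only base model instead
model_en <- whisper("base.en")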

Transcribe example file

The package comes with an example audio file in the proper format, which contains 11 seconds of a speech by John F. Kennedy. Let’s construct its file path using system.file() and then transcribe it using predict().

# Construct file path to example audio file in package data
jfk <- system.file(package = "audio.whisper", "samples", "jfk.wav")

# Run English transcription using the downloaded whisper model
out <- predict(model, newdata = jfk, language = "en")

# Print transcript
out$data
segment segment_offset from to text
1 0 00:00:00.000 00:00:11.000 And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

The results look good! But we can see how long this took by digging into the output object.

# Examine the time elapsed to process this audio
out$timing
## $transcription_start
## [1] "2024-08-15 12:30:48 CDT"
## 
## $transcription_end
## [1] "2024-08-15 12:51:48 CDT"
## 
## $transcription_duration
## Time difference of 20.98786 mins
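
To put that duration in perspective, you can divide the processing time by the length of the audio to get a rough real-time factor. The sketch below assumes the 11-second JFK clip and the timing shown above (about 21 minutes), which works out to roughly 115 times slower than real time:

# Processing time (in seconds) divided by audio duration (11 seconds)
as.numeric(out$timing$transcription_duration, units = "secs") / 11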

Yikes, 21 minutes to process just 11 seconds of audio. That’s motivation to work on the CUDA version to speed things up. But before we move on to that, I’ll first show you how to extract audio from a video file and convert it to the format that Whisper wants.

Extract and format audio

Download the example mlk.mp4 video file, which contains 12 seconds of a speech by Martin Luther King, Jr. This video contains an audio stream in AAC format with a sampling rate of 44.1 kHz. However, whisper requires audio files in WAV format with a sampling rate of 16 kHz. We can extract and convert it in one step using the av_audio_convert() function from the av package.

# Install av package if you don't have it already
# install.packages("av")

# Load package from library
library(av)

# Extract and convert audio
av_audio_convert(
  "mlk.mp4", 
  output = "mlk.wav", 
  format = "wav", 
  sample_rate = 16000
)
## [1] "C:\\GitHub\\affcomlab\\posts\\whisper2024\\mlk.wav"

Note that the process would have been identical if the input had been an audio file in a different format (e.g., .mp3) rather than a video file; you would just pass that file to av_audio_convert() instead of the .mp4. Now let’s transcribe the result and verify that our conversion worked.

# Run English transcription using the downloaded whisper model
out2 <- predict(model, newdata = "mlk.wav", language = "en")

# Print transcript
out2$data
segment segment_offset from to text
1 0 00:00:00.000 00:00:02.000 I have a dream.
2 0 00:00:02.000 00:00:12.000 But one day, this nation will rise up, live up the true meaning of its creed.

Not perfect (it swapped “that” for “but” and omitted an “and”), but pretty good. And this is only the base model; it might do better with a larger model, but for time’s sake I’ll leave that until after we get CUDA working in Part 2.
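
Finally, before moving on to Part 2, if you want to hold on to these transcripts, the data frame returned by predict() can be written to disk like any other. A minimal sketch using base R (the file name here is just an illustration):

# Save the transcript segments to a CSV file for later use
write.csv(out2$data, "mlk_transcript.csv", row.names = FALSE)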