
Automatic Speech Recognition (ASR) and Speech Translation Models for Batch Video Transcription and Machine Translation

Master the use of compact, open-weight transformer models tailored for specific tasks to efficiently transcribe and translate speech from video content.

17 min read · Apr 16, 2025


Image generated with Amazon Nova Canvas Foundation Model

In addition to rich visuals, video content features a diverse array of audio elements — ambient sounds, Foley effects, sound effects (SFX), music, dialogue, narration, singing, environmental noise, and even purposeful moments of silence. Collectively, these sounds shape the narrative, reinforce the emotional tone, and immerse the audience in a multidimensional sensory experience that is essential for storytelling and viewer engagement.

Example of multiple audio components that help with storytelling and enrich the sensory experience

Introduction

This post will explore using smaller task-specific open-weight models, such as OpenAI’s Whisper transformer model, to transcribe and translate speech from video collections. We will then translate the transcriptions into multiple languages using Facebook’s NLLB-200 machine translation model. Once transcribed, this data can be merged with the source video for more detailed analysis and understanding using generative AI.

Demonstration of this post

Below is an example of a video from the dataset used for this post and the resulting transcription and translations using the models.

Winning Isn’t For Everyone | Am I A Bad Person? | Nike

English Transcription (OpenAI Whisper Medium):
Am I a bad person? Tell me, am I? I’m single-minded. I’m deceptive. I’m obsessive. I’m selfish. Does that make me a bad person? Am I a bad person? Am I? I have no empathy. I don’t respect you. I’m never satisfied. I have an obsession with power. I’m irrational. I have zero remorse. I have no sense of compassion. I’m delusional. I’m maniacal. You think I’m a bad person? Tell me. Tell me. Tell me. Tell me. Am I? I think I’m better than everyone else. I want to take what’s yours and never give it back. What’s mine is mine, and what’s yours is mine. Am I a bad person? Tell me, am I? Does that make me a bad person? Tell me, does it?

German Translation (Facebook NLLB-200 distilled 1.3B variant):
Bin ich ein schlechter Mensch? Sag mir, bin ich? Ich bin einseitig. Ich bin betrügerisch. Ich bin besessen. Ich bin egoistisch. Macht mich das zu einem schlechten Menschen? Bin ich ein schlechter Mensch? Bin ich das? Ich habe kein Mitgefühl. Ich respektiere dich nicht. Ich bin nie zufrieden. Ich habe eine Besessenheit von Macht. Ich bin irrational. Ich habe keine Reue. Ich habe keinen Mitgefühl. Ich bin wahnhaft. Ich bin verrückt. Glaubst du, ich sei ein schlechter Mensch? Sag mir. Sag mir. Sag mir. Sag mir. Sag mir. Sag mir. Bin ich? Ich denke, ich bin besser als alle anderen. Ich will nehmen, was mir gehört und es nie zurückgeben. Was mir gehört, ist mir, und was mir gehört, ist dir. Ich bin ein schlechter Mensch? Sag mir, ich? Das macht mich zu einem schlechten Menschen, sag es mir?

German translation of English transcription

Simplified Chinese (Facebook NLLB-200 distilled 1.3B variant):
我是一个坏人吗?告诉我,我是吗?我是单心的.我是欺骗的.我是痴迷的.我是自私的.这是否让我成为一个坏人?我是一个坏人吗?我是吗?我没有同情心.我不尊重你.我从来不满意.我对权力有着痴迷.我是不理性.我没有悔恨.我没有同情感.我妄想.我是疯狂的.你认为我是个坏人吗?告诉我.告诉我.告诉我.告诉我.告诉我.我是吗?我认为我比其他人好.我想拿走属于你的东西,永远不要把它还给.属于我的东西是,属于我的东西是你吗?告诉我?这是否让我成为一个坏人?告诉我?

Chinese translation of English transcription

Multimodal Misunderstanding

A common misunderstanding is that multimodal models can simultaneously process information from various sources, such as text, images, audio, and video. In reality, most multimodal models that understand video cannot comprehend the accompanying audio, let alone its individual elements. Multimodal models are most often Vision Language Models (VLMs): generative models that take image and text inputs and generate text outputs. Even leading multimodal models, such as Anthropic Claude 3.7 Sonnet and the Amazon Bedrock Nova Lite and Pro models, are limited by their current training to processing and understanding video content solely based on its visual information. These models cannot comprehend the audio components of a video.

Spatial and Temporal Reasoning

An additional feature lacking in most multimodal models is spatial and temporal reasoning, which is the ability to understand and process visual and audio information in terms of both its location (spatial) and its changes over time (temporal).

Audio-Visual LLM

Even current so-called Any-to-Any Models, which demonstrate strong performance across most modalities, lack audio-visual understanding and spatial and temporal reasoning. Audio-video understanding requires what has been referred to as an Audio-Visual LLM, a multimodal large language model that takes both visual and auditory inputs for holistic video understanding.

Audio-visual understanding example with Qwen2.5-Max on chat.qwen.ai

For more information on this model type, see the paper, Audio-Visual LLM for Video Understanding, by Fangxun Shu, Lei Zhang, Hao Jiang, and Cihang Xie. These models include VideoLLaMA and VideoLLaMA 2 from the Language Technology Lab at Alibaba DAMO Academy and Qwen2.5-Omni, part of the Qwen model family built by Alibaba Cloud.

Illustration courtesy Video-LLaMA Paper
Illustration courtesy Qwen2.5-Omni Technical Report

Types of Audio Understanding

Multiple machine learning and deep learning techniques exist to comprehend a video’s audio elements, including:

  • Automatic speech recognition (ASR), also known as machine transcription or speech-to-text (STT): converts a speech signal to text; often used for virtual assistants like Siri and Alexa
  • Machine translation (MT): converts text from one language to another
  • Text-to-speech (TTS): generates natural-sounding speech from text input
  • Speech-to-speech translation (STST or S2ST): translates speech in one language into speech in another language (ASR + MT + TTS)
  • Audio classification: assigns a label or class to a given audio clip, for use cases including speaker identification, emotion recognition, language identification, and command recognition or keyword spotting (see the example below)
  • Music genre classification: a subcategory of audio classification specific to music
Illustration courtesy Advanced Engineering Informatics: Conversational artificial intelligence in the AEC industry
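
As a quick illustration of the audio classification task listed above, here is a minimal sketch using Hugging Face's audio-classification pipeline. The checkpoint (MIT/ast-finetuned-audioset-10-10-0.4593) and the local audio file name are assumptions made purely for demonstration; they are not part of this post's project code.

# Minimal sketch: audio classification with the Hugging Face pipeline API.
# The checkpoint and audio file below are illustrative assumptions, not part of this post's project.
from transformers import pipeline

audio_classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",  # assumed AudioSet-tagging checkpoint
)

# Returns the top predicted labels and confidence scores for the clip
predictions = audio_classifier("sample_clip.wav", top_k=5)
for prediction in predictions:
    print(f"{prediction['label']}: {prediction['score']:.3f}")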

Speech Recognition and Translation Models

This post will use the encoder-decoder (sequence-to-sequence) OpenAI Whisper family of transformer models. Whisper is a state-of-the-art (SoTA) model for multilingual automatic speech recognition (ASR) and any-to-English speech translation.

According to Hugging Face, Whisper performs speech transcription by default, where the target text (transcription) is in the same language as the source audio. Whisper predicts the language of the source audio automatically; if the source language is known a priori, it can be passed as an argument. To perform any-to-English speech translation, set the task to translate.

Illustration courtesy OpenAI Whisper GitHub Repository
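
As a concrete sketch of these two tasks, the snippet below loads a single Whisper checkpoint with the Hugging Face ASR pipeline and runs it once with the task set to transcribe and once with the task set to translate. The audio file name is a placeholder; the pipeline usage mirrors the full script shown later in this post.

# Minimal sketch: one Whisper checkpoint performing transcription and any-to-English translation.
# The audio file name is a placeholder for illustration.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium")

# Transcription: the target text stays in the (auto-detected) source audio language
transcription = asr("commercial_audio.mp3", generate_kwargs={"task": "transcribe"})

# Translation: speech in any supported language is translated into English text
translation = asr("commercial_audio.mp3", generate_kwargs={"task": "translate"})

# If the source language is known a priori, pass it explicitly
french_transcription = asr(
    "commercial_audio.mp3",
    generate_kwargs={"task": "transcribe", "language": "french"},
)

print(transcription["text"])
print(translation["text"])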

I would suggest testing any or all of the following Whisper-based models to compare and contrast the quality of the resulting transcriptions:

  • openai/whisper-tiny
  • openai/whisper-small
  • openai/whisper-medium
  • openai/whisper-large-v3-turbo
  • openai/whisper-large-v3
  • distil-whisper/distil-large-v3.5

Machine Translation

Using the Whisper-based models, we can generate transcriptions in the same language as the source audio or translate the speech directly into English. To translate the transcriptions into languages other than English or the source language, we can use a model like facebook/nllb-200-distilled-1.3B, Facebook's (now Meta) distilled 1.3B-parameter variant of the NLLB-200 mixture-of-experts (MoE) machine translation model. According to Hugging Face, NLLB-200 allows for single-sentence translation between 200 languages.

The Flores-200 dataset is recommended for the evaluation of NLLB-200. The FLORES+ evaluation benchmark for multilingual machine translation repository on Hugging Face contains a list of all 200 language codes, which we will use to indicate the languages into which we want to translate the transcriptions, such as French (fra_Latn), Spanish (spa_Latn), or Hindi (hin_Deva).
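
To make the language-code usage concrete, here is a minimal sketch of a single-sentence translation with the distilled NLLB-200 model, targeting French (fra_Latn). It mirrors the Translator class shown later in this post, minus the device placement and attention options.

# Minimal sketch: single-sentence translation with NLLB-200, targeting French (fra_Latn).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-distilled-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Am I a bad person?", return_tensors="pt")
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),  # FLORES-200 language code
    max_length=100,
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])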

On my system, the facebook/nllb-200-distilled-1.3B model takes 5.5 GB of space and consumes approximately 7.1 GB of the available dedicated 16 GB of GPU memory.

Dedicated GPU memory usage while performing inference with the NLLB-200 model

Video Dataset

For this post, I assembled a dataset of 28 popular television commercials, all in MP4 format, ranging in length from 25 to 111 seconds and 1.8 to 41.7 MB in size. Most videos are in English, with a few in Hindi, French, and Spanish to test the model’s transcription and any-to-English translation capabilities. There are many ways to obtain videos to build a dataset for testing purposes; videos are not included with the source code.

Dataset of television commercials used in post

Model Hosting

Optionally, I downloaded the models mentioned in this post in advance using the huggingface-cli. The huggingface_hub Python package includes a built-in CLI, huggingface-cli, which lets you interact with the Hugging Face Hub directly from a terminal. Each model is multiple GBs in size and can take several minutes to download, depending on your Internet connection. A free Hugging Face account and User Access Token are required for access. If you do not download the models in advance, they will be downloaded automatically into the local cache the first time the application loads them.

python -m pip install "huggingface_hub[cli]" --upgrade

huggingface-cli login --token <your_hf_token> --add-to-git-credential

huggingface-cli download distil-whisper/distil-large-v3.5
huggingface-cli download openai/whisper-tiny
huggingface-cli download openai/whisper-small
huggingface-cli download openai/whisper-medium
huggingface-cli download openai/whisper-large-v3-turbo
huggingface-cli download openai/whisper-large-v3

huggingface-cli download facebook/nllb-200-distilled-1.3B
Downloading models can take several minutes depending on your Internet bandwidth
All the models used in this post took a total of 51.3 GB on my system
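
If you prefer to pre-download the models from Python instead of the CLI, the huggingface_hub library's snapshot_download function is an equivalent option; the sketch below downloads the same set of models into the local cache.

# Alternative to huggingface-cli: pre-download model snapshots into the local Hugging Face cache.
from huggingface_hub import snapshot_download

model_ids = [
    "distil-whisper/distil-large-v3.5",
    "openai/whisper-tiny",
    "openai/whisper-small",
    "openai/whisper-medium",
    "openai/whisper-large-v3-turbo",
    "openai/whisper-large-v3",
    "facebook/nllb-200-distilled-1.3B",
]

for model_id in model_ids:
    # Skips files that are already present in the local cache
    snapshot_download(repo_id=model_id)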

Inference

Hardware

Many options exist for hosting open-weight models for inference, both locally and in the Cloud. For this post, I am hosting the models locally on an Intel Core i9 Windows 11 system with an NVIDIA GPU that has 16 GB of dedicated GDDR6X memory (VRAM) and fourth-generation Tensor Cores.

Software

The code for this post was tested on Python 3.12.9, PyTorch (2.6.0+cu126), CUDA (12.6), and Flash Attention 2. According to NVIDIA, CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model that NVIDIA invented. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). FlashAttention is an algorithm that speeds up attention and reduces its memory footprint without approximation. FlashAttention-2 aims to achieve even faster attention with better parallelism and work partitioning.

Use the check_gpu_config script to check for several important GPU-related features
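
The repository's check_gpu_config script is the reference here; the hedged sketch below shows the kind of checks it performs, using only standard PyTorch and Transformers utilities.

# Sketch of a GPU configuration check, similar in spirit to the repository's check_gpu_config script.
import torch
from transformers.utils import is_flash_attn_2_available

print(f"PyTorch version:   {torch.__version__}")
print(f"CUDA available:    {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version:      {torch.version.cuda}")
    print(f"GPU device:        {torch.cuda.get_device_name(0)}")
    total_vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Total VRAM:        {total_vram_gb:.1f} GB")
print(f"Flash Attention 2: {is_flash_attn_2_available()}")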

Source Code

The complete source code for this project is open source and freely available on GitHub.

Prerequisites

To follow along with this post, you must install FFmpeg. The latest installation packages are available from the FFmpeg website or on GitHub.

FFmpeg must be installed to run examples in this post

Python Virtual Environment Setup

Next, we will set up a fresh Python virtual environment and install the required packages for this post. The included requirements.txt file contains the following dependencies:

--extra-index-url https://download.pytorch.org/whl/cu126
torch==2.6.0
torchvision==0.21.0
torchaudio==2.6.0

transformers
accelerate
triton-windows
ffmpeg-python
datasets
soundfile
librosa

Create the Python virtual environment and install the required packages for this post:

python -m venv .venv
.venv\Scripts\activate

python -m pip install pip --upgrade
python -m pip install -r requirements.txt --upgrade
python -m pip install flash-attn --no-build-isolation --upgrade
Installing the official implementation of FlashAttention and FlashAttention-2

Examining the Code

The GitHub project contains a single main Python script, asr_demo.py. There are four primary functions in the script. First, setup_environment() creates the local environment with the correct video, audio, and output directories. Second, audio_extractor.extract_audio_from_videos() extracts MP3-format audio files from the MP4-format video files; Hugging Face's AudioFolder dataset builder then creates an audio Dataset from the audio files. Third, the audio dataset is passed to generate_transcriptions(), which creates all the transcriptions. Lastly, the transcriptions are passed to translate.translate_text() for translation.

"""
Video Transcription using ASR (Automatic Speech Recognition) to transcribe audio from video files.
Author: Gary A. Stafford
Date: 2025-04-14
"""

# Standard library imports
import logging
import os
import time
import json
import shutil
import sys

# Third-party library imports
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.utils import is_flash_attn_2_available
from datasets import load_dataset
from transformers.pipelines.pt_utils import KeyDataset

# Local application/library imports
import utilities.audio_extractor as audio_extractor
import utilities.translator as Translator

# Constants
VIDEO_DIRECTORY = "videos"
AUDIO_DIRECTORY = "audio"
OUTPUT_DIRECTORY = "output"

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


def main():
    """
    Main function to set up the environment, extract audio from videos, transcribe audio, translate text, and save results.
    Args:
        None
    Returns:
        None
    """

    # Choose the model to use for transcription from command line (0-5)
    model_choice = (int(sys.argv[1]) if len(sys.argv) > 1 else 3)
    model = get_model_list()[model_choice]  # Change the index to select a different model
    model_id = model["model_id"]
    output_file = os.path.join(
        OUTPUT_DIRECTORY, f"transcriptions_{model['model_name']}.json"
    )

    # Set up the environment
    logging.info("Setting up the environment...")
    setup_environment(output_file)

    # Extract audio from videos in the specified directory
    logging.info("Extracting audio from videos...")
    start = time.time()
    video_count = audio_extractor.extract_audio_from_videos(
        VIDEO_DIRECTORY, AUDIO_DIRECTORY
    )
    if video_count == 0:
        logging.error("No valid video files found in the directory.")
        return
    end = time.time()
    total_extract_audio_time_sec = round(end - start, 2)
    logging.info(f"Extracted audio from {video_count} videos.")
    logging.info(f"Total audio extraction time: {total_extract_audio_time_sec} seconds")

    # Load the audio dataset
    logging.info("Loading audio dataset...")
    # https://huggingface.co/docs/datasets/audio_load#audiofolder
    audio_dataset = load_dataset("audiofolder", data_dir=AUDIO_DIRECTORY, split="train")

    # Transcribe the audio dataset
    logging.info("Transcribing audio dataset...")
    start = time.time()
    results = generate_transcriptions(model_id, audio_dataset)
    end = time.time()
    total_transcription_time_sec = round(end - start, 2)
    logging.info(f"Total transcription time: {total_transcription_time_sec} seconds")

    # Translate the transcriptions
    logging.info(f"Translating transcriptions...")
    start = time.time()
    translate = Translator.Translator()
    for result in results:
        translation_german = translate.translate_text(
            result["text"], language="deu_Latn"
        )
        result["translation_german"] = translation_german

        translation_chinese = translate.translate_text(
            result["text"], language="zho_Hans"
        )
        result["translation_chinese"] = translation_chinese
    end = time.time()
    total_translation_time_sec = round(end - start, 2)
    logging.info(f"Total translation time: {total_translation_time_sec} seconds")

    # Save the results
    logging.info("Saving results...")
    all_results = {}
    all_results["results"] = results
    all_results["stats"] = {
        "model": model_id,
        "total_video": video_count,
        "total_extract_audio_time_sec": total_extract_audio_time_sec,
        "total_transcription_time_sec": total_transcription_time_sec,
        "total_translation_time_sec": total_translation_time_sec,
        "average_total_time_per_video_sec": round(
            (
                total_extract_audio_time_sec
                + total_transcription_time_sec
                + total_translation_time_sec
            )
            / video_count,
            2,
        ),
    }

    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(all_results, f, indent=4, ensure_ascii=False)

    logging.info(f"Results saved to {output_file}")
    logging.info("Done!")


def setup_environment(output_file):
    """
    Sets up the environment by creating necessary directories and deleting existing ones.
    Args:
        output_file (str): Path of the output JSON file.
    Returns:
        None
    """

    if not os.path.exists(VIDEO_DIRECTORY):
        logging.error(f"Video directory {VIDEO_DIRECTORY} does not exist")
        return

    if os.path.exists(AUDIO_DIRECTORY):
        logging.info(f"Deleting existing audio directory {AUDIO_DIRECTORY}...")
        shutil.rmtree(AUDIO_DIRECTORY)
    logging.info(f"Creating audio directory {AUDIO_DIRECTORY}...")
    os.makedirs(AUDIO_DIRECTORY)

    if not os.path.exists(OUTPUT_DIRECTORY):
        logging.info(f"Creating output directory {os.path.dirname(output_file)}...")
        os.makedirs(OUTPUT_DIRECTORY, exist_ok=True)


def get_model_list():
    """
    Returns a list of available models for transcription.
    Returns:
        List[Dict]: A list of dictionaries containing model IDs and names.
    """

    models = [
        {
            "model_id": "openai/whisper-tiny",
            "model_name": "openai-whisper-tiny",
        },
        {
            "model_id": "openai/whisper-small",
            "model_name": "openai-whisper-small",
        },
        {
            "model_id": "openai/whisper-medium",
            "model_name": "openai-whisper-medium",
        },
        {
            "model_id": "openai/whisper-large-v3-turbo",
            "model_name": "openai-whisper-large-v3-turbo",
        },
        {
            "model_id": "openai/whisper-large-v3",
            "model_name": "openai-whisper-large-v3",
        },
        {
            "model_id": "distil-whisper/distil-large-v3.5",
            "model_name": "distil-whisper-distil-large-v3.5",
        },
    ]

    return models


def generate_transcriptions(model_id, audio_dataset):
    """
    Transcribes the audio from a dataset of audio files using a pre-trained model.
    Args:
        model_id (str): The Hugging Face model ID used for transcription.
        audio_dataset (Dataset): A dataset containing audio files to be transcribed.
    Returns:
        List[Dict]: A list of transcription results for each audio file.
    """

    # Load the model and processor
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        attn_implementation=attn_implementation,
    )

    # Optional model generation_config (here or in generate_kwargs below)
    # model.generation_config.forced_decoder_ids = None
    # model.generation_config.input_ids = model.generation_config.forced_decoder_ids
    # model.generation_config.language = "hindi"
    # model.generation_config.task = "transcribe"  # options: translate or transcribe
    logging.debug(model.generation_config)

    model.to(device)

    processor = AutoProcessor.from_pretrained(model_id)

    # Create the pipeline
    asr_pipeline = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30,
        batch_size=16,  # batch size for inference - set based on your device
        torch_dtype=model.dtype,
        device=model.device,
        return_timestamps=True,
        generate_kwargs={
            # "forced_decoder_ids": None,
            # "language": "english",
            # "max_new_tokens": 128
            "task": "transcribe",
        },
    )

    results = []

    # Process the audio files in the dataset
    for idx, out in enumerate(
        asr_pipeline(
            KeyDataset(audio_dataset, "audio"),
        )
    ):
        audio_file = audio_dataset[idx]["audio"]["path"].split("\\")[-1]
        logging.info(f"Transcribing/translating {audio_file}...")
        out["audio_file"] = audio_file
        logging.debug(audio_dataset[idx])
        logging.debug(json.dumps(out, indent=4))
        logging.info(f"Transcription/translation result: {out['text']}")
        results.append(out)
    return results


if __name__ == "__main__":
    main()

The utilities.audio_extractor module, audio_extractor.py, handles the extraction of the audio from the video for transcription. The extract_audio_from_videos() function extracts MP3-format audio files from a directory of MP4-format video files.

"""
Extracts audio from video files using FFmpeg.
Author: Gary A. Stafford
Date: 2025-04-15
"""

import logging
import os
import ffmpeg


# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


def extract_audio_from_videos(video_directory: str, audio_directory: str) -> int:
    """
    Extracts audio from video files in the specified directory and saves them as MP3 files.
    Args:
        video_directory (str): Directory containing the source MP4 video files.
        audio_directory (str): Directory where the extracted MP3 files are written.
    Returns:
        int: The number of videos processed.
    """

    video_count = 0
    for video_file in os.listdir(video_directory):
        video_path = os.path.join(video_directory, video_file)
        if not video_path.lower().endswith((".mp4")):
            logging.warning(f"Skipping {video_file} - not a valid video file")
            continue
        logging.info(f"Extracting audio from {video_file}...")
        video_count += 1
        audio_file = (
            f"{audio_directory}\\{video_path.split('\\')[-1].replace('.mp4', '.mp3')}"
        )
        ffmpeg.input(video_path).output(
            audio_file, ac=1, ar=16_000, loglevel="quiet"
        ).overwrite_output().run()

    return video_count

The audio sampling rate is set to 16,000 Hz (16 kHz), and the number of audio channels is set to 1 (mono). The resulting audio files have a bitrate of 24 kbps.

Details of extracted audio file using ffprobe
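
To verify these settings programmatically, ffmpeg-python's probe function (a thin wrapper around ffprobe) can inspect an extracted file. The file name in the sketch below is a placeholder for one of the extracted MP3 files.

# Sketch: inspect an extracted audio file with ffprobe via ffmpeg-python.
import ffmpeg

probe = ffmpeg.probe("audio/example_commercial.mp3")  # placeholder file name
audio_stream = next(s for s in probe["streams"] if s["codec_type"] == "audio")

print(f"Codec:       {audio_stream['codec_name']}")
print(f"Sample rate: {audio_stream['sample_rate']} Hz")
print(f"Channels:    {audio_stream['channels']}")
print(f"Bit rate:    {int(audio_stream['bit_rate']) // 1000} kbps")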

Audio extraction is performed as a batch operation using FFmpeg. The extracted MP3 audio files are written to a separate directory.

Audio files extracted from videos using FFmpeg

The utilities.translator module, translator.py, contains the Translator class, which handles the transcription translations. Its translate_text() method translates text into the requested language.

"""
Translate text using a pre-trained sequence-to-sequence language model.
https://huggingface.co/facebook/nllb-200-distilled-1.3B
Author: Gary A. Stafford
Date: 2025-04-15
"""

import logging

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.utils import is_flash_attn_2_available

MODEL_ID = "facebook/nllb-200-distilled-1.3B"
TEMPERATURE = 0.2
MAX_NEW_TOKENS = 300
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class Translator:
    def __init__(self):
        self.model = (
            AutoModelForSeq2SeqLM.from_pretrained(
                MODEL_ID,
                torch_dtype=torch.float16,
                attn_implementation=(
                    "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
                ),
            )
            .to(DEVICE)
            .eval()
        )

        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    def translate_text(self, text, language="eng_Latn") -> str:
        """
        Translates the given text to the specified language.
        Args:
            text (str): The text to be translated.
            language (str): The language code to translate the text into.
                Default is "eng_Latn" (English in Latin script).
        Returns:
            str: The translated text.
        """

        logging.info(f"Translating text to: {language}...")

        inputs = self.tokenizer(
            text, return_tensors="pt", padding=True, truncation=True
        ).to(DEVICE)

        translated_tokens = self.model.generate(
            **inputs,
            forced_bos_token_id=self.tokenizer.convert_tokens_to_ids(language),
            max_length=MAX_NEW_TOKENS,
            do_sample=True,
            temperature=TEMPERATURE,
        )
        response = self.tokenizer.batch_decode(
            translated_tokens, skip_special_tokens=True
        )[0]

        return response

Results

Running the asr_demo.py script writes a JSON file to the output directory, similar to the example below. Each video has a complete transcription in the source language (e.g., Spanish), a transcription chunked into timestamped blocks, and the requested translations (e.g., German and Chinese). The file also includes stats on the overall transcription and translation process: the Whisper model used, the video count, and the overall timings.

{
"results": [
{
"text": " Hoy te ves deliciosa. ¿Deliciosa? ¿Por qué tengo la impresión que te gusto solo porque soy un chocolate M&M's? Eso no es cierto. Lo que más me importa de ti es el interior. Mi interior es puro chocolate con leche. Y eso es exactamente lo que me encanta de ti. Y yo pensando que te interesaba por mi cerebro. ¿Que también es de chocolate? Ay, cómo eres guapo.",
"chunks": [
{
"timestamp": [
0.06,
2.16
],
"text": " Hoy te ves deliciosa."
},
{
"timestamp": [
2.44,
3.0
],
"text": " ¿Deliciosa?"
},
{
"timestamp": [
3.36,
7.0
],
"text": " ¿Por qué tengo la impresión que te gusto solo porque soy un chocolate M&M's?"
},
{
"timestamp": [
7.3,
8.26
],
"text": " Eso no es cierto."
},
{
"timestamp": [
9.24,
11.12
],
"text": " Lo que más me importa de ti es el interior."
},
{
"timestamp": [
11.64,
13.88
],
"text": " Mi interior es puro chocolate con leche."
},
{
"timestamp": [
14.5,
16.66
],
"text": " Y eso es exactamente lo que me encanta de ti."
},
{
"timestamp": [
17.38,
20.28
],
"text": " Y yo pensando que te interesaba por mi cerebro."
},
{
"timestamp": [
20.84,
22.38
],
"text": " ¿Que también es de chocolate?"
},
{
"timestamp": [
24.82,
26.14
],
"text": " Ay, cómo eres guapo."
}
],
"audio_file": "William Levy - M&M's commercial in Spanish with English subtitles.mp3",
"translation_german": "Heute siehst du köstlich aus. Köstlich? Warum habe ich den Eindruck, dass du mich magst, nur weil ich ein Schokolademelkchen bin? Das ist nicht wahr. Was mir am meisten am Herzen liegt, ist mein Inneres. Mein Inneres ist reine Schokolade mit Milch. Und genau das liebe ich an dir. Und ich dachte, du interessierst dich für mein Gehirn.",
"translation_chinese": "今天你看起来很好吃. 你好吃吗? 为什么我觉得你喜欢我只是因为我是巧克力M&M?这不是真的. 我最关心的是你的内心. 我的内心是纯巧克力和牛奶."
},
{
"text": " Am I a bad person? Tell me. Am I? I'm single-minded. I'm deceptive. I'm obsessive. I'm selfish. Does that make me a bad person? Am I a bad person? Am I? I have no empathy. I don't respect you. I'm never satisfied. I have an obsession with power. I'm irrational. I have zero remorse. I have no sense of compassion. I'm delusional. I'm maniacal. You think I'm a bad person? Tell me. Tell me. Tell me. Tell me. Am I? I think I'm better than everyone else. I want to take what's yours and never give it back. What's mine is mine, and what's yours is mine. Am I a bad person? Tell me, am I? Does that make me a bad person? Tell me, does it?",
"chunks": [
{
"timestamp": [
0.0,
2.0
],
"text": " Am I a bad person?"
},
{
"timestamp": [
67.0,
67.1
],
"text": " Tell me. Am I? I'm single-minded. I'm deceptive. I'm obsessive. I'm selfish. Does that make me a bad person? Am I a bad person? Am I? I have no empathy. I don't respect you. I'm never satisfied. I have an obsession with power. I'm irrational. I have zero remorse. I have no sense of compassion. I'm delusional. I'm maniacal. You think I'm a bad person? Tell me. Tell me. Tell me. Tell me. Am I? I think I'm better than everyone else. I want to take what's yours and never give it back. What's mine is mine, and what's yours is mine."
},
{
"timestamp": [
70.1,
81.6
],
"text": " Am I a bad person? Tell me, am I?"
},
{
"timestamp": [
86.8,
null
],
"text": " Does that make me a bad person? Tell me, does it?"
}
],
"audio_file": "Winning Isn't For Everyone Am I a Bad Person Nike.mp3",
"translation_german": "Bin ich ein schlechter Mensch? Sag mir. Bin ich? bin ich zielstrebig. bin ich trügerisch. bin ich besessen. bin ich egoistisch. Macht mich das zu einem schlechten Menschen? bin ich ein schlechter Mensch? bin ich es? Ich habe kein Mitgefühl. Ich respektiere dich nicht. bin nie zufrieden. habe eine Besessenheit von Macht. bin ich irrational. habe keine Reue. habe keinen Mitgefühl. bin ich wahnhaft. bin ich verrückt. meinst du, ich sei ein schlechter Mensch? sag mir. sag mir. sag mir. sag mir. sage mir. bin ich? ich denke, ich bin besser als alle anderen. ich will nehmen, was dir gehört, und es nie zurückgeben. was mir gehört, ist mir, und was dir gehört, ist mir. bin ich ein schlechter Mensch? sage mir? macht mich das zu einem schlechten Menschen? sag es mir?",
"translation_chinese": "我是一个坏人吗?告诉我.我吗?我是专注的.我是欺骗的.我是痴迷的.我是自私的.这是否使我成为一个坏人?我是一个坏人吗?我吗?我没有同情心.我不尊重你.我从来不满足.我对权力痴迷.我是不理性.我没有悔恨.我没有同情心.我妄想.我是疯狂的.你认为我是一个坏人吗?告诉我.告诉我.告诉我.告诉我.告诉我.告诉我.我吗?我认为我比其他人好.我想拿走属于你的东西,永远不要把它还给.属于我的东西是我的,属于你的东西是我的.我是一个坏人吗?告诉我?这是否让我成为一个坏人?告诉我.告诉我.告诉我.告诉我.告诉我.告诉我.告诉我.告诉我.我吗?我吗?我认为我比其他人好.我想拿走属于你的东西,永远不要把它还给我.我的东西是我的,你的东西是我的.我是一个坏人吗?告诉我?这让我成为一个坏人吗?告诉我吧?"
}
],
"stats": {
"model": "openai/whisper-large-v3",
"total_video": 28,
"total_extract_audio_time_sec": 3.63,
"total_transcription_time_sec": 45.41,
"total_translation_time_sec": 125.45,
"average_total_time_per_video_sec": 6.23
}
}

FFmpeg took only 4.1 seconds to extract MP3 audio files from 28 television commercials and write them to disk.

Extracting MP3 audio files from MP4-format video files using FFmpeg

The openai/whisper-large-v3 model took approximately 11 GB of GPU memory to run the transcriptions. The total transcription time for the 28 television commercials in the dataset was 49.78 seconds on my system, an average of 1.78 seconds per transcription.

Dedicated GPU memory usage while performing inference with the openai/whisper-large-v3 model

Compare openai/whisper-large-v3's 11 GB of GPU memory usage to openai/whisper-tiny, which consumed only 2 GB of GPU memory.

GPU memory and utilization with openai/whisper-tiny

Using the facebook/nllb-200-distilled-1.3B model, the entire translation process took 213.32 seconds for two translations per audio file (56 translations in total), an average of 3.80 seconds per translation.

Translating transcriptions from source languages to German and Chinese

Conclusion

In this post, we learned how to generate transcriptions and translations of videos using a combination of smaller task-specific open-weight models. We used OpenAI’s encoder-decoder (sequence-to-sequence) Whisper transformer model to transcribe and translate speech from video. We then translated the transcriptions into our chosen languages using Facebook’s NLLB-200 machine translation model. Once transcribed, this data can be merged with the source video for more detailed analysis and understanding using generative AI.

If you are not yet a Medium member and want to support authors like me, please sign up here: https://garystafford.medium.com/membership.

This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, images, logos, and brands are the property of their respective owners.


Written by Gary A. Stafford

Area Principal Solutions Architect @ AWS | 14x AWS Certified Gold Jacket | Polyglot Developer | DataOps | GenAI | Technology consultant, writer, and speaker
