Multilingual Vision Captioning: A Multi-Model Multimodal Approach to Image and Video Captioning and Translation

Using a combination of Meta’s Llama 3.2 11B Vision Instruct, Facebook’s 600M NLLB-200, and LLaVA-Next-Video 7B models to produce multilingual image and video captions, descriptive tags, and sentiment analyses.

Gary A. Stafford
34 min read · Oct 8, 2024

Video: The Coca-Cola Co., 1971, “Hilltop” commercial featuring the famous “I’d Like to Buy the World a Coke” song.

Descriptive Tags: music, singing, group, harmony, joy, happiness, celebration, youth, love, friendship, unity, nature, outdoors, sunny, vintage, retro, 1960s, fashion, soda, Coca-Cola

Natural Language Description: The video features a group of young people standing together, singing, and smiling at the camera. The scene is set in a brightly lit outdoor area, with a clear blue sky and trees in the background. The group consists of men and women, dressed in colorful, casual clothing. The camera angle is slightly elevated, capturing the group from the chest up. The audio is clear, with the voices of the singers audible and the background noise minimal. The

Written by Gary A. Stafford

Area Principal Solutions Architect @ AWS | 10x AWS Certified Pro | Polyglot Developer | DataOps | GenAI | Technology consultant, writer, and speaker
