Multilingual Vision Captioning: A Multi-Model Multimodal Approach to Image and Video Captioning and Translation
Combining three models, Meta’s Llama 3.2 11B Vision Instruct, Facebook’s 600M-parameter NLLB-200, and LLaVA-Next-Video 7B, to produce multilingual image and video captions, descriptive tags, and sentiment analyses.
Video: The Coca-Cola Co., 1971, “Hilltop” commercial featuring the famous “I’d Like to Buy the World a Coke” song.
Descriptive Tags: music, singing, group, harmony, joy, happiness, celebration, youth, love, friendship, unity, nature, outdoors, sunny, vintage, retro, 1960s, fashion, soda, Coca-Cola
Natural Language Description: “The video features a group of young people standing together, singing, and smiling at the camera. The scene is set in a brightly lit outdoor area, with a clear blue sky and trees in the background. The group consists of men and women, dressed in colorful, casual clothing. The camera angle is slightly elevated, capturing the group from the chest up. The audio is clear, with the voices of the singers audible and the background noise minimal. The…