Frame-by-Frame Multi-object Tracking-Guided Video Captioning

Luo, Hui Lan, Cai, Xia and Shark, Lik (ORCID: 0000-0002-9156-2003) (2025) Frame-by-Frame Multi-object Tracking-Guided Video Captioning. IEEE Transactions on Circuits and Systems for Video Technology. ISSN 1051-8215

PDF (AAM) - Accepted Version, 29MB
Available under License Creative Commons Attribution.

Official URL: https://doi.org/10.1109/TCSVT.2025.3541965

Abstract

Video captioning through deep learning presents a multifaceted challenge, encompassing the extraction of complex spatio-temporal visual features and the synthesis of meaningful natural language descriptions. Most existing deep learning models can be broadly grouped into either convolution-based or transformer-based encoder-decoder networks, with captions generated from pixel-level features in the former, and from grid-, frame-, or video-level features (depending on encoder complexity) in the latter. This paper advocates frame-level features as a more balanced and compact representation for fast caption generation, and introduces the Tracking-guided Information Augmentation for Captioning (Track4Cap) model, which enhances frame-level features through tracking-guided information augmentation without relying on complex architectures or additional data modalities. Specifically, Track4Cap employs the Frame-by-Frame Multi-object Tracking (FMoT) module to identify the most relevant objects in the input video and the Object Relation Encoder (ORE) to model inter-object relationships as supplementary high-level cues for caption generation. By avoiding time-consuming end-to-end training and leveraging compact representations, Track4Cap achieves computational efficiency while improving captioning performance. Extensive experiments on two commonly used benchmark datasets demonstrate that Track4Cap not only achieves faster inference but also outperforms state-of-the-art convolution-based and transformer-based video captioning models. The implementation of our method is publicly available at https://github.com/ccc000-png/Tracker4Cap.
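The abstract gives only a high-level view of the architecture; the authors' actual implementation is in the linked GitHub repository. As a rough orientation, the PyTorch sketch below shows one plausible way the described pipeline could fit together: pre-extracted frame-level features are concatenated with object features that have passed through a self-attention relation encoder (standing in for the ORE), and a lightweight decoder cross-attends to the combined memory to produce caption logits. All class names, tensor shapes, and hyperparameters here are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn


class ObjectRelationEncoder(nn.Module):
    """Hypothetical stand-in for the paper's ORE: self-attention over
    tracked-object features to model inter-object relationships."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, dim) features of tracked objects
        rel, _ = self.attn(obj_feats, obj_feats, obj_feats)
        return self.norm(obj_feats + rel)


class Track4CapSketch(nn.Module):
    """Illustrative pipeline only: frame-level features are augmented with
    relation-encoded object cues, then a decoder generates caption logits."""

    def __init__(self, dim: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.ore = ObjectRelationEncoder(dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, frame_feats, obj_feats, caption_tokens):
        # frame_feats:    (B, T, dim) pre-extracted frame-level features
        # obj_feats:      (B, K, dim) features of the K most relevant objects,
        #                 as a tracking module such as FMoT might supply
        # caption_tokens: (B, L) token ids of the shifted target caption
        memory = torch.cat([frame_feats, self.ore(obj_feats)], dim=1)
        tgt = self.embed(caption_tokens)
        out = self.decoder(tgt, memory)  # cross-attend to augmented visual cues
        return self.head(out)            # (B, L, vocab_size) logits


# Smoke test with random inputs (2 videos, 16 frames, 5 tracked objects):
model = Track4CapSketch()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 5, 512),
               torch.randint(0, 10000, (2, 12)))
```

Note how this sketch reflects the abstract's efficiency argument: the visual side is frozen, pre-extracted features rather than an end-to-end trained backbone, so only the relation encoder and decoder carry trainable parameters.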

