Title
A comprehensive review of the video-to-text problem
Date Issued
01 June 2022
Access level
open access
Resource Type
journal article
Author(s)
Publisher(s)
Springer Science and Business Media B.V.
Abstract
Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information involves videos, we enter Video-Text Research, which includes several challenging tasks such as video question answering, video summarization with natural language, and video-to-text and text-to-video conversion. This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description. This association can be made mainly in two ways: by retrieving the most relevant descriptions from a corpus, or by generating a new description for a given video. These two approaches correspond to essential tasks for the Computer Vision and Natural Language Processing communities: the text-retrieval-from-video task and the video captioning/description task. Both tasks are substantially more complex than predicting or retrieving a single sentence from an image, because the spatiotemporal information present in videos introduces diversity and complexity in both the visual content and the structure of the associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem, covering the main video-to-text methods and the ways to evaluate their performance. We analyze twenty-six benchmark datasets, showing their strengths and drawbacks with respect to the problem's requirements. We also show the progress researchers have made on each dataset, cover the challenges in the field, and discuss future research directions.
Start page
4165
End page
4239
Volume
55
Issue
5
Language
English
OCDE Knowledge area
Computer Science
Subjects
Scopus EID
2-s2.0-85123167163
Source
Artificial Intelligence Review
ISSN of the container
0269-2821
Sponsor(s)
This work has been done as part of the Stic-AmSud Project 18-STIC-09, “Transforming multimedia data for indexing and retrieval purposes”. Jesus Perez-Martin is funded by ANID/Doctorado Nacional/2018-21180648. This work was partially supported by the ANID—Millennium Science Initiative Program - Code ICN17_002, the Department of Computer Science at University of Chile, and the Image and Multimedia Data Science Laboratory (IMScience) at PUC Minas.
Sources of information:
Directorio de Producción Científica
Scopus