Analysis of Research Trends in Deep Learning-Based Video Captioning


KIPS Transactions on Software and Data Engineering, Vol. 13, No. 1, pp. 35-49, Jan. 2024
https://doi.org/10.3745/KTSDE.2024.13.1.35
Keywords: Video Captioning, Computer Vision, Natural Language Processing, Deep Learning
Abstract

Video captioning, a significant outcome of the integration of computer vision and natural language processing, has emerged as a key research direction in artificial intelligence. The technology aims to automatically understand video content and express it in natural language, enabling computers to transform the visual information in a video into text. This paper analyzes research trends in deep learning-based video captioning, categorizing the approaches into four main groups: CNN-RNN-based, RNN-RNN-based, multimodal-based, and Transformer-based models. For each group, the underlying concept is explained and its characteristics, strengths, and weaknesses are discussed. The paper also lists the datasets commonly used in the video captioning field, which cover diverse domains and scenarios and provide extensive resources for training and validating captioning models, and reviews the major evaluation metrics, giving researchers practical references for assessing model performance from multiple angles. Finally, as future research directions, the paper highlights challenges that still require improvement, such as maintaining temporal consistency and accurately describing dynamic scenes, both of which add complexity in real-world applications, and identifies new tasks to be studied, such as temporal relationship modeling and multimodal data integration.
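As a concrete illustration of the first category surveyed in the paper, the sketch below shows a minimal CNN-RNN encoder-decoder captioner in PyTorch. It is not the model from any of the surveyed works: the toy convolutional frame encoder, the temporal mean pooling, and all layer sizes are assumptions chosen for brevity. Practical systems typically replace the encoder with a pretrained 2D/3D CNN backbone and add attention over per-frame features.

```python
import torch
import torch.nn as nn

class CNNRNNCaptioner(nn.Module):
    """Minimal CNN-RNN video captioning sketch: a 2D CNN encodes each
    frame, frame features are mean-pooled over time, and an LSTM
    decoder generates the caption token by token (teacher forcing)."""

    def __init__(self, vocab_size, feat_dim=512, hidden_dim=512, embed_dim=256):
        super().__init__()
        # Stand-in frame encoder; real systems use a pretrained
        # backbone (e.g. ResNet features) instead of this toy stack.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, H, W); captions: (B, L) token ids
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        video = feats.mean(dim=1)             # temporal mean pooling
        h0 = self.init_h(video).unsqueeze(0)  # initialize decoder state
        c0 = self.init_c(video).unsqueeze(0)  # from the video feature
        dec, _ = self.decoder(self.embed(captions), (h0, c0))
        return self.out(dec)                  # (B, L, vocab) logits

model = CNNRNNCaptioner(vocab_size=1000)
logits = model(torch.randn(2, 8, 3, 64, 64),        # 2 clips, 8 frames
               torch.randint(0, 1000, (2, 12)))     # 12-token captions
print(logits.shape)  # torch.Size([2, 12, 1000])
```

Training would minimize cross-entropy between these logits and the shifted ground-truth caption; at inference the LSTM is unrolled step by step with greedy or beam-search decoding.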
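On the evaluation side, n-gram overlap metrics such as BLEU, METEOR, ROUGE-L, and CIDEr are the indicators most commonly reported on video captioning benchmarks. The snippet below is a minimal example of scoring one candidate caption with BLEU-4 via NLTK; the sentences are invented, and smoothing is applied because short captions often have no higher-order n-gram matches.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One (or more) tokenized reference captions and a model output.
references = [["a", "man", "is", "playing", "a", "guitar"]]
candidate  = ["a", "man", "plays", "the", "guitar"]

# BLEU-4: uniform weights over 1- to 4-gram precision, with smoothing.
score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```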


Cite this article
[IEEE Style]
L. Zhi, E. Lee, Y. Kim, "Analysis of Research Trends in Deep Learning-Based Video Captioning," KIPS Transactions on Software and Data Engineering, vol. 13, no. 1, pp. 35-49, 2024. DOI: https://doi.org/10.3745/KTSDE.2024.13.1.35.

[ACM Style]
Lyu Zhi, Eunju Lee, and Youngsoo Kim. 2024. Analysis of Research Trends in Deep Learning-Based Video Captioning. KIPS Transactions on Software and Data Engineering, 13, 1, (2024), 35-49. DOI: https://doi.org/10.3745/KTSDE.2024.13.1.35.