Design of a Deep Neural Network Model for Image Caption Generation


KIPS Transactions on Software and Data Engineering, Vol. 6, No. 4, pp. 203-210, Apr. 2017
DOI: 10.3745/KTSDE.2017.6.4.203
Keywords: Image Caption Generation, Deep Neural Network Model, Model Transfer, Multi-Modal Recurrent Neural Network
Abstract

In this paper, we propose an effective deep neural network model for image caption generation and model transfer. The model belongs to the family of multi-modal recurrent neural networks and comprises the following distinct layers: a convolutional neural network layer that extracts visual information from images, an embedding layer that converts each word into a low-dimensional feature vector, a recurrent neural network layer that learns the structure of caption sentences, and a multi-modal layer that combines the visual and language information. The recurrent layer is built from LSTM units, which are well known to be effective for learning and transferring sequence patterns. Moreover, the model has a unique structure in which the output of the convolutional neural network layer is linked not only to the initial state of the recurrent layer but also to the input of the multi-modal layer, so that the visual information extracted from the image can be used at every recurrent step when generating the corresponding textual caption. Through comparative experiments on the open datasets Flickr8k, Flickr30k, and MSCOCO, we demonstrate that the proposed multi-modal recurrent neural network model achieves high performance in terms of both caption accuracy and model transfer effect.
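The abstract describes the layer wiring only in prose. The following PyTorch sketch is not taken from the paper; all class names, layer sizes, and the assumption that the CNN image feature is a precomputed vector (e.g., 2048 dimensions from a pretrained network) are illustrative. It shows one plausible way to realize the described structure: the image feature initializes the LSTM state and is also fed into the multi-modal layer at every recurrent step.

```python
import torch
import torch.nn as nn

class MultimodalCaptionModel(nn.Module):
    """Minimal, hypothetical sketch of the described caption model.

    Assumes the convolutional feature for each image has already been
    extracted into a fixed-length vector (img_dim). All dimensions are
    illustrative, not the paper's actual settings.
    """
    def __init__(self, vocab_size=10000, img_dim=2048, embed_dim=256,
                 hidden_dim=512, multimodal_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # word embedding layer
        self.init_h = nn.Linear(img_dim, hidden_dim)            # image feature -> initial hidden state
        self.init_c = nn.Linear(img_dim, hidden_dim)            # image feature -> initial cell state
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.multimodal = nn.Linear(hidden_dim + img_dim, multimodal_dim)  # fuse vision + language
        self.out = nn.Linear(multimodal_dim, vocab_size)         # next-word scores

    def forward(self, img_feat, captions):
        # img_feat: (B, img_dim), captions: (B, T) word indices
        h0 = torch.tanh(self.init_h(img_feat)).unsqueeze(0)      # (1, B, hidden_dim)
        c0 = torch.tanh(self.init_c(img_feat)).unsqueeze(0)
        emb = self.embed(captions)                               # (B, T, embed_dim)
        rnn_out, _ = self.lstm(emb, (h0, c0))                    # (B, T, hidden_dim)
        # feed the image feature into the multi-modal layer at every time step
        img_rep = img_feat.unsqueeze(1).expand(-1, rnn_out.size(1), -1)
        fused = torch.tanh(self.multimodal(torch.cat([rnn_out, img_rep], dim=-1)))
        return self.out(fused)                                   # (B, T, vocab_size)

# Usage with dummy data
model = MultimodalCaptionModel()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 12)))  # (2, 12, 10000)
```

Training such a model would typically minimize cross-entropy between the predicted word distribution at each step and the next ground-truth caption word; this detail is standard practice rather than something stated in the abstract.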



Cite this article
[IEEE Style]
D. Kim and I. Kim, "Design of a Deep Neural Network Model for Image Caption Generation," KIPS Transactions on Software and Data Engineering, vol. 6, no. 4, pp. 203-210, 2017. DOI: 10.3745/KTSDE.2017.6.4.203.

[ACM Style]
Dongha Kim and Incheol Kim. 2017. Design of a Deep Neural Network Model for Image Caption Generation. KIPS Transactions on Software and Data Engineering, 6, 4, (2017), 203-210. DOI: 10.3745/KTSDE.2017.6.4.203.