Aug 20, 2022 | Scientific Research & Postgraduate Studies, ICT Engineering

Pre-trained CNNs as Feature-Extraction Modules for Image Captioning: An Experimental Study

Researchers

Muhammad Abdelhadie Al-Malla, Assef Jafar and Nada Ghneim

Published in

Electronic Letters on Computer Vision and Image Analysis (ELCVIA), vol. 21, no. 1, February 2022.


Abstract

In this work, we present a thorough experimental study of feature extraction with Convolutional Neural Networks (CNNs) for the task of image captioning in the context of deep learning. We perform a set of 72 experiments on 12 image-classification CNNs pre-trained on the ImageNet [29] dataset. Features are extracted from the last layer after removing the fully connected layers and are fed into the captioning model. We use a unified captioning model with a fixed vocabulary size across all experiments to isolate the effect of the CNN feature extractor on image captioning quality, and we score the results with the standard image-captioning metrics. We find a strong relationship between the model structure and the image captioning dataset, and show that VGG models yield the lowest captioning quality among the tested CNNs. Finally, we recommend a set of pre-trained CNNs for each image-captioning evaluation metric one may wish to optimise, and relate our results to previous work. To our knowledge, this work is the most comprehensive comparison of feature extractors for image captioning.

Keywords: Convolutional Neural Network, Feature Extraction, Image Captioning, Deep Learning.

Link to Read Full Paper

https://doi.org/10.5565/rev/elcvia.1436