Vipul Kumar Singh
Uma Shankar Tiwary
Dept. of Information Technology, Indian Institute of Information Technology Allahabad, Prayagraj
This study introduces a novel method for generating informative descriptions of images by leveraging Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. The proposed method extracts features from images using object detection techniques together with a CNN, and then uses an LSTM network to generate captions for the images. This research proposes an Encoder-Decoder architecture with an attention mechanism. The architecture uses convolutional features obtained from the Xception model, pre-trained on ImageNet, and additionally incorporates object features extracted from the Single Shot Detector (SSD) model, pre-trained on the MS COCO dataset. Much work has been done on image captioning, but substantial improvement is still needed before machine-generated captions match those written by humans. Our model aims to improve the evaluation scores and make the generated captions more coherent with the image by incorporating features from the SSD object detection model. Our approach to feature extraction yields a significant improvement of 29.75% in the METEOR score. The results indicate that the proposed approach outperforms existing methods, suggesting its potential as a valuable tool for automated image captioning in various applications.
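To illustrate the kind of fusion the abstract describes, the sketch below shows one plausible way a decoder could attend jointly over convolutional grid features (as Xception would produce) and per-object features (as SSD would produce). All dimensions, weight shapes, and the additive-attention formulation are illustrative assumptions, not details taken from the paper; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative dimensions (not from the paper):
# a 10x10 grid of convolutional features and up to 5 detected-object
# features, both projected to a common 256-dim annotation space.
cnn_feats = rng.normal(size=(100, 256))   # flattened spatial (Xception-style) features
obj_feats = rng.normal(size=(5, 256))     # per-object (SSD-style) features

# One plausible fusion strategy: concatenate both feature sets into a
# single annotation sequence that the LSTM decoder attends over.
annotations = np.concatenate([cnn_feats, obj_feats], axis=0)  # (105, 256)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(h, annotations, W_h, W_a, v):
    """Bahdanau-style additive attention: score each annotation against
    the decoder hidden state h, then return the attention-weighted
    context vector and the attention weights."""
    scores = np.tanh(annotations @ W_a + h @ W_h) @ v   # (num_annotations,)
    alpha = softmax(scores)                             # attention weights, sum to 1
    context = alpha @ annotations                       # (256,) context vector
    return context, alpha

d = 256
W_h = rng.normal(scale=0.1, size=(d, d))  # hypothetical learned projections
W_a = rng.normal(scale=0.1, size=(d, d))
v = rng.normal(scale=0.1, size=(d,))

h = rng.normal(size=(d,))  # decoder (LSTM) hidden state at some time step
context, alpha = additive_attention(h, annotations, W_h, W_a, v)

assert context.shape == (256,)
assert np.isclose(alpha.sum(), 1.0)
```

At each decoding step, the context vector would be fed to the LSTM alongside the previous word embedding; concatenating object features into the annotation set is one simple way to let the decoder ground words in detected objects.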
Convolutional neural network