Vipul Kumar Singh
Uma Shankar Tiwary
Dept. of Information Technology, Indian Institute of Information Technology Allahabad, Prayagraj
This study introduces a novel method for generating informative descriptions of images by leveraging Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. The proposed method extracts features from images using object detection techniques together with a CNN, and then uses an LSTM network to generate captions for the images. This research proposes an Encoder-Decoder architecture with an attention mechanism. The architecture uses convolutional features obtained from the Xception model, pre-trained on ImageNet, and additionally incorporates object features extracted from the Single Shot Detector (SSD) model, pre-trained on the MS COCO dataset. Much work has been done on image captioning, but substantial improvement is still needed before machine-generated captions match those written by humans. Our model aims to improve the evaluation scores and make the generated captions more coherent with the image by incorporating features from the SSD object detection model. Our approach to feature extraction yields a significant improvement of 29.75% in the METEOR score. The results indicate that the proposed approach outperforms existing methods, suggesting its potential as a valuable tool for automated image captioning in various applications.
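To illustrate the kind of fusion the abstract describes, the sketch below shows one plausible way a decoder could attend jointly over convolutional grid features (as Xception would produce) and per-object features (as SSD would produce). All dimensions, weight shapes, and the additive-attention formulation are illustrative assumptions, not details taken from the paper; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative dimensions (not from the paper):
# a 10x10 grid of convolutional features and up to 5 detected-object
# features, both projected to a common 256-dim annotation space.
cnn_feats = rng.normal(size=(100, 256))   # flattened spatial (Xception-style) features
obj_feats = rng.normal(size=(5, 256))     # per-object (SSD-style) features

# One plausible fusion strategy: concatenate both feature sets into a
# single annotation sequence that the LSTM decoder attends over.
annotations = np.concatenate([cnn_feats, obj_feats], axis=0)  # (105, 256)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(h, annotations, W_h, W_a, v):
    """Bahdanau-style additive attention: score each annotation against
    the decoder hidden state h, then return the attention-weighted
    context vector and the attention weights."""
    scores = np.tanh(annotations @ W_a + h @ W_h) @ v   # (num_annotations,)
    alpha = softmax(scores)                             # attention weights, sum to 1
    context = alpha @ annotations                       # (256,) context vector
    return context, alpha

d = 256
W_h = rng.normal(scale=0.1, size=(d, d))  # hypothetical learned projections
W_a = rng.normal(scale=0.1, size=(d, d))
v = rng.normal(scale=0.1, size=(d,))

h = rng.normal(size=(d,))  # decoder (LSTM) hidden state at some time step
context, alpha = additive_attention(h, annotations, W_h, W_a, v)

assert context.shape == (256,)
assert np.isclose(alpha.sum(), 1.0)
```

At each decoding step, the context vector would be fed to the LSTM alongside the previous word embedding; concatenating object features into the annotation set is one simple way to let the decoder ground words in detected objects.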
Convolutional neural network