Adil Khan 11 months ago
AdiKhanOfficial #FYP Ideas

Real time deep visual semantic alignments for creating automatic image captioning

Project Title

Real time deep visual semantic alignments for creating automatic image captioning

Project Area of Specialization

Artificial Intelligence

Project Summary

Humans can easily describe the setting they are in because of their cognitive abilities. However, it is difficult for machines to infer the visual world around them. By blending the concepts of computer vision, natural language processing and deep learning, we model the spatial relationships of objects in images and describe them in natural language. The proposed system can be used in many different scenarios, mainly natural human-robot interaction, navigation for the blind, image retrieval, creating social media content and early childhood development.

Cross-modal retrieval is performed to generate semantically correct image descriptions by using computer vision, natural language processing and deep learning techniques. To extract features, objects and other distinct elements from the input image, we employ a convolutional neural network (CNN). It produces a dense feature vector, called an embedding. These embeddings are fed to long short-term memory (LSTM) networks, which predict the probability of each word occurring in a sentence, allowing new, unique sentences to be generated. To align the sentence with the context of the input image, we then apply a feed-forward neural network, so that a semantically and syntactically correct sentence is formed.

The algorithms are trained on the Flickr8k dataset, and the final implementation runs on a Raspberry Pi, allowing descriptions to be generated in real time. Our system's accuracy is calculated using the BLEU, METEOR, ROUGE and CIDEr scores. BLEU defines quality as the correspondence between a machine's output and that of a human, and it was one of the first metrics to claim a high correlation with human judgements of quality. In BLEU, precision and recall are approximated by modified n-gram precision and best match length, respectively. METEOR is a metric for the evaluation of machine translation output and assigns a score in the range of 0 to 1 to every candidate translation. It modifies the precision and recall computations, replacing them with a weighted F-score based on mapping unigrams and a penalty function for incorrect word order. CIDEr is used to evaluate automatically generated image captions by their consensus with human-written captions, and its scores are reported to agree more closely with human judgement than those of earlier metrics. ROUGE is recall-oriented and is mostly used for summary evaluation.
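As an illustration of how BLEU combines modified n-gram precision with a length penalty, here is a minimal sketch in plain Python. It is not the evaluation code used in the project; the function names are hypothetical, captions are assumed to be already tokenized into word lists, and no smoothing is applied, so a caption with no matching 4-gram scores 0.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in any single reference caption."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the best-matching reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Geometric mean of the 1..max_n modified precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)
```

With the five reference captions per image in Flickr8k, `references` would hold five token lists and `candidate` the tokens of the generated caption. In practice an established implementation with smoothing is usually preferable to a hand-written one.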

Project Objectives

The goal of our project “real time deep visual semantic alignments for creating automatic image captioning” is to perform cross-modal retrieval to generate semantically correct image descriptions by using computer vision, natural language processing and deep learning techniques. We aim to build a device that is useful for visually impaired people, robots and self-driving cars. The main objectives are as follows:

  • Create image feature encodings based on detected objects
  • Understand linguistic features
  • Create word embeddings for sentences
  • Generate image captions
  • Implement the system on a Raspberry Pi integrated with the Pi camera

Project Implementation Method

For real-time image caption generation, the system relies on neural networks to produce novel descriptions. A convolutional neural network (CNN) is used to obtain visual features. Sequences of words are generated by an LSTM. The LSTM, an extended form of the RNN, acts as an encoder for the words embedded by the word embedding layer. The image features and the sequence of generated words are passed to a feed-forward neural network layer, which produces the final output description on the basis of likelihood. Because the image features and the linguistic features are merged in the feed-forward layer, the model is known as the Merge Model. It is also referred to as cross-modal retrieval.

The Flickr8k dataset is used because it is small, feasible to train on and naturalistic for image captioning. Flickr8k consists of 8000 images with five captions for each image, and it covers diverse landscapes and circumstances. The dataset is publicly accessible and is downloaded from the Flickr website. From the images, we can perform multi-label classification.

We work in a Python 3 environment and install Keras with TensorFlow as the backend, scikit-learn, pandas and matplotlib. The major steps are as follows:

  1. Image preprocessing: features are extracted using convolutional neural networks.
  2. Text preprocessing: a word embedding layer and a long short-term memory (LSTM) module are used for word generation.
  3. Generation of the deep learning model: we select the deep learning model and fit it on the images and descriptions of the training dataset, loading the images by their unique identifiers. The chosen model generates a caption when given an input image, using the LSTM. It generates one word at a time, so the sequence of previously generated words is given as input. We therefore introduce two strings, “startseq” and “endseq”. “startseq” starts the word generation task and “endseq” marks the end of the description. These strings are added to each description as tokens, which makes encoding the text easier. The image is given to the model together with “startseq” to generate the first word. The words generated so far are then given to the model again, along with the image, to produce the next word, and so on. Finally, all the generated words are concatenated to form the caption. The same word-by-word scheme is used for training the model.
  4. Model selection: we use the Merge Model, in which the preprocessed text descriptions and the image features are merged and processed by a dense layer to make the final predictions.
  5. Implementation on Raspberry Pi: a real-time image is captured by the Pi camera and shown on a GUI on the LCD. When a button is pressed, the system generates the caption.
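The token wrapping and word-by-word generation described above can be sketched in plain Python. This is an illustrative sketch only: the helper names are hypothetical, and `predict_next` is an assumed stand-in for the trained model, reduced here to a callable that returns the most likely next word for an image and the words generated so far.

```python
START, END = "startseq", "endseq"


def wrap_caption(caption):
    """Lower-case a caption, split it into words and add the start/end tokens."""
    return [START] + caption.lower().split() + [END]


def training_pairs(tokens):
    """Turn one wrapped caption into (previous words, next word) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def generate_caption(predict_next, image_features, max_length=20):
    """Greedy decoding: feed the words generated so far back into the model.

    predict_next(image_features, words) is a hypothetical callable standing in
    for the trained model; it returns the single most likely next word.
    """
    words = [START]
    for _ in range(max_length):
        next_word = predict_next(image_features, words)
        if next_word == END:
            break  # the model signalled the end of the description
        words.append(next_word)
    return " ".join(words[1:])  # drop the start token


def toy_predict_next(image_features, words):
    """Toy stand-in for a trained model: replays a fixed caption word by word."""
    script = ["a", "dog", "runs", END]
    return script[len(words) - 1]


pairs = training_pairs(wrap_caption("A dog runs"))
caption = generate_caption(toy_predict_next, image_features=None)
```

Each caption yields one training pair per word, so a three-word caption produces four pairs (the last one predicting “endseq”). The `max_length` cap is an assumed safeguard so that generation stops even if the model never emits “endseq”.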

Benefits of the Project

We hope that our project can be a productive and effective application for robots to observe their environment and for visually impaired people. The application can also be used for other purposes such as video framing and automation. The linguistic accuracy of the captions can be improved further.

Technical Details of Final Deliverable

  • Create image feature encodings based on detected objects
  • Understand linguistic features
  • Create word embeddings for sentences
  • Generate image captions
  • Implement the system on a Raspberry Pi integrated with the Pi camera

Final Deliverable of the Project

HW/SW integrated system

Type of Industry

IT

Technologies

Artificial Intelligence (AI)

Sustainable Development Goals

Industry, Innovation and Infrastructure; Partnerships to achieve the Goal

Required Resources

Item Name                      | Type          | No. of Units | Per Unit Cost (in Rs) | Total (in Rs)
Raspberry-Pi 3b+               | Equipment     | 1            | 7000                  | 7000
Raspberry Pi Camera Module v2  | Equipment     | 1            | 6000                  | 6000
HDMI Cable connector           | Equipment     | 1            | 500                   | 500
Speakers                       | Equipment     | 1            | 800                   | 800
TX4506 LCD Screen              | Equipment     | 1            | 5200                  | 5200
Wireless mouse                 | Equipment     | 1            | 1000                  | 1000
Power bank                     | Equipment     | 1            | 3500                  | 3500
Heat sink                      | Equipment     | 1            | 1000                  | 1000
Online GPU Server (paperspace) | Equipment     | 1            | 7000                  | 7000
Printing cost                  | Miscellaneous | 1            | 1000                  | 1000
Total (in Rs)                  |               |              |                       | 33000
If you need this project, please contact me on contact@adikhanofficial.com