Real time deep visual semantic alignments for creating automatic image captioning
Humans can easily describe the setting they are in because of their cognitive abilities. Machines, however, find it difficult to interpret the visual world around them. By combining computer vision, natural language processing and deep learning, our system learns the spatial relationships between the objects in an image and describes them in natural language. The proposed system can be used in many scenarios, mainly natural human-robot interaction, navigation for the blind, image retrieval, social media content creation and early childhood development.
Cross-modal retrieval is performed to generate semantically correct image descriptions using computer vision, natural language processing and deep learning techniques. To extract features, objects and other distinct elements from the input image, we employ a convolutional neural network (CNN), which produces a dense feature vector called an embedding. These embeddings are fed to a long short-term memory (LSTM) network that predicts the probability of each word occurring next in the sentence, which allows new, unique sentences to be generated. To align the sentence with the context of the input image, we then apply a feed-forward neural network, so that a semantically and syntactically correct sentence is formed.
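The way the image embedding and the text representation meet in the feed-forward layer can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two branches are merged by elementwise addition (concatenation is another common choice), the weights are random toy values rather than trained parameters, and the names `dense`, `softmax` and `merge_and_predict` are hypothetical, not taken from the project code.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one output score per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_and_predict(image_vec, text_vec, weights, biases, vocab):
    """Merge both branches (assumed: elementwise addition), then
    return a probability for every word in the vocabulary."""
    merged = [i + t for i, t in zip(image_vec, text_vec)]
    probs = softmax(dense(merged, weights, biases))
    return dict(zip(vocab, probs))


# Toy values only: a 4-dimensional embedding and a 4-word vocabulary.
rng = random.Random(0)
vocab = ["dog", "runs", "grass", "endseq"]
dim = 4
image_vec = [rng.uniform(-1, 1) for _ in range(dim)]  # stands in for the CNN embedding
text_vec = [rng.uniform(-1, 1) for _ in range(dim)]   # stands in for the LSTM output
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
biases = [0.0] * len(vocab)

probs = merge_and_predict(image_vec, text_vec, weights, biases, vocab)
print(probs)
```

In a real system the two input vectors would come from the trained CNN and LSTM, and the word with the highest probability would be chosen as the next word of the caption.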
The models are trained on the Flickr8k dataset, and the final implementation runs on a Raspberry Pi so that descriptions can be generated in real time. The system's accuracy is measured with the BLEU, METEOR, ROUGE and CIDEr scores. BLEU defines quality as the correspondence between a machine's output and that of a human, and it was one of the first metrics to claim a high correlation with human judgements of quality. In BLEU, precision and recall are approximated by modified n-gram precision and best match length, respectively. METEOR is a metric for evaluating machine translation output that assigns every candidate translation a score in the range of 0 to 1. It modifies the precision and recall computations, replacing them with a weighted F-score based on unigram mapping and a penalty function for incorrect word order. CIDEr is used to evaluate automatically generated image captions, and it is reported to agree more closely with human judgement than the earlier metrics. ROUGE is based only on recall and is mostly used for summary evaluation.
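To make the two BLEU ingredients named above concrete, the following is a minimal sketch of modified (clipped) n-gram precision and the brevity penalty based on best match length. It is a simplified illustration for a single sentence, not the evaluation code used in the project. The function names are hypothetical, and the default of `max_n=2` is an assumption chosen to keep the example short (BLEU is usually reported up to 4-grams).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalise candidates that are shorter than the best match length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Best match length: the reference length closest to the candidate length.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=2):
    """Geometric mean of the clipped precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
repeated = "the the the the the the the".split()
# Only two of the seven "the" tokens are credited, so this prints 2/7.
print(modified_precision(repeated, references, 1))
```

Clipping is what stops a degenerate caption that repeats a common word from scoring well, and the brevity penalty plays the role of recall by discouraging captions that are too short.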
The goal of our project, “Real time deep visual semantic alignments for creating automatic image captioning”, is to perform cross-modal retrieval to generate semantically correct image descriptions using computer vision, natural language processing and deep learning techniques. We aim to build a device that is useful for visually impaired people, for robots and for self-driving cars. The main objectives are as follows:
For real-time image caption generation, the system relies on neural networks to produce novel descriptions. A convolutional neural network (CNN) extracts the visual features, and an LSTM generates the sequence of words. The LSTM, an extended form of the recurrent neural network (RNN), works as an encoder for the words embedded by the word embedding layer. The image features and the sequence of generated words are passed to a feed-forward neural network layer, which produces the final output description on the basis of likelihood. Because the image features and the linguistic features are merged in the feed-forward layer, the model is known as the Merge Model. It is also known as cross-modal retrieval.

The Flickr8k dataset is used because it is small, feasible to work with and naturalistic for image captioning. It consists of 8000 images with five captions for each image, and it covers diverse landscapes and circumstances. The dataset is publicly accessible and is downloaded from the Flickr website. From the images, we can perform multi-label classification.

We work in a Python 3 environment and install Keras with TensorFlow as the backend, scikit-learn, pandas and matplotlib. The major steps are as follows:

1. Image preprocessing: features are extracted using a convolutional neural network.
2. Text preprocessing: a word embedding layer and a long short-term memory module are used for word generation.
3. Generation of the deep learning model: we select the deep learning model and fit it on the training dataset, training on the images and descriptions it contains. The images are loaded using their distinctive identifiers. The chosen model generates a caption when given an input image, using the LSTM (long short-term memory) method. It generates one word at a time, so the sequence of previously generated words is given as input. We therefore introduce two strings, “startseq” and “endseq”. “startseq” starts the word generation task and “endseq” marks the end of the description. These strings are added to each description as tokens, which makes encoding the text easier and more convenient. The image is given to the LSTM model to generate the next word. The first two words, along with the image, are then given to the model again to produce the following word, and so on. Finally, all the generated words are concatenated, having been fed back iteratively as input, to produce the caption. This is the method used for training the model.
4. Model selection: we use the Merge Model of deep learning, in which the preprocessed text descriptions and the image features are merged and passed through a dense layer to make the final predictions.
5. Implementation on the Raspberry Pi: a real-time image is captured by the Pi camera and shown on a GUI on the LCD. When a button is pressed, the system generates the caption.
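The word-by-word scheme with the “startseq” and “endseq” tokens can be sketched as follows. This is a minimal illustration, not the project's implementation: `make_training_pairs`, `generate_caption` and `toy_predict_next` are hypothetical names, the choice of always taking the most likely word (greedy decoding) is an assumption, and the lookup table merely stands in for the trained model, which would also use the image features.

```python
def make_training_pairs(caption):
    """Wrap a caption in the start/end tokens and split it into
    (words so far, next word) training examples."""
    tokens = ["startseq"] + caption.lower().split() + ["endseq"]
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def generate_caption(predict_next, image_features, max_len=20):
    """Generate a caption one word at a time, feeding the words produced
    so far back into the model until it predicts "endseq"."""
    words = ["startseq"]
    for _ in range(max_len):
        probs = predict_next(image_features, words)
        next_word = max(probs, key=probs.get)  # assumed: pick the most likely word
        if next_word == "endseq":
            break
        words.append(next_word)
    return " ".join(words[1:])  # drop the leading "startseq"


def toy_predict_next(image_features, words):
    """Hypothetical stand-in for the trained model: it ignores the image
    and looks only at the last generated word."""
    table = {
        "startseq": {"dog": 0.7, "cat": 0.3},
        "dog": {"runs": 0.6, "endseq": 0.4},
        "runs": {"endseq": 0.9, "dog": 0.1},
    }
    return table[words[-1]]


for prefix, target in make_training_pairs("dog runs"):
    print(prefix, "->", target)

print(generate_caption(toy_predict_next, image_features=None))  # prints "dog runs"
```

Each caption of n words yields n + 1 training examples, and the `max_len` limit guards against a model that never predicts “endseq”.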
We hope that our project can be a productive and effective application that lets robots observe their environment and assists visually impaired people. The application can also be used for other purposes such as video framing and automation. The linguistic accuracy of the captions can be improved further.
| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| Raspberry-Pi 3b+ | Equipment | 1 | 7000 | 7000 |
| Raspberry Pi Camera Module v2 | Equipment | 1 | 6000 | 6000 |
| HDMI Cable connector | Equipment | 1 | 500 | 500 |
| Speakers | Equipment | 1 | 800 | 800 |
| TX4506 LCD Screen | Equipment | 1 | 5200 | 5200 |
| Wireless mouse | Equipment | 1 | 1000 | 1000 |
| Power bank | Equipment | 1 | 3500 | 3500 |
| Heat sink | Equipment | 1 | 1000 | 1000 |
| Online GPU Server (Paperspace) | Equipment | 1 | 7000 | 7000 |
| Printing cost | Miscellaneous | 1 | 1000 | 1000 |
| Total (in Rs) | | | | 33000 |