Audio Visual Speech Recognition

2025-06-28 16:25:11 - Adil Khan

Project Title

Project Area of Specialization: Artificial Intelligence

Project Summary

In daily life we convey our thoughts either by writing or by speaking. In today's fast-paced world, people connect over the internet more often than face to face. Communication becomes difficult in noisy environments such as factory floors, and during video conference calls in areas where internet connectivity is very weak. We have therefore decided to build a mobile app that helps people with these problems so that they can communicate smoothly.
Another important issue concerns people who have lost their voice in an accident and people who are unable to hear. It is difficult for them to communicate, and for others to understand what they are saying. We have therefore decided to provide a platform that makes their communication smoother, so that they can send and receive messages.

Project Objectives

This project focuses on developing an artificial intelligence module that recognizes lip movements and converts them into text without using audio. Because people use different diction and articulate speech in various ways, devising an automated lip-reading system is very difficult. Our purpose is to enable effective communication in noisy environments, improve video conferencing and online meetings, and make communication easier for people with speech or hearing impairments.

The goal of this work is to recognize phrases and sentences spoken by a talking face, without the audio. We tackle lip reading as an open-world problem and investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy. With the models we train, we will try to surpass the performance of all previous work on a lip-reading benchmark dataset by a significant margin.

Project Implementation Method

We will use the incremental development approach, because we have divided our project requirements into four stand-alone modules.

In this Software Development Life Cycle (SDLC) there are four phases: requirement analysis, design, coding, and testing. The cycle requires good design planning, but changes can be made at each stage throughout development. This model is less costly than others. Complexity arises in a system when fixing a problem in one stage requires corrections in all stages, which consumes a lot of time. In the incremental model, we can reduce the complexity of the system by working from the client's requirements throughout and prioritizing them. Thus, the incremental model is the best approach for this system.

Benefits of the Project

• People can communicate effectively in noisy environments.

• People who need to interpret messages can use it efficiently.

• It is well suited for use as a search option in other apps.

• People with speech or hearing impairments can benefit greatly from it.

• It will work well for video conference calls.

Technical Details of Final Deliverable

We will use feature extraction in our project. This normally refers to the process of extracting features (informative characteristics) from a single video frame, independently of past or future frames, which makes it very similar to extracting features from a static image. Some methods may use several frames to extract more stable features. Feature extraction is usually, though not exclusively, used to initialize an object tracking method so that it can recognize the object.
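As a rough illustration of per-frame feature extraction, the sketch below computes three simple shape features from one mouth crop at a time. It is only a toy under stated assumptions: the frame is a small grayscale grid, dark pixels are assumed to belong to the open mouth, and the names `extract_mouth_features` and `threshold` are hypothetical, not part of the project.

```python
from typing import List, Tuple

# A grayscale mouth crop: rows of pixel intensities in the range 0-255.
Frame = List[List[int]]


def extract_mouth_features(frame: Frame, threshold: int = 100) -> Tuple[float, float, float]:
    """Return (width, height, area) of the dark region in one frame.

    Assumption: pixels darker than `threshold` belong to the open mouth.
    Only the given frame is used; no past or future frames are consulted.
    """
    dark_rows = [r for r, row in enumerate(frame) if any(p < threshold for p in row)]
    dark_cols = [c for row in frame for c, p in enumerate(row) if p < threshold]
    if not dark_rows:
        return (0.0, 0.0, 0.0)
    width = float(max(dark_cols) - min(dark_cols) + 1)
    height = float(max(dark_rows) - min(dark_rows) + 1)
    area = float(len(dark_cols))  # dark_cols holds one entry per dark pixel
    return (width, height, area)


# Two hypothetical frames: a nearly closed mouth and an open mouth.
closed_mouth = [
    [200, 200, 200, 200],
    [200, 50, 50, 200],
    [200, 200, 200, 200],
]
open_mouth = [
    [200, 50, 50, 200],
    [200, 50, 50, 200],
    [200, 50, 50, 200],
]

# Each frame is processed on its own, exactly like a static image.
features = [extract_mouth_features(f) for f in (closed_mouth, open_mouth)]
print(features)
```

A real system would likely replace these hand-made features with learned ones, but the per-frame structure stays the same.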

Many components will need to be reused in different places, so we will try to write code that is reusable in most cases. The project has two parts, the front-end and the back-end, and we will try to apply this approach in both.

We studied the Human-Computer Interaction course, which helps us design our app and select its color scheme. It helps us make the application front-end more aesthetic and interactive.

Feature tracking is the process of following features from frame to frame. Tracking techniques normally leverage knowledge of where the features were in the previous frame.
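The idea of using the previous position during feature tracking can be sketched as a local search: instead of scanning the whole frame, the tracker only looks in a small window around where the feature was last seen. This is a minimal sketch, assuming the feature is the darkest pixel nearby; `track_feature` and `radius` are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

Frame = List[List[int]]      # grayscale image as rows of intensities
Position = Tuple[int, int]   # (row, column)


def track_feature(frame: Frame, prev_pos: Position, radius: int = 1) -> Position:
    """Find the darkest pixel within `radius` of the previous position.

    Assumption: the tracked feature is darker than its surroundings and
    moves at most `radius` pixels between consecutive frames.
    """
    n_rows, n_cols = len(frame), len(frame[0])
    r0, c0 = prev_pos
    best_pos, best_val = prev_pos, frame[r0][c0]
    for r in range(max(0, r0 - radius), min(n_rows, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(n_cols, c0 + radius + 1)):
            if frame[r][c] < best_val:
                best_pos, best_val = (r, c), frame[r][c]
    return best_pos


def make_frame(dark_col: int) -> Frame:
    """Build a 3x4 hypothetical frame with one dark pixel in the middle row."""
    frame = [[9] * 4 for _ in range(3)]
    frame[1][dark_col] = 0
    return frame


# The dark pixel moves one column to the right in every frame.
video = [make_frame(1), make_frame(2), make_frame(3)]

position = (1, 1)  # initial position, e.g. supplied by feature extraction
path = []
for frame in video:
    position = track_feature(frame, position)
    path.append(position)
print(path)
```

Restricting the search to a window keeps the cost per frame small, at the price of losing the feature if it moves farther than `radius` in one step.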

Finally, we will use classification to predict the actual spoken words. Classification refers to a predictive modeling problem in which a class label is predicted for a given example of input data. A typical classification problem is deciding whether a given message is spam or not.
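To make the classification step concrete, here is a minimal sketch using a nearest-centroid classifier, which is a stand-in chosen for brevity and not necessarily the model the project will use. The word labels, the two-dimensional feature vectors, and the function names `fit_centroids` and `predict` are all assumptions made for illustration.

```python
from math import dist
from typing import Dict, List

Vector = List[float]


def fit_centroids(training: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Average the training vectors of each label into one centroid."""
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in training.items()
    }


def predict(centroids: Dict[str, Vector], features: Vector) -> str:
    """Return the label whose centroid is closest (Euclidean) to `features`."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical training data: (mouth width, mouth height) per spoken word.
training_data = {
    "hello": [[2.0, 1.0], [2.2, 1.2]],
    "open": [[2.0, 3.0], [1.8, 2.8]],
}
centroids = fit_centroids(training_data)

print(predict(centroids, [2.0, 1.3]))
print(predict(centroids, [2.0, 2.5]))
```

Each input example receives exactly one class label, which here is the predicted word.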

Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Others
Core Technology: Artificial Intelligence (AI)
Other Technologies: Others
Sustainable Development Goals: Quality Education

Required Resources

Item Name	Type	No. of Units	Per Unit Cost (in Rs)	Total (in Rs)
Null	Miscellaneous	0	0	0
			Total (in Rs)	0
