REAL TIME SPEECH DRIVEN FACE ANIMATION

2025-06-28 16:28:55 - Adil Khan

Project Title

REAL TIME SPEECH DRIVEN FACE ANIMATION

Project Area of Specialization

Artificial Intelligence

Project Summary

This project constructs and implements a real-time speech-to-face animation system. The program is based on the Visage Technologies software. Neural networks are used to classify the incoming speech, and the program shows an animated face that mimics the sound. The animation itself is already implemented, so the work in this project focuses on signal processing of the audio signal and on the implementation of speech-to-lip mapping and synchronization. It is essential that the facial animation and the sound are synchronized, which places strict timing demands on the program. Some delay must be accepted, since speech has to be heard before it can be classified. The goal set for this thesis is 100 ms as the upper limit on the delay from input speech to visualization.

Both Matlab and Visual C++ are used to implement the work as a Windows application.

Project Objectives

The goal of this project is to implement a system that analyses an audio signal containing speech and classifies it into lip-shape categories (visemes), in order to synchronize the lips of a computer-generated face with the speech.

Project Implementation Method

The implementation is accomplished by integrating Matlab functions with C/C++ using Microsoft Visual C++ 6.0. The integration is made possible by the Matlab Compiler, which can be used as a plug-in in Visual Studio and generates C++ code from m-functions. The functions can then be used in much the same way as in Matlab, with the limitation that arrays and matrices used by the functions must be of the type MwArray. The 38 neural networks are created and trained on the training database in Matlab. Their biases and weight matrices are extracted and saved as Matlab files. These are loaded, together with the Fisher matrix W (also calculated in Matlab), from the C program.

To play the sound, source code from Microsoft's DirectX 9.0 Software Development Kit (SDK) is used. Since the MFCC calculations require frames of 256 samples of raw sound data, the sound is segmented into frames of that size. When a frame has been played, the played data is stored and the calculations are performed during the playback of the next frame. These calculations consist of MFCC extraction and simulation of the 38 neural networks. The outputs are added to the outputs from the previous frames. It is necessary that the calculation time does not exceed 16 ms, which is the playback time of one frame. Every fourth frame, the viseme class with the largest sum of output values from the neural networks is presented on the screen. The program is based on the Visage Technologies software [2], to which it adds a new feature, called SpeechToFace, together with its graphical user interface (GUI).
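The accumulate-and-select step above can be sketched as follows. This is a minimal illustration with hypothetical names (the real SpeechToFace code, driven by Matlab-Compiler-generated functions, differs): per-frame outputs of the 38 networks are summed, and every fourth frame the class with the largest sum is emitted as the viseme to display.

```cpp
#include <array>
#include <algorithm>
#include <cstddef>

constexpr std::size_t kNumNets = 38;        // one output per viseme class
constexpr std::size_t kFramesPerViseme = 4; // a viseme is emitted every 4th frame

// Accumulates per-frame network outputs and picks the winning viseme
// class once every kFramesPerViseme frames.
class VisemeAccumulator {
public:
    // 'outputs' holds the 38 network outputs for one 256-sample frame
    // (i.e. the result of MFCC extraction plus network simulation).
    // Returns the winning class index on every 4th call, -1 otherwise.
    int feed(const std::array<double, kNumNets>& outputs) {
        for (std::size_t i = 0; i < kNumNets; ++i) sums_[i] += outputs[i];
        if (++frames_ % kFramesPerViseme != 0) return -1;
        const auto best = std::max_element(sums_.begin(), sums_.end());
        const int winner = static_cast<int>(best - sums_.begin());
        sums_.fill(0.0); // start a fresh accumulation window
        return winner;
    }

private:
    std::array<double, kNumNets> sums_{};
    unsigned frames_ = 0;
};
```

Summing over four frames smooths out single-frame misclassifications at the cost of the 64 ms accumulation delay discussed in the summary.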

Benefits of the Project

Real-world listening environments are often noisy: many people talk simultaneously in a busy pub or restaurant, background music plays frequently, and traffic noise is omnipresent in cities. Seeing a speaker's face makes it considerably easier to understand them, and this is particularly true for people with hearing impairments or for anyone listening in background noise. The main benefit of the project is therefore for these groups of people.

Technical Details of Final Deliverable

The project is implemented in C++ on the PC/Windows platform. The program reads speech from pre-recorded audio files and continuously performs spectral analysis of the speech. Neural networks are used to classify the speech into a sequence of phonemes, and the corresponding visemes are shown on the screen. Some time delay between input speech and the visualization could not be avoided, but the overall visual impression is that sound and animation are synchronized.

Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Manufacturing, Media
Core Technology: Artificial Intelligence (AI)
Other Technologies: Robotics
Sustainable Development Goals: Good Health and Well-Being for People, Quality Education

Required Resources
Elapsed Time Since Project Start   Milestone   Deliverable
Month 1                            Analysis    Final analysis report
Month 2                            Design      Final design
Month 3                            Coding      Implemented project
Month 4                            Testing     Tested project
