Speech Emotion Recognition using Machine Learning
2025-06-28 16:29:37 - Adil Khan
Project Area of Specialization: Artificial Intelligence

Project Summary
Speech emotion recognition is a challenging task, primarily because of inter- and intra-speaker variability in how emotions are expressed in speech. Identifying discriminative acoustic features that can infer human emotions both accurately and in real time therefore remains an open research problem. Automated real-time emotion recognition has a wide range of applications in diverse fields that use human emotional reactions, including (though not limited to) health care, marketing, assisted living, and human-robot interaction. Because of their continuous use, smart devices such as smartwatches and smartphones can potentially act as human sensors: they are capable of capturing real-time speech signals and sending the data to the cloud or a mobile app for emotion recognition. In this project, we will research and develop a speech emotion recognition system. The system will be able to recognize emotions such as anger, boredom, disgust, fear, happiness, sadness, and the neutral state, a set of emotional states widely used for emotion recognition. The system will detect emotions from the voice by extracting meaningful features and applying machine learning algorithms. Previous research in this field has mostly focused on handcrafted features, with traditional convolutional neural network (CNN) models used to extract high-level features from speech spectrograms to increase recognition accuracy. However, these methodologies have high computational cost. In our Final Year Project (FYP), we plan to propose a novel lightweight method that can infer emotions from speech in real time. In addition, we aim to record speech samples from individuals using a voice recorder, a smartwatch, and a smartphone simultaneously, which will allow us to implement and evaluate our proposed approach on a large and diverse dataset.
Moreover, we will be able to evaluate and compare the accuracy of different speech-sampling devices. Our aim is to develop a method that predicts emotions from speech samples collected from smart devices; since these devices act as human sensors, they can ubiquitously infer their owner's emotional state.
Project Objectives
Humans convey messages not just through spoken words but also through tone, body language, and expressions. The same message spoken in two different manners can have very different meanings. The objective of this FYP is to study and develop a Speech Emotion Recognition (SER) system that finds notable, discriminative features in speech signals to reflect a speaker's emotional state in real time from acoustic signals logged by a smart device, e.g., a smartwatch or smartphone. For comparison purposes, we will collect the same speech samples with a dedicated voice recorder. Such a system serves various goals in robotics, health care, independent living, marketing, education, and the entertainment industry. To effectively use emotion and affective-health data that captures the richness of everyday life, we need to measure affective states unobtrusively in everyday situations. Smart devices like smartphones and smartwatches include sensors that give them the potential to act as human sensors, and thus could provide rich and accessible information in this respect. A further objective of this project is to collect voice-recorder speech samples from individuals belonging to different demographics, to increase the diversity of our dataset along this dimension as well.
Project Implementation Method
The project will be implemented using the following technologies:
- A smart wearable, preferably a smartwatch, to log speech snippets, with a model running on the phone to recognize emotion
- A smartphone to log speech snippets, with a model running on the phone to recognize emotion
- Signal processing techniques to preprocess the data
- A lightweight, real-time machine learning model running on smart devices to recognize and classify the speaker's emotional state
To summarize, our goal is to use smart devices to log speech snippets, extract informative features from these snippets, and feed the sensed data to the model running on the device. The model will then classify the speaker's emotional state. Once classified, the emotional data can benefit a wide range of industries, from retail to healthcare, in achieving their business objectives.
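The pipeline above (log a snippet, extract features, classify on device) can be illustrated with a minimal NumPy sketch. The specific features (short-time energy and zero-crossing rate statistics), frame sizes, and the nearest-centroid classifier are illustrative assumptions for this proposal, not the project's final design:

```python
import numpy as np

def extract_features(signal, frame_len=400, hop=160):
    """Frame a speech signal (e.g. 25 ms frames at 16 kHz) and summarise
    short-time energy and zero-crossing rate with cheap statistics."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    energy = np.mean(frames ** 2, axis=1)                       # per-frame energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0,  # per-frame ZCR
                  axis=1)
    # Summary statistics keep the feature vector tiny and device-friendly.
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

class NearestCentroid:
    """Minimal on-device classifier: one mean feature vector per emotion."""
    def fit(self, X, y):
        self.labels_ = sorted(set(y))
        self.centroids_ = np.stack(
            [np.mean([x for x, lbl in zip(X, y) if lbl == c], axis=0)
             for c in self.labels_])
        return self
    def predict(self, x):
        # Pick the emotion whose centroid is closest in feature space.
        d = np.linalg.norm(self.centroids_ - x, axis=1)
        return self.labels_[int(np.argmin(d))]
```

A real implementation would use a richer feature set and a trained lightweight model, but the structure (framing, cheap statistics, small classifier) is the point of the sketch.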
Benefits of the Project
This project offers a broad range of benefits, spanning intelligent human-computer interaction, healthcare, education, retail, gaming, automotive, and even security. Human emotion recognition also plays an important role in interpersonal relationships. The major benefits of the project are:
- Extracting and understanding emotion is highly important to interaction and communication between humans and machines.
- One of the most interesting use cases in the retail industry is marketing/advertising based on existing customer and partner engagements: companies want to know how people respond to advertisements, goods, packaging, and store design.
- In robotics, SER can help design smart collaborative or service robots that interact with humans.
- In education, learners' emotional responses to educational content can be recorded in real time and used to adapt and personalize that content. This paves the way for unobtrusive, real-time capture of learners' emotional states to enhance adaptive e-learning approaches.
- In security, SER can monitor the emotional condition of people in a crowd for suspicious conduct, and could be used to deter offenders and suspected terrorists preemptively.
- Assisted-living schemes can use SER models to help elderly people live independently by monitoring their health through their emotional state.
- Emotions can also help determine whether someone is under stress or going through depression. Depression can cause a ripple effect of other conditions, such as insomnia, memory trouble, increased heart-attack risk, weight fluctuations, fatigue, and a weakened immune system.
The final deliverable will consist of the following implementations:
- Record speech data from individuals belonging to different demographic groups using a dedicated voice recorder, a smartwatch, and a smartphone
- Standardize the dataset using data science techniques so it can be used for further research in the future
- Build personal lightweight models using statistical features that are computationally inexpensive, making it feasible to deploy an application on a smartwatch or smartphone that tracks emotion in speech data coming from the device without significantly taxing its processor or battery
- A smartwatch/mobile application to get speech snippets logged by the smart devices and feed the data to the machine learning model for classification
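The application-side flow in the last deliverable (accumulate logged audio, then hand fixed-size windows to the model) could look like the following sketch. The `EmotionStream` class, window length, and sample rate are illustrative assumptions; the classifier is passed in as a callable so any lightweight model can be plugged in:

```python
import numpy as np

SAMPLE_RATE = 16000   # assumed device sample rate (Hz)
WINDOW_SEC = 3        # assumed classification window length (s)

class EmotionStream:
    """Buffers incoming audio chunks from a device microphone and runs a
    supplied classifier each time a full window has accumulated."""
    def __init__(self, classify, window=SAMPLE_RATE * WINDOW_SEC):
        self.classify = classify          # callable: np.ndarray -> emotion label
        self.window = window              # samples per classification window
        self.buffer = np.empty(0)

    def push(self, chunk):
        """Append a chunk of samples; return labels for any complete windows."""
        self.buffer = np.concatenate([self.buffer, chunk])
        results = []
        while len(self.buffer) >= self.window:
            results.append(self.classify(self.buffer[:self.window]))
            self.buffer = self.buffer[self.window:]   # keep the remainder
        return results
```

Keeping the buffering logic separate from the model makes it straightforward to run the same loop on a smartwatch (forwarding windows to the paired phone) or directly on a smartphone.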
| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| SmartWatch | Equipment | 1 | 35000 | 35000 |
| Digital Voice Recorder | Equipment | 1 | 13500 | 13500 |
| Total (in Rs) | | | | 48500 |