Voice based Gender Recognition Using Deep Learning


2025-06-28 16:29:58 - Adil Khan

Project Title

Voice based Gender Recognition Using Deep Learning

Project Area of Specialization

Artificial Intelligence

Project Summary

Acoustic analysis of the voice depends on parameter settings specific to sample characteristics such as intensity, duration, frequency, and filtering. The acoustic properties of voice and speech can be used to detect the gender of a speaker. The warbleR R package is designed for acoustic analysis, and a data set of acoustic parameters can be obtained with this analysis. The data set can then be trained with different machine learning algorithms. In this project, a multilayer perceptron (MLP) has been used to obtain the model, and the results have been compared with related work. A web page has been designed to detect the gender of a voice using the obtained model.


Project Objectives

The main aim of this project is to detect the gender of a person by using his/her voice.

Project Implementation Method

All training, test, and prediction code has been written using Python libraries. The data set has been loaded from a CSV file into NumPy arrays; each row has 20 parameters and 1 label. The label column has been converted to an integer (0 for male, 1 for female). The array has been shuffled randomly and split into 5 chunks: the first 4 chunks hold 633 samples each and the last holds 636.

5-fold cross-validation has been used and the average score obtained. The training and test loop has been run 5 times; on each run a different chunk has been used for testing, while the remaining chunks are concatenated into a NumPy array and used for training. On each fold, 20% of the data has been used for testing and 10% for validation.

Keras has been used on top of TensorFlow and configured to use the GPU. The model consists of 1 input layer, 4 hidden layers, and 1 output layer. The input layer has 20 inputs and is connected to the first hidden layer, which has 64 perceptrons. The second and third hidden layers have 256 perceptrons each, and the fourth hidden layer has 64. The output layer has 2 perceptrons with a softmax activation function, which yields a categorical distribution over the labels. Dropout of 0.25 has been applied between the hidden layers; dropout randomly sets a fraction of input units to 0 at each update during training, which helps prevent overfitting.

The Nadam optimization algorithm in Keras has been used to train the model with a learning rate of 0.001. This gives slower learning but prevents overshooting the minimum. With this lower learning rate, the model has been trained for 150 epochs; total training time is around 100-120 seconds per fold. Several loss functions have been tested with the model, and the Kullback–Leibler divergence loss has been chosen because it gave the best performance and accuracy.
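The steps above can be sketched in Keras as follows. This is a minimal illustration, not the project's exact code: the hidden-layer activation (ReLU here) and the random seed are assumptions, since the original description does not specify them.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def split_into_folds(data, n_folds=5, seed=42):
    """Shuffle rows and split into n_folds chunks; the last chunk
    absorbs the remainder (e.g. 3168 rows -> 4 x 633 + 1 x 636)."""
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]
    base = len(data) // n_folds
    chunks = [shuffled[i * base:(i + 1) * base] for i in range(n_folds - 1)]
    chunks.append(shuffled[(n_folds - 1) * base:])
    return chunks

def build_model(n_features=20):
    """20 inputs -> 64 -> 256 -> 256 -> 64 -> 2 (softmax),
    with dropout 0.25 between the hidden layers."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),   # activation is an assumption
        layers.Dropout(0.25),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.25),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.25),
        layers.Dense(64, activation="relu"),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Nadam(learning_rate=0.001),
        loss=keras.losses.KLDivergence(),
        metrics=["accuracy"],
    )
    return model
```

In the cross-validation loop, each of the 5 chunks would in turn serve as the test set while the other 4 are concatenated with `np.concatenate` and passed to `model.fit(..., epochs=150, validation_split=0.1)`.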


Benefits of the Project

The model obtained in this project shows that the acoustic properties of voice and speech can be used to detect the gender of a voice. An MLP has been used to obtain the classification model from a data set containing the parameters of voice samples.

This can be beneficial for agencies that need to determine a speaker's gender automatically.


Technical Details of Final Deliverable

In this project, a multilayer perceptron (MLP) deep learning model has been described to recognize voice gender. The data set contains 3,168 recorded samples of male and female voices, processed with acoustic analysis to extract their parameters. An MLP deep learning algorithm has been applied to detect gender-specific traits, and the model achieves 96.74% accuracy on the test data set. An interactive web page has also been built for recognizing the gender of a voice.
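On the web page, the trained model's softmax output has to be turned into a human-readable result. A small hypothetical helper (not part of the original code) could look like this, assuming the label encoding used during training (0 = male, 1 = female):

```python
import numpy as np

def decode_gender(probabilities):
    """Map a 2-element softmax output to a gender label and its
    confidence, using the training encoding 0 = male, 1 = female."""
    labels = ("male", "female")
    probs = np.asarray(probabilities, dtype=float)
    return labels[int(np.argmax(probs))], float(np.max(probs))
```

For example, a model output of `[0.2, 0.8]` would be reported on the page as "female" with 80% confidence.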

Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Education
Core Technology: Artificial Intelligence (AI)
Other Technologies: Robotics
Sustainable Development Goals: Good Health and Well-Being for People

Required Resources

