Urdu Audio Miner


2025-06-28 16:29:54 - Adil Khan

Project Title

Urdu Audio Miner

Project Area of Specialization

Artificial Intelligence

Project Summary

Urdu Audio Miner is a web-based application for searching for words or phrases in Urdu audio files. Searching a few audio files by hand may be easy, but searching many can become tiring and time-consuming, and automating this task is a great help; that is what our application sets out to do. The user uploads audio files, or provides a YouTube video link, to search through. The word or phrase to search for can then be specified in three ways: by recording it through a microphone, by uploading an audio clip containing it, or by typing it in Roman Urdu. Once the search finishes, the application displays the matched text along with the timestamps where matches were found, and the user can play the audio file from that location.

Apart from searching, Urdu Audio Miner can also detect the use of offensive language in audio files. After uploading audio files, the user can check whether they contain abusive or offensive words and play the audio at those points.

Nowadays most apps incorporate speech-to-text features, and our application also offers transcription. With real-time decoding, the user can see results as the audio file is processed. Once the transcription is complete, the user can download it in text format.

Project Objectives
  1. Train an Urdu automatic speech recognition (ASR) system to transcribe audio files.
  2. Train an offensive language classification model to label text as offensive or not.
  3. Extract timestamps where the searched word or phrase was matched.
  4. Integrate microphone recording and Roman Urdu to Urdu conversion.
  5. Build a web interface for GUI tasks such as uploading audio files and showing results.
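Objective 3 above, locating timestamps for a matched phrase, can be sketched against a Kaldi-style CTM word alignment, where each entry carries a word with its start time and duration. The alignment data and function below are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch: locating a word or phrase in a Kaldi-style CTM
# word alignment, where each entry is (word, start_sec, duration_sec).
# The alignment data below is made up for demonstration.

def find_phrase(ctm, phrase):
    """Return (start, end) timestamps for every match of `phrase`."""
    words = phrase.split()
    hits = []
    for i in range(len(ctm) - len(words) + 1):
        window = ctm[i:i + len(words)]
        if [w for w, _, _ in window] == words:
            start = window[0][1]
            end = round(window[-1][1] + window[-1][2], 2)
            hits.append((start, end))
    return hits

ctm = [("salam", 0.0, 0.4), ("dunya", 0.5, 0.5), ("salam", 1.2, 0.4)]
print(find_phrase(ctm, "salam"))        # [(0.0, 0.4), (1.2, 1.6)]
print(find_phrase(ctm, "salam dunya"))  # [(0.0, 1.0)]
```

The returned (start, end) pairs are what the interface would display as timestamps and use as playback offsets.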
Project Implementation Method

As with most data science projects, we started with data collection. We collected more than 115 hours of Urdu audio along with transcriptions to train our automatic speech recognition model; the dataset consisted mostly of read speech, plus some data collected from YouTube videos. We then cleaned the data, which included correcting mispronounced words and replacing Arabic Unicode letters not used in Urdu. Using the Kaldi speech recognition framework, we trained HMM-GMM based models. Using PronouncUR we generated lexicon entries for the unique words in our corpus, and we collected additional data for the language model. After formatting the data according to Kaldi's requirements, we trained monophone, triphone (tri1, tri2, tri3), and SGMM2 models.
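One of the cleaning steps mentioned above, replacing Arabic Unicode letters not used in Urdu, can be sketched as a simple code-point mapping. The mapping below covers only a few common cases and is an illustration, not the project's full normalization.

```python
# Illustrative sketch of one cleaning step: mapping Arabic code points
# that do not occur in Urdu orthography to their Urdu equivalents.
# This table is deliberately small; a real cleaner would cover more.

ARABIC_TO_URDU = {
    "\u064A": "\u06CC",  # Arabic yeh  -> Urdu yeh
    "\u0643": "\u06A9",  # Arabic kaf  -> Urdu kaf
    "\u0629": "\u06C3",  # Arabic teh marbuta -> Urdu teh marbuta goal
}

def normalize_urdu(text: str) -> str:
    """Replace mapped Arabic letters; leave everything else untouched."""
    return text.translate(str.maketrans(ARABIC_TO_URDU))

word = "\u0643\u0627\u0645"  # a word written with Arabic kaf
print(normalize_urdu(word) == "\u06A9\u0627\u0645")  # True
```

Normalizing to a single code point per letter matters because the lexicon and language model treat visually identical but differently encoded words as distinct tokens.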

For offensive language detection, we used an existing dataset and supplemented it with additional data from offensive tweets. We trained models with the Naïve Bayes and Logistic Regression algorithms for this task. Tools and libraries such as Google Colab, NLTK, UrduHack, and scikit-learn were used for preprocessing the datasets and training the models.
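A minimal sketch of this classifier setup, using scikit-learn, might look as follows. The four toy Roman Urdu examples and their labels are placeholders, not the project's actual dataset, and the TF-IDF features are an assumption about the preprocessing.

```python
# Minimal sketch: train both classifiers mentioned above on toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["aap bohat ache ho", "kitna acha din hai",
         "tum bewaqoof ho", "kya pagal banda hai"]
labels = [0, 0, 1, 1]  # 0 = clean, 1 = offensive (placeholder labels)

models = {}
for clf in (MultinomialNB(), LogisticRegression()):
    name = type(clf).__name__
    models[name] = make_pipeline(TfidfVectorizer(), clf).fit(texts, labels)
    print(name, models[name].predict(["tum bewaqoof banda ho"]))
```

In the application, the transcribed text of each uploaded file would be fed to a model like this to flag segments containing offensive words.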

Without a good user interface, even good programs can be difficult to use. For the backend of our web application, we used Flask along with other libraries such as PyAudio and Deep Translate. For the front end, we used HTML, CSS, Bootstrap, and JavaScript (vanilla JS, Recorder.js, Wavesurfer.js).
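A hypothetical sketch of the kind of Flask upload endpoint such a backend needs is shown below; the route, form field name, and response shape are assumptions for illustration, not the project's actual API.

```python
# Hypothetical sketch of an audio upload endpoint in Flask.
import os
import tempfile

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    f = request.files.get("audio")
    if f is None or f.filename == "":
        return jsonify(error="no audio file provided"), 400
    # Sanitize the client-supplied name before writing to disk.
    name = secure_filename(f.filename)
    f.save(os.path.join(tempfile.gettempdir(), name))
    return jsonify(filename=name), 200
```

The front-end JavaScript (e.g. the Recorder.js output) would POST the recorded or selected file as multipart form data to an endpoint like this.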

Benefits of the Project
  1. Saves time over manually searching through audio files recorded in Urdu.
  2. Transcribes Urdu audio files and allows downloading the transcription in text format.
  3. Detects offensive Urdu words in audio files.
Technical Details of Final Deliverable

The final deliverable implements the pipeline described under Project Implementation Method above: the Kaldi-based HMM-GMM ASR models, the Naïve Bayes and Logistic Regression offensive language classifiers, and the Flask-based web application.

Final Deliverable of the Project: Software System
Core Industry: Media
Other Industries: Education, IT, Telecommunication
Core Technology: Artificial Intelligence (AI)
Sustainable Development Goals: Decent Work and Economic Growth

Required Resources
Item Name    Type             No. of Units    Per Unit Cost (Rs)    Total (Rs)
Stationery   Miscellaneous    6               100                   600
Total (Rs): 600
