Urdu Extractive Summarizer

2025-06-28 16:36:31 - Adil Khan

Project Title

Urdu Extractive Summarizer

Project Area of Specialization

Artificial Intelligence

Project Summary

Our final year project is to build a system that generates extractive summaries of Urdu articles. The user provides an article by uploading a file, entering text, or supplying the URL of the article. The system generates the summary using Natural Language Processing techniques: Sentence Topical Quality Content Words, Sentence Quality Content Words, Term Frequency-Inverse Document Frequency (TF-IDF), and Latent Dirichlet Allocation (LDA) for topic modeling. The user can also select the length of the summary. The system-generated summary will be compared with a human-written summary to check its accuracy and quality. The techniques will then be refined so that their output comes closer to the human-written summary.

Project Objectives

To build an automatic text summarization system that takes an Urdu-language article as input and produces an extractive summary of that article as output.

Project Implementation Method

Method

Natural Language Processing techniques are used to generate the summary. The techniques used in this project are:

Sentence Topical Quality Content Words: sentences are selected based on topical quality. If the content of a sentence matches the title, the sentence has a higher chance of being included in the summary.

Sentence Quality Content Words: sentences are selected for the summary by checking the quality content in each sentence.

Term Frequency-Inverse Document Frequency (TF-IDF): TF measures how often a word occurs in the article, while IDF lowers the weight of words that occur in many documents. A high TF-IDF value therefore marks a word that is distinctive and important for the article.

Latent Dirichlet Allocation (LDA): used for topic modeling of Urdu articles.
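The TF-IDF and title-matching ideas above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the function names `tfidf_sentence_scores` and `title_overlap` are hypothetical, each sentence is treated as a separate document when computing IDF (the source does not say which corpus IDF is computed over), and title matching is reduced to plain word overlap.

```python
import math
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Score each tokenized sentence by the mean TF-IDF weight of its words.

    `sentences` is a list of token lists. Assumption of this sketch: every
    sentence counts as one "document" for the IDF part.
    """
    n = len(sentences)
    # Document frequency: in how many sentences does each word occur?
    df = Counter(word for sent in sentences for word in set(sent))
    scores = []
    for sent in sentences:
        if not sent:
            scores.append(0.0)
            continue
        tf = Counter(sent)
        total = sum((count / len(sent)) * math.log(n / df[word])
                    for word, count in tf.items())
        # Average over distinct words so long sentences are not favoured.
        scores.append(total / len(tf))
    return scores


def title_overlap(sentence, title):
    """Fraction of title words that also appear in the sentence."""
    title_words = set(title)
    if not title_words:
        return 0.0
    return len(title_words & set(sentence)) / len(title_words)


# Hypothetical, already tokenized article and title.
article = [
    ["پاکستان", "کرکٹ", "ٹیم"],
    ["کرکٹ", "مقبول", "کھیل"],
    ["موسم", "خوشگوار"],
]
title = ["پاکستان", "کرکٹ"]
tfidf_scores = tfidf_sentence_scores(article)
title_scores = [title_overlap(sentence, title) for sentence in article]
```

A word that occurs in every sentence gets an IDF of zero here, so it contributes nothing to any score. How the two scores are combined (for example, a weighted sum) is a design choice the source leaves open.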

Benefits of the Project

Time Saving

In this busy world, reading an entire article and picking out its key points is time-consuming and takes considerable effort. Even an article of 400-500 words can take more than 15 minutes. Navigating through hundreds of documents to find the interesting information is a tough job and a waste of time and effort, and for this reason some people do not read articles at all. A text summarizer addresses this problem: it not only reduces reading time but also gives readers only the key details and relevant sentences of the article, turning longer content into a short, concise summary. The summarizer processes an article in seconds, so the user reads less content and still gets the main information about the article.

Benefit for Urdu Readers

Compared with English, Urdu is a low-resource language and lacks research and development resources for natural language processing. However, there are libraries and approaches we can follow to build an extractive text summarizer. Several techniques have been proposed for extractive summarization, each taking a different approach to text processing. Few tools exist for generating Urdu extractive summaries, so a system built for the Urdu language will be very helpful for people who read Urdu articles.

Technical Details of Final Deliverable

User Interface

1. The final deliverable of this project will contain a user interface that displays the input fields and buttons for the user to upload a file, enter a URL, or paste text. The user interface will also display each step of the process used to generate the extractive summary, for example tokenization of the article into sentences and words, and data cleaning such as stop-word removal and punctuation removal.
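The preprocessing steps named above (sentence tokenization, word tokenization, stop-word removal, punctuation removal) might look roughly like this. It is a minimal sketch: the names `split_sentences` and `tokenize` are hypothetical, the stop-word set is a tiny illustrative sample rather than a complete Urdu list, and words are assumed to be separated by whitespace.

```python
import re
import string

# Illustrative sample only; a real system would need a much fuller list.
URDU_STOP_WORDS = {"کا", "کی", "کے", "میں", "ہے", "اور", "سے", "کو", "پر"}

# Sentence-ending marks: Urdu full stop (U+06D4), Arabic question mark
# (U+061F), plus the ASCII "!" and "?".
SENTENCE_END = re.compile("[\u06D4\u061F!?]+")

# ASCII punctuation plus Arabic comma, semicolon, question mark,
# Urdu full stop, and curly quotes. Each is replaced by a space.
PUNCTUATION = string.punctuation + "\u060C\u061B\u061F\u06D4\u201C\u201D\u2018\u2019"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def split_sentences(text):
    """Split raw text into sentences on sentence-ending marks."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def tokenize(sentence):
    """Remove punctuation, split on whitespace, and drop stop words."""
    words = sentence.translate(_PUNCT_TABLE).split()
    return [word for word in words if word not in URDU_STOP_WORDS]


# Hypothetical input text with two sentences.
text = "پاکستان میں کرکٹ مقبول ہے۔ کیا آپ کرکٹ دیکھتے ہیں؟"
sentences = split_sentences(text)
tokens = [tokenize(sentence) for sentence in sentences]
```

An explicit punctuation table is used instead of a `\w`-based pattern so that Urdu diacritics and joiner characters inside words are left untouched. Each intermediate result (`sentences`, `tokens`) is the kind of step the interface could display.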

Sentence Selection

2. Sentences are selected using Natural Language Processing techniques. All sentences that are important for the summary will be extracted.
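Once every sentence has a score, the selection step can be sketched as picking the highest-scoring sentences and returning them in their original order. The function name `select_sentences` and the `ratio` parameter (the fraction of sentences to keep, standing in for the user-selected summary length) are assumptions made for illustration.

```python
def select_sentences(sentences, scores, ratio=0.3):
    """Return the top-scoring sentences, kept in their original order.

    `ratio` is the fraction of sentences to keep; at least one sentence
    is returned for a non-empty article.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    if not sentences:
        return []
    k = max(1, round(len(sentences) * ratio))
    # Rank indices by score, highest first. sorted() is stable, so on a
    # tie the earlier sentence wins.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    # Restore document order so the summary reads naturally.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]


# Hypothetical sentences and scores.
summary = select_sentences(["s0", "s1", "s2", "s3"], [0.1, 0.9, 0.4, 0.8], ratio=0.5)
```

Keeping the chosen sentences in document order, rather than score order, usually makes an extractive summary easier to follow.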

URL Scraper

3. The user can also provide a URL for summarization. The URL will be processed, and all the important content will be scraped and displayed on the user interface.
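A scraper of this kind could be sketched with the Python standard library alone. The assumption here is that the article body sits in `<p>` elements, which will not hold for every site; the names `ParagraphExtractor`, `extract_paragraphs`, and `fetch_article` are hypothetical, and a real system might prefer a dedicated HTML library.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buffer = []
        self.paragraphs = []

    def flush(self):
        # Join the collected pieces and normalize whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # tolerate a previous <p> that was never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the text of every <p> element in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep text from a trailing unclosed <p>
    return parser.paragraphs


def fetch_article(url, timeout=10):
    """Download a page and return its paragraph texts."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_paragraphs(html)
```

Separating parsing (`extract_paragraphs`) from downloading (`fetch_article`) lets the same extraction code handle an uploaded HTML file as well as a URL.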

Urdu Extractive Summary and Comparison

4. The final deliverable will be an Urdu extractive summary generated using NLP. The summary will be compared with a human-written summary, and precision, recall, and F-measure will be calculated to check its quality. The techniques will then be refined to bring the system-generated summary closer to the human-written one.
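The evaluation measures named above can be illustrated with a short sketch. It assumes the comparison is made at the sentence level, that is, the human-written summary is itself a set of sentences taken from the article; the source does not specify the unit of comparison, and a word-overlap variant would follow the same pattern. The function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(system_sentences, reference_sentences):
    """Compare a system summary with a reference summary.

    precision = shared sentences / sentences in the system summary
    recall    = shared sentences / sentences in the reference summary
    F-measure = harmonic mean of precision and recall
    """
    system = set(system_sentences)
    reference = set(reference_sentences)
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical sentence identifiers: two of three system sentences
# also appear in the four-sentence reference summary.
p, r, f = precision_recall_f1(["s1", "s2", "s3"], ["s2", "s3", "s4", "s5"])
```

In this example precision is 2/3 and recall is 2/4, since two sentences are shared.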

Final Deliverable of the Project

Software System

Core Industry

Education

Other Industries

Media

Core Technology

Artificial Intelligence (AI)

Other Technologies

Sustainable Development Goals

Quality Education

Required Resources

Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs)
Website Domain and Hosting | Miscellaneous | 1 | 8000 | 8000
Report Printing | Miscellaneous | 90 | 15 | 1350
Report Binding | Miscellaneous | 1 | 500 | 500
Total (in Rs) | | | | 9850
