Urdu Extractive Summarizer
2025-06-28 16:36:31 - Adil Khan
Project Area of Specialization
Artificial Intelligence

Project Summary
Our final year project is to build a system that generates extractive summaries of Urdu articles. The user provides an article by uploading a file, entering text, or giving the URL of the article. The system generates the summary using natural language processing techniques, including Sentence Topical Quality Content Words, Sentence Quality Content Words, Term Frequency-Inverse Document Frequency (TF-IDF), and Latent Dirichlet Allocation (LDA) for topic modeling. The user can also select the length of the summary. The system-generated summary will be compared with a human-written summary to check its accuracy and quality. The techniques will then be refined so that their output comes closer to the human-written summary.
Project Objectives
To build an automatic text summarization system that takes an Urdu-language article as input and generates an extractive summary of that article as output.
Project Implementation Method
Natural language processing techniques are used to generate the summary. The project uses four techniques. In Sentence Topical Quality Content Words, sentences are selected based on topical quality: if the content of a sentence matches the title, the sentence has a higher chance of appearing in the summary. In Sentence Quality Content Words, sentences are selected by checking the quality content words in each sentence. In Term Frequency-Inverse Document Frequency (TF-IDF), TF measures how often a word occurs in the article, while IDF reduces the weight of words that occur in many documents, so words that are frequent in the article but rare elsewhere are treated as the most important. Latent Dirichlet Allocation (LDA) is used for topic modeling of Urdu articles.
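To illustrate the title-matching and TF-IDF ideas described above, here is a minimal Python sketch. It is not the project's implementation. The function names `title_overlap_score` and `tfidf_sentence_scores` are hypothetical, sentences are assumed to be already tokenized into lists of words, and, because only a single article is available, each sentence is treated as one "document" when computing IDF.

```python
import math
from collections import Counter


def title_overlap_score(sentence_words, title_words):
    """Fraction of the title's distinct words that also occur in the sentence."""
    title_set = set(title_words)
    if not title_set:
        return 0.0
    return len(title_set & set(sentence_words)) / len(title_set)


def tfidf_sentence_scores(sentences):
    """Score each tokenized sentence by the mean TF-IDF weight of its distinct words.

    Assumption: each sentence counts as one "document" for the IDF term.
    """
    n = len(sentences)
    # Document frequency: the number of sentences in which each word occurs.
    df = Counter(word for sent in sentences for word in set(sent))
    scores = []
    for sent in sentences:
        if not sent:
            scores.append(0.0)
            continue
        tf = Counter(sent)
        total = sum(
            (count / len(sent)) * math.log(n / df[word])
            for word, count in tf.items()
        )
        scores.append(total / len(tf))
    return scores
```

With this weighting, a word that appears in every sentence gets an IDF of zero and contributes nothing, so sentences made up only of very common words score lowest. How the two scores would be combined is a design choice the proposal does not specify.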
Benefits of the Project
Time Saving
In this busy world, reading an entire article and picking out its key points is time-consuming and takes a lot of effort. Even an article of 400-500 words can take more than 15 minutes. Navigating through hundreds of documents to find interesting information is tedious and wastes time and effort, which is why some people do not read articles at all. A text summarizer addresses this problem: it not only reduces reading time but also gives readers only the key details and relevant sentences of the article. It turns longer content into a short, concise summary. The summarizer processes an article in seconds, allowing the user to read less content and still get the main information from the article.
Benefit for Urdu Readers
Urdu has far fewer natural language processing resources than English, both in research and in tooling. However, there are libraries and approaches we can follow to build an extractive text summarizer. Several techniques have been proposed for extractive summarization, each taking a different approach to text processing. Few tools exist for generating Urdu extractive summaries, so this tool can be very helpful for people who read Urdu articles.
Technical Details of Final Deliverable
User Interface
1. The final deliverable will include a user interface with inputs and buttons for the user to upload a file, enter a URL, or paste text. The interface will also display each step of generating the extractive summary, for example tokenizing the article into sentences and words, and data cleaning such as stop word removal and punctuation removal.
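The preprocessing steps named above (sentence tokenization, word tokenization, punctuation removal, stop word removal) could look roughly like the following Python sketch. The function names are hypothetical, the stop word set is only a small illustrative sample rather than a complete Urdu list, and splitting on whitespace is a simplifying assumption that ignores Urdu word-segmentation issues.

```python
import re

# Illustrative sample only; a real Urdu stop word list would be much longer.
URDU_STOP_WORDS = {"کا", "کی", "کے", "میں", "ہے", "ہیں", "اور", "سے", "کو", "پر", "یہ", "نے"}

# Sentence terminators: Urdu full stop (U+06D4), Arabic question mark (U+061F), "!" and "?".
SENTENCE_END = re.compile("[\u06D4\u061F!?]+")

# Punctuation to strip: Arabic comma (U+060C), Arabic semicolon (U+061B),
# the sentence terminators above, and common ASCII punctuation.
PUNCT_CHARS = "\u060C\u061B\u06D4\u061F.,;:!?\"'()[]{}-"
PUNCTUATION = re.compile("[" + re.escape(PUNCT_CHARS) + "]")


def split_sentences(text):
    """Split raw text into sentences on Urdu and ASCII sentence terminators."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence):
    """Replace punctuation with spaces, then split on whitespace."""
    return PUNCTUATION.sub(" ", sentence).split()


def remove_stop_words(words, stop_words=URDU_STOP_WORDS):
    """Drop words that appear in the stop word set."""
    return [w for w in words if w not in stop_words]
```

Each function returns a plain list, so the interface could display the intermediate result of every step before passing it on to sentence scoring.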
Sentence Selection
2. Sentences are selected using natural language processing techniques. All sentences that are important for the summary will be extracted.
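A minimal sketch of the selection step, assuming each sentence already has a numeric importance score from one of the techniques. The function name `select_sentences` and the `ratio` parameter (standing in for the user-selected summary length) are hypothetical and not taken from the project.

```python
import math


def select_sentences(sentences, scores, ratio=0.3):
    """Keep the top-scoring fraction of sentences, returned in original order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    if not sentences:
        return []
    # Always keep at least one sentence, however short the article is.
    k = max(1, math.ceil(len(sentences) * ratio))
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore document order so the summary reads naturally.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]
```

Returning the chosen sentences in their original order, rather than in score order, keeps the extract readable as a shortened version of the article.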
URL Scraper
3. The user can also provide a URL for summarization. The URL will be processed, and the relevant content will be scraped and displayed on the user interface.
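One possible way to scrape article text from a URL, using only the Python standard library, is sketched below. It assumes the article body sits inside `<p>` tags, which is not true of every site; the names `ParagraphExtractor`, `extract_paragraphs`, and `fetch_article` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> tags, ignoring <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._skip_depth = 0
        self._buffer = []

    def _flush(self):
        # Collapse runs of whitespace and store the paragraph if it is non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            if self._in_p:  # previous <p> was never closed
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the list of paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def fetch_article(url, timeout=10):
    """Download a page and return its paragraph texts."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_paragraphs(html)
```

Keeping the parsing in `extract_paragraphs` separate from the download in `fetch_article` means the extraction logic can be checked on saved HTML without network access.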
Urdu Extractive Summary and Comparison
4. The final deliverable will be the Urdu extractive summary generated using NLP. The summary will be compared with a human-written summary, and precision, recall, and F-measure will be calculated to check its quality. The techniques will be further improved to bring the system-generated summary closer to the human-written one.
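The comparison above can be sketched at sentence level as follows. This assumes the human-written summary is itself extractive, that is, made of sentences picked from the same article, so that exact sentence matching is meaningful; the proposal does not say which unit of comparison will be used, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(system_sentences, reference_sentences):
    """Compare a system summary with a human summary by exact sentence overlap."""
    system = set(system_sentences)
    reference = set(reference_sentences)
    overlap = len(system & reference)
    # Precision: share of system sentences that the human also chose.
    precision = overlap / len(system) if system else 0.0
    # Recall: share of human-chosen sentences that the system recovered.
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if the system selects three sentences and two of them are among the four chosen by the human, precision is 2/3, recall is 1/2, and the F-measure is 4/7.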
Final Deliverable of the Project
Software System

Core Industry
Education

Other Industries
Media

Core Technology
Artificial Intelligence (AI)

Other Technologies

Sustainable Development Goals
Quality Education

Required Resources
| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| Website Domain and Hosting | Miscellaneous | 1 | 8000 | 8000 |
| Report Printing | Miscellaneous | 90 | 15 | 1350 |
| Report Binding | Miscellaneous | 1 | 500 | 500 |
| Total (in Rs) | | | | 9850 |