Urdu News Corpa


2025-06-28 16:36:31 - Adil Khan

Project Title

Urdu News Corpa

Project Area of Specialization

Artificial Intelligence

Project Summary

We intend to build an Urdu news corpus that collects news from well-known Urdu news sources on a daily basis. After crawling news from websites such as BBC Urdu, Express Urdu, and Geo News, our system will tag the news according to different criteria. The system will also build relations between news items and find patterns in the news. Our primary goal is to build an extensive Urdu corpus that is based on real data and can serve both as a test bed for different approaches and as a source for real-life applications. The corpus will include a tagging system similar to DBpedia that works well for the Urdu language. It can be used in natural language processing systems and news aggregation systems.

Project Objectives

The objective of this project is to create a high-quality Urdu corpus that can be used in numerous natural language processing applications. The corpus will contain Urdu news from multiple news sources. Once the corpus has been correctly tagged and modeled, the repository can be used in machine learning systems, for example an ad hoc query system that can answer questions about events and the events related to them. The corpus can also be extended into a larger knowledge base that serves larger applications, much as DBpedia does today.

Project Implementation Method

The project requires multiple scheduled crawlers that periodically scrape data from different news sources. These crawlers will be developed in Python, and a document database, MongoDB, will be used to store the news in tagged form. The corpus will be served as a Web API for technical users and as a website for general users. Both the Web API and the website will be developed in Django, a popular MVC-style framework for web development. All of this will be hosted on a dedicated server.
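As a rough illustration of one crawler step, the sketch below turns a feed into document-style records of the kind a document database could store. It assumes, without confirmation from the proposal, that a news source exposes an RSS 2.0 feed. The sample feed, the parse_feed function, and the field names (_id, source, title, link, published, tags) are all hypothetical. Fetching the feed, scheduling, and writing to MongoDB are deliberately left out.

```python
import hashlib
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; a real crawler would download this text instead.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Urdu News</title>
    <item>
      <title>پہلی خبر</title>
      <link>https://example.com/news/1</link>
      <pubDate>Sat, 28 Jun 2025 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>دوسری خبر</title>
      <link>https://example.com/news/2</link>
      <pubDate>Sat, 28 Jun 2025 11:30:00 GMT</pubDate>
    </item>
    <item>
      <title>پہلی خبر</title>
      <link>https://example.com/news/1</link>
      <pubDate>Sat, 28 Jun 2025 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text, source):
    """Turn RSS 2.0 XML into a list of dicts, one per unique article link."""
    root = ET.fromstring(xml_text)
    docs = {}
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # skip items that cannot be identified
        # Hash of the link acts as a stable id, so repeated crawls
        # of the same article map to the same record.
        doc_id = hashlib.sha256(link.encode("utf-8")).hexdigest()
        pub = item.findtext("pubDate")
        docs[doc_id] = {
            "_id": doc_id,
            "source": source,
            "title": (item.findtext("title") or "").strip(),
            "link": link,
            "published": parsedate_to_datetime(pub).isoformat() if pub else None,
            "tags": [],  # to be filled in by a later tagging step
        }
    return list(docs.values())


if __name__ == "__main__":
    for doc in parse_feed(SAMPLE_FEED, "example-source"):
        print(doc["published"], doc["title"])
```

Using a hash of the article link as the record id is one possible design choice: it lets a scheduled crawler re-run over the same feed without creating duplicate entries.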

Benefits of the Project

The outcome of this project can be used as a base for an Urdu DBpedia. The corpus will be readily available through a dedicated server, which should encourage other researchers and developers to use it in their Urdu-based NLP approaches and in many real-life applications.

Technical Details of Final Deliverable

The final product will be offered as a service to others on a dedicated server. The corpus is planned to contain more than 20M predicates that serve as facts for knowledge extraction in Urdu-language applications.
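To make the idea of predicates as facts more concrete, here is a minimal sketch that represents a tagged news item as DBpedia-style subject-predicate-object triples. The proposal does not specify a schema, so the Triple type, the predicate names (publishedBy, hasTitle, taggedWith), the helper functions, and the sample record are all assumptions made for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def document_to_triples(doc):
    """Expand one tagged news record into a list of simple facts."""
    subject = doc["link"]  # the article URL identifies the news item
    triples = [
        Triple(subject, "publishedBy", doc["source"]),
        Triple(subject, "hasTitle", doc["title"]),
    ]
    for tag in doc.get("tags", []):
        triples.append(Triple(subject, "taggedWith", tag))
    return triples


def objects_of(triples, subject, predicate):
    """Answer a basic query: all objects for a given subject and predicate."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


# Hypothetical tagged record, not real corpus data.
sample_doc = {
    "link": "https://example.com/news/1",
    "source": "example-source",
    "title": "پہلی خبر",
    "tags": ["کھیل", "کرکٹ"],
}

if __name__ == "__main__":
    facts = document_to_triples(sample_doc)
    print(len(facts))
    print(objects_of(facts, sample_doc["link"], "taggedWith"))
```

Under this representation each tag on each article contributes one predicate, which gives a sense of how a large news archive could add up to millions of facts.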

Final Deliverable of the Project

HW/SW integrated system

Core Industry

IT

Other Industries

Media

Core Technology

Artificial Intelligence (AI)

Other Technologies

Others, Big Data

Sustainable Development Goals

Industry, Innovation and Infrastructure

Required Resources

Item Name                 Type            No. of Units   Per Unit Cost (in Rs)   Total (in Rs)
Dell PowerEdge T110 II    Equipment       1              70000                   70000
A4 Paper Ream             Miscellaneous   3              800                     2400
Reports Printing Cost     Miscellaneous   1000           5                       5000
Reports Binding Cost      Miscellaneous   3              500                     1500
Other Stationery Items    Miscellaneous   15             50                      750
Total (in Rs)                                                                    79650
