Optical Character Recognition for Urdu Language
2025-06-28 16:34:22 - Adil Khan
Project Area of Specialization: Artificial Intelligence
Project Summary:
Problem Statement:
A large body of Urdu literature is slowly being lost because the documents are very old. OCR technology can preserve it so that future generations can read it in digital form.
There are more than 170 million Urdu speakers in the world, but many of them live abroad and have no access to Urdu literature. Making this literature machine-readable, so that computers can extract and process text from images, would make it accessible to them.
Summary:
The project, titled Optical Character Recognition for Urdu Language, is about converting text in images into machine-readable text. Since Urdu documents exist almost exclusively in analog form, our OCR will let people convert them into digital form. Moreover, Urdu literature that is being lost because of its age can be preserved with OCR technology so that future generations can read it digitally. An Android application will be developed that takes an image as input, extracts the text from it, and presents the extracted text in editable form. Very little work has been done in this field: what exists is web-based projects and mostly English OCRs, and no Android-based Urdu OCR is available on the Google Play Store.
Project Objectives:
Urdu Optical Character Recognition is software that solves several problems faced by users. These are as follows:
Extracts Text:
Urdu OCR extracts the text from an image and provides it to the user in digital form. The user can then edit or copy the text as needed.
Provides Image of Unknown Text:
When Urdu OCR encounters a word it cannot recognize, it inserts an image of that word into the output, whereas other software gives incorrect output in this case. This lets the user read both the recognized Urdu text and the unrecognized text in the editor. The unknown text could be in another language, such as English, Arabic or Farsi.
Provides Blank Space:
Urdu OCR leaves a blank space where a character in the image is blurry, so that the user can type it in and correct the output. This lets the user add a word that is blurred or not recognized.
Urdu OCR is easy to use: the user simply points the phone's camera at a document containing Urdu text and the text is extracted. This intuitive process is considerably faster and gives a far better user experience than approaches that require manual data entry.
Project Implementation Method:
The following steps are involved in building the Urdu language OCR:
- Image Acquisition
- Pre-Processing
- Segmentation
- Feature Extraction
- Classification
Image Acquisition:
This is the process of taking input from the user. Input can be provided in two ways:
- The user can capture the image using their device.
- The user can upload the image from the gallery.
Pre-Processing:
Pre-processing prepares an image for the later steps of the pipeline, in this case segmentation, feature extraction and classification. The commonly used pre-processing techniques described below are the ones we are going to use.
- Gray Scaling: Converting the image to grayscale removes the color information, but the image still contains information that recognition does not need.
- Thresholding: The process of converting an RGB or grayscale image to a bi-level image is known as thresholding. It makes the acquired image smaller, and faster and easier to analyze, by removing all the unnecessary color information.
- Noise Removal: Acquired images are usually distorted by unwanted elements. The external disturbance that degrades an image signal is known as noise, and it is filtered out at this step.
- Smoothing: Smoothing is a procedure in which unwanted noise is removed from the edges of the image. Morphological operations are mostly used for this purpose.
- De-Skewing: Skew occurs when the lines of text in a document are tilted. It can be introduced by bad photocopying or scanning, and it causes numerous problems in segmentation, so the tilt is corrected at this step.
- Thinning: Thinning is the process of deleting the dark points along the edges of an object in an image, so that thick strokes are reduced toward a thin skeleton.
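The first two techniques can be sketched in plain Python as follows. This is a minimal illustration, not the project's implementation: the proposal does not name a thresholding algorithm, so Otsu's method is an assumption here, as are the function names and the representation of an image as nested lists of pixels.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


def otsu_threshold(gray_image):
    """Return the gray level that maximizes between-class variance (Otsu)."""
    histogram = [0] * 256
    for row in gray_image:
        for value in row:
            histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]      # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg      # pixels above this level
        if weight_fg == 0:
            break
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(gray_image, threshold):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in gray_image]


# Tiny 2x2 "scan": two dark ink pixels and two light paper pixels.
rgb = [[(0, 0, 0), (255, 255, 255)], [(250, 250, 250), (10, 10, 10)]]
gray = to_grayscale(rgb)
binary = binarize(gray, otsu_threshold(gray))
```

Choosing the threshold from the image's own histogram, rather than hard-coding it, is one way to cope with old documents whose paper has darkened unevenly from page to page.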
Segmentation:
The accuracy of an OCR system depends heavily on how well the words or ligatures that serve as classification units are segmented. If segmentation is not done correctly, the classifier has no chance of classifying the words correctly. Segmenting words is not an easy task for Urdu text, because many Urdu words are a combination of two or more ligatures. There are three types of segmentation:
- Lines Segmentation
- Ligature Segmentation
- Words Segmentation
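As a rough sketch of how the first two kinds of segmentation might work on a binarized image (1 for ink, 0 for paper): lines can be found from a horizontal projection profile, and candidate ligature pieces from connected components. Both techniques and all function names are assumptions for illustration; the proposal does not specify which segmentation algorithms will be used.

```python
from collections import deque


def segment_lines(binary_image, min_ink=1):
    """Return (start_row, end_row) pairs, end exclusive, one per text line."""
    row_ink = [sum(row) for row in binary_image]   # horizontal projection
    lines, start = [], None
    for index, ink in enumerate(row_ink):
        if ink >= min_ink and start is None:
            start = index                          # a line begins
        elif ink < min_ink and start is not None:
            lines.append((start, index))           # a line ends at a gap
            start = None
    if start is not None:
        lines.append((start, len(row_ink)))
    return lines


def connected_components(binary_image):
    """Return inclusive (top, left, bottom, right) boxes of 8-connected blobs."""
    height = len(binary_image)
    width = len(binary_image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary_image[r][c] != 1 or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:                           # breadth-first flood fill
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary_image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


page = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
lines = segment_lines(page)
first_line = page[lines[0][0]:lines[0][1]]
blobs = connected_components(first_line)
```

In Urdu script the dots and diacritics of a ligature are separate blobs from its main body, so a real system would need a further step that attaches each small component to the nearest main body before classification.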
Feature Extraction:
Feature extraction derives a set of features from an image so that the object or text it contains can be predicted from those features. We will use the following algorithms for feature extraction:
- Discrete Cosine Transform (DCT)
- Gabor Features
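To illustrate the first of these, the sketch below computes an orthonormal 2-D DCT-II of a small image and keeps the low-frequency coefficients in the top-left corner as the feature vector. Keeping a `size` by `size` corner, the function names, and the pure-Python implementation are assumptions made for readability; a real application would more likely call an optimized library routine. Gabor features are not shown.

```python
import math


def dct_1d(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    result = []
    for k in range(n):
        total = sum(
            value * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, value in enumerate(signal)
        )
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        result.append(scale * total)
    return result


def dct_2d(image):
    """Apply the 1-D DCT to every row, then to every column."""
    rows = [dct_1d(row) for row in image]
    columns = [dct_1d(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]    # transpose back to rows


def dct_features(image, size=2):
    """Flatten the top-left size x size block of DCT coefficients."""
    coefficients = dct_2d(image)
    rows = min(size, len(coefficients))
    cols = min(size, len(coefficients[0]))
    return [coefficients[r][c] for r in range(rows) for c in range(cols)]


# A uniform 4x4 patch has all of its energy in the first (DC) coefficient.
flat_patch = [[1, 1, 1, 1] for _ in range(4)]
features = dct_features(flat_patch, size=2)
```

Because most of the energy of a ligature image tends to sit in the low-frequency coefficients, a small corner of the DCT gives a compact fixed-length description that is convenient as classifier input.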
Classification and Recognition:
Classification and recognition algorithms predict the text or object from the features extracted in the previous steps. We will use a Hidden Markov Model (HMM) for classification and recognition.
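One common way to classify with HMMs, sketched below, is to train one model per ligature class and pick the class whose model gives the observation sequence the highest likelihood, computed with the forward algorithm. The proposal does not describe its HMM setup, so everything here is an assumption: the discrete observation symbols (imagined as quantized feature vectors), the two hand-written models, and the labels `ligature_a` and `ligature_b`.

```python
import math


def forward_log_likelihood(observations, start, transition, emission):
    """Log P(observations | model) using the scaled forward algorithm."""
    n_states = len(start)
    alpha = [start[s] * emission[s][observations[0]] for s in range(n_states)]
    log_likelihood = 0.0
    for t in range(len(observations)):
        if t > 0:
            alpha = [
                sum(alpha[p] * transition[p][s] for p in range(n_states))
                * emission[s][observations[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")       # sequence impossible under this model
        alpha = [a / scale for a in alpha]   # rescale to avoid underflow
        log_likelihood += math.log(scale)
    return log_likelihood


def classify(observations, models):
    """Return the label of the model that best explains the observations."""
    return max(
        models,
        key=lambda label: forward_log_likelihood(observations, *models[label]),
    )


# Hypothetical two-state left-to-right models: (start, transition, emission).
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "ligature_a": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.2, 0.8]]),
    "ligature_b": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.8, 0.2]]),
}
label_one = classify([0, 0, 1, 1], models)
label_two = classify([1, 1, 0, 0], models)
```

In practice the model parameters would be learned from labeled ligature images rather than written by hand, and the observation sequence would come from sliding a window across each segmented ligature.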
Benefits of the Project:
Urdu Optical Character Recognition will help users digitally preserve Urdu literature, which is slowly dwindling, and users can easily work with the extracted text by editing or copying it. It will make lengthy documents available in electronic form. Data preserved in digital form can be accessed quickly, within a few clicks. Our software differs from existing software because it includes some additional features: it provides an image of an unknown word, and a blank space when there is a blurry character in the image. Urdu Optical Character Recognition has many advantages for users:
- Saves Time:
Optical Character Recognition saves users time by preserving documents digitally. The user can scan images with the OCR, obtain the digitized output on the mobile screen, and work with that output efficiently.
- Make Changes:
The user can make changes to the text after it has been extracted on the mobile device using OCR technology. The extracted text opens in an editor, so the user can modify it or copy it for use elsewhere.
- Saves Storage:
Optical Character Recognition technology can save space on the user's device. Text saved in a document format requires less storage than the same content saved as an image.
- Different Features:
We provide an image of an unknown word so that the user still gets output for that word, whereas other software gives incorrect output. We also provide a blank space where there is a blurry character in the image, so that the user can type it in and correct the output.
By providing all of these facilities, the OCR for Urdu Language makes the user's work easier, and the user can apply it as needed. It helps with data-related tasks and leaves the user more time to concentrate on his or her primary goals.
Technical Details of Final Deliverable:
The final deliverable of our project, "Optical Character Recognition for Urdu Language", is an Android application that will perform the following tasks:
Extracts Text:
Urdu OCR extracts the text from an image and provides it to the user in digital form, and the user can edit or copy the text as needed. To do so, it carries out the following steps:
- Image Acquisition
- Pre-Processing
- Segmentation
- Feature Extraction
- Classification
Provides Image of Unknown Text:
When Urdu OCR encounters a word it cannot recognize, it inserts an image of that word into the output, whereas other software gives incorrect output in this case. This lets the user read both the recognized Urdu text and the unrecognized text in the editor. The unknown text could be in another language, such as English, Arabic or Farsi.
Provides Blank Space:
Urdu OCR leaves a blank space where a character in the image is blurry, so that the user can type it in and correct the output. This lets the user add a word that is blurred or not recognized.
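The three behaviours described above could be combined in a single output step along the following lines. This is only a sketch of one possible design: the `RecognizedWord` fields, the sharpness score, and both threshold values are hypothetical, since the proposal does not say how the application decides that a word is unknown or blurry.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RecognizedWord:
    text: Optional[str]   # classifier's best guess, or None if it has none
    confidence: float     # hypothetical score from 0.0 to 1.0
    sharpness: float      # hypothetical score: 0.0 blurred, 1.0 sharp
    crop_id: str          # identifier of the cropped word image


def compose_output(
    words: List[RecognizedWord],
    min_confidence: float = 0.6,   # assumed cut-off, not from the proposal
    min_sharpness: float = 0.3,    # assumed cut-off, not from the proposal
) -> List[Tuple[str, str]]:
    """Turn recognition results into editor items: text, image or blank."""
    output = []
    for word in words:
        if word.sharpness < min_sharpness:
            output.append(("blank", ""))             # user types it in
        elif word.text is None or word.confidence < min_confidence:
            output.append(("image", word.crop_id))   # show the original crop
        else:
            output.append(("text", word.text))
    return output


items = compose_output([
    RecognizedWord("اردو", 0.95, 0.9, "crop-0"),
    RecognizedWord(None, 0.10, 0.8, "crop-1"),       # e.g. an English word
    RecognizedWord("ادب", 0.70, 0.1, "crop-2"),      # too blurry to trust
])
```

Checking sharpness before confidence means that a blurry word always becomes a blank, even when the classifier happens to produce a confident guess for it.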
Final Deliverable of the Project: Software System
Core Industry: IT
Core Technology: Artificial Intelligence (AI)
Sustainable Development Goals: Quality Education
Required Resources:
| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| A4 Paper for Input | Miscellaneous | 1000 | 10 | 10000 |
| Scanning Equipment | Equipment | 10 | 1000 | 10000 |
| Total (in Rs) | | | | 20000 |