Applying Semantic Search and Natural Language Inference on Quranic Verses
2025-06-28 16:25:08 - Adil Khan
Project Area of Specialization: Artificial Intelligence
Project Summary:
This research project has two main parts. The first, scheduled for part 1 of the project, is a semantic similarity search that finds Quranic verses similar to the queries given to the system. The second is natural language inference: given a premise and an antecedent, the system reports a consequent label stating whether the antecedent agrees or disagrees with the premise. Although the project is mainly research, it includes a web interface through which users can interact with its features.
The interface is intended to offer three tasks: semantic search and two ways of performing inference. The semantic search task takes a query and displays similar Quranic verses on the interface. In the first inference task, the user supplies only an antecedent; the system finds a verse similar to the antecedent, uses that verse as the premise, and outputs a consequent label indicating whether the antecedent agrees or disagrees with the premise. In the second inference task, the user supplies both a premise and an antecedent, and the system outputs only the consequent label. The inference system is being researched to help users understand the verses. In practice, it could help an average user shortlist verses, so that the user brings specific verses to an expert instead of only an ambiguous phrase. The final result might still require an expert's verdict and validation.
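The two inference tasks described above can be sketched as follows. This is only an illustration of the control flow, not the project's implementation: the names `infer_with_premise` and `infer_from_antecedent`, the `retrieve` and `classify` callables, and the label strings are all hypothetical stand-ins for the semantic search component and the inference model.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not part of the original project):
# a retriever maps a query to a ranked list of similar verses, and a
# classifier maps (premise, antecedent) to a consequent label.
Retriever = Callable[[str], List[str]]
Classifier = Callable[[str, str], str]


def infer_with_premise(premise: str, antecedent: str, classify: Classifier) -> str:
    """Second inference task: the user supplies both premise and antecedent."""
    return classify(premise, antecedent)


def infer_from_antecedent(
    antecedent: str, retrieve: Retriever, classify: Classifier
) -> Tuple[str, str]:
    """First inference task: the user supplies only an antecedent.

    The most similar retrieved verse is used as the premise, and the
    premise is returned together with the consequent label.
    """
    verses = retrieve(antecedent)
    if not verses:
        raise LookupError("no similar verse found for the antecedent")
    premise = verses[0]
    return premise, classify(premise, antecedent)
```

Passing the retriever and classifier in as arguments keeps the sketch independent of any particular embedding or inference model, which matches the project's goal of comparing several models.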
Project Objectives:
- To find an efficient method to search and compare similarity vectors made from sentence embeddings and keyword extraction techniques.
- To retrieve similar ayahs from the Quran in terms of contextual meanings, identifying the most efficient similarity search techniques.
- To apply inference collectively to similar verses.
- To create a dataset to test inference.
- To test different pre-trained sentence embedding and inference models on the Quranic verses and their Urdu and English translations, in order to find which model performs best under our evaluations and to discover the limitations of these systems. This is one of the main objectives of the project.
- To find efficient data structures and algorithms (if any) for storing and retrieving sentence embeddings and keyword extraction vectors.
- To design a web-based application that fulfills the core requirements of an FYP and lets stakeholders interact with, and potentially benefit from, the research work.
Language Support of the System:
This section explains which languages we chose to support in this research project, or at least in part 1 of it. We chose Arabic because it is the language of the Quran; English because it is the international medium of communication, so supporting it preserves the international scope of the project; and Urdu because it is our national language.
Data:
Data for the Quran and its translations was abundantly available thanks to previous community work. We obtained the Tanzil data, thanks to our supervisor, Dr. Tafseer Ahmed Khan, as well as data from the OPUS parallel corpus. To evaluate the Semantic Textual Similarity task on the Quran, we used the subject-wise data available on QuranGo. Figure 3.1 shows an example of how the data was available, and Figure 3.2 shows an example of the parallel corpus we used, e.g. en-ur.
Pre-processing:
Because the models we used are pre-trained and generally perform well on semantic textual similarity (STS) for English and Urdu, only the Arabic text of the Holy Quran needed pre-processing, namely the removal of diacritics.
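Diacritic removal of this kind can be sketched in a few lines of Python. The exact set of marks the project removed is not stated, so the set below is an assumption: the short-vowel, tanwin, shadda, and sukun marks (U+064B to U+0652), the superscript alef (U+0670), and the tatweel (U+0640). The function name `strip_diacritics` is illustrative.

```python
import re

# Assumed set of characters to drop (not confirmed by the project):
#   U+064B-U+0652  fathatan ... sukun (harakat, tanwin, shadda)
#   U+0670         superscript alef
#   U+0640         tatweel (elongation character)
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text: str) -> str:
    """Return `text` with the assumed Arabic diacritic marks removed."""
    return _DIACRITICS.sub("", text)
```

The text is deliberately not Unicode-normalized first: decomposing letters such as alef with hamza would expose the hamza as a combining mark, and a broader filter could then remove it. Quranic annotation signs used in Uthmani-script editions fall outside this range and would need to be added if the source text contains them.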
Implementation:
Once the front-end is created, it needs to communicate with our research work, which is written in Python. To create a back-end server as quickly as possible, and to handle the heavy processing required for the user's tasks, we decided to use FastAPI. It lets us get a fully-fledged back-end server up and running very quickly, while also promising better performance than popular Python frameworks such as Django and Flask.
With the front-end and its communication with the processing system ready, we can move on to the engine of the system, i.e. the method we experimented with during our research. The processing block of the system, i.e. the "Retrieval of Quranic Verses" part, uses pre-computed embeddings of all 6236 verses of the Holy Quran in all 3 languages, excluding the count of "BismilLah" verses. Computing the embeddings for all Quranic verses in advance saves the cost (in terms of time and responsiveness) of embedding every verse at query time. The stored embeddings are then compared against the query using the similarity metric of our choice, cosine similarity. We intend to save embeddings from the two top-performing models and use them to calculate similarity against the query received from the front-end. Finally, the system sorts the similarity scores and returns the top 10 most related verses to the user's front-end interface through REST APIs.
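The ranking step can be illustrated with a minimal sketch that uses only the Python standard library. It assumes the query and verse embeddings are already available as plain lists of floats; how they are produced and stored is outside this sketch, and the names `cosine_similarity` and `top_k_verses` are illustrative rather than taken from the project.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_verses(
    query_vec: Sequence[float],
    verse_vecs: Sequence[Sequence[float]],
    k: int = 10,
) -> List[Tuple[int, float]]:
    """Return (verse_index, score) pairs for the k most similar verses,
    ordered from most to least similar."""
    scored = (
        (cosine_similarity(query_vec, vec), index)
        for index, vec in enumerate(verse_vecs)
    )
    return [(index, score) for score, index in heapq.nlargest(k, scored)]
```

With 6236 verses per language, a linear scan like this is a reasonable baseline. A vectorized implementation or an approximate nearest-neighbour index would be an alternative if response time becomes a concern.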
Benefits of the Project:
This research project aims to solve a problem for Muslims all over the world. First, an average person does not know which para, section, or verse of the Holy Quran addresses their concern. Our method addresses this first problem by using semantic search techniques to find and deliver verses related to the user's query. The second part of the project addresses the other half of the problem: did the user understand the verse correctly? The user can test their understanding, shortlist the ayahs and results for the respective premises and antecedents, and then consult a professional with the shortlisted results. Used in tandem, the two parts are expected to cut the cost roughly in half in terms of time and effort.
Technical Details of Final Deliverable:
- Explore and analyze which techniques, methods, and deep learning models would be beneficial for the purpose of solving the research problem.
- Identify 3 pre-trained models and 1 VSM (classical machine learning NLP) approach to use for evaluation and testing in the project.
- Use the preceding details to decide which techniques or models to use for our production (user-facing) application.
- Test and select the best-performing semantic search models, and identify the pros and cons of each.
- Implement Semantic Search System for APIs.
- Prepare a fine-tuning dataset for Natural Language Inference.
- Identify the best-performing models for Natural Language Inference.
- Implement Natural Language Inference System for APIs.
- Develop a user-facing application that communicates with the server through REST APIs.
- Implement Server Application and the relevant APIs and processes.