Bigdata framework for efficient link prediction
2025-06-28 16:30:37 - Adil Khan
Project Area of Specialization: Artificial Intelligence

Project Summary
Apache Hadoop and Apache Spark have gained prominence in the last few years as scalable big data processing frameworks. Although Apache Spark is more efficient than Apache Hadoop because of its in-memory computation, operations such as Broadcast and Shuffle prevent many machine learning and evolutionary algorithms from achieving a significant speedup. Broadcasts and Actions are the only mechanisms for communication between partitions; this communication improves diversity and keeps partitions from getting stuck in local optima. However, it also incurs network overhead and therefore degrades performance. We aim to develop a library that works on top of Apache Spark and handles the trade-off between network communication and performance gain more effectively. To verify the proposed library operations, we will compare the performance of ML and evolutionary algorithms on the standard Apache Spark framework with and without them, using standard benchmarks for experimentation.
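The trade-off described in the summary can be illustrated with a small island-model sketch in plain Scala. This is not the proposed library and it uses no Spark API: each in-memory "island" merely stands in for one partition, and the names `IslandModelSketch`, `evolve`, `migrate`, and `migrationInterval` are hypothetical, chosen only for this illustration. The sphere function and the mutation step size are likewise assumptions. The sketch shows how a migration interval controls the number of communication rounds, which on a cluster would each cost a broadcast or an action.

```scala
import scala.util.Random

// Hypothetical sketch: each "island" stands in for one Spark partition.
// Nothing here is Spark API or the proposed library.
object IslandModelSketch {
  type Individual = Vector[Double]

  // Sphere function: a common minimisation benchmark with optimum 0 at the origin.
  def fitness(x: Individual): Double = x.map(v => v * v).sum

  // One local generation: mutate every individual, keep the better of parent and child.
  def evolve(pop: Vector[Individual], rng: Random): Vector[Individual] =
    pop.map { parent =>
      val child = parent.map(_ + rng.nextGaussian() * 0.1)
      if (fitness(child) < fitness(parent)) child else parent
    }

  // Migration: every island replaces its worst individual with the globally best one.
  // On a cluster, this is the step that would require communication between partitions.
  def migrate(islands: Vector[Vector[Individual]]): Vector[Vector[Individual]] = {
    val best = islands.flatten.minBy(fitness)
    islands.map { pop =>
      val worstIdx = pop.indices.maxBy(i => fitness(pop(i)))
      pop.updated(worstIdx, best)
    }
  }

  // Returns (best fitness found, number of communication rounds performed).
  def run(numIslands: Int, popSize: Int, dims: Int, generations: Int,
          migrationInterval: Int, seed: Long): (Double, Int) = {
    val rng = new Random(seed)
    var islands: Vector[Vector[Individual]] =
      Vector.fill(numIslands, popSize)(Vector.fill(dims)(rng.nextDouble() * 10 - 5))
    var rounds = 0
    for (g <- 1 to generations) {
      islands = islands.map(evolve(_, rng))
      if (g % migrationInterval == 0) {
        islands = migrate(islands)
        rounds += 1
      }
    }
    (islands.flatten.map(fitness).min, rounds)
  }
}
```

Calling `IslandModelSketch.run(4, 10, 3, 100, 10, 42L)` and then the same call with a `migrationInterval` of 100 gives two runs that differ only in how often the islands exchange individuals. Comparing the returned fitness against the returned round count is a rough, single-machine analogue of the comparison the proposal plans to make on a cluster.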
Project Objectives
The library aims to achieve speedup by reducing network communication, and thereby to better meet the demand for performance on Big Data platforms. It would make Big Data frameworks more suitable for executing algorithms with acceptable performance while maintaining accuracy, which in turn encourages scaling ML algorithms.
Project Implementation Method
Implementation using:
Apache Spark
Scala
Apache Maven
Apache HDFS
Scala Eclipse IDE
Benefits of the Project
Speedup will be achieved while reducing network communication, so the demand for better performance on Big Data platforms can be met more fully. The library would make Big Data frameworks more suitable for executing algorithms with acceptable performance while maintaining accuracy, which in turn encourages scaling ML algorithms.
Technical Details of Final Deliverable
Library in Scala/Spark that would be runnable on Apache Spark clusters
Library Manual
Library Documentation
Final Deliverable of the Project: Software System
Type of Industry: IT
Technologies: Artificial Intelligence (AI), Big Data
Sustainable Development Goals: Decent Work and Economic Growth, Partnerships to achieve the Goal

Required Resources
| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| DataBricks-Amazon AWS Server Usage Price based on BTU | Equipment | 1 | 58000 | 58000 |
| Total (in Rs) | | | | 58000 |