Big Data framework for efficient link prediction


2025-06-28 16:30:37 - Adil Khan

Project Title

Big Data framework for efficient link prediction

Project Area of Specialization

Artificial Intelligence

Project Summary

Apache Hadoop and Apache Spark have gained prominence in the last few years as scalable big data processing frameworks. Although Apache Spark is more efficient than Apache Hadoop because of its in-memory computation, operations such as Broadcast and Shuffle prevent many machine learning and evolutionary algorithms from achieving a significant speedup. Broadcasts and Actions are the only mechanisms of communication between partitions. This communication improves diversity and keeps partitions from getting stuck in local optima, but it also incurs network overhead and therefore degrades performance. We aim to develop a library that works on top of Apache Spark and handles the trade-off between network communication and performance gain more effectively. To verify the proposed library operations, we will compare the performance of ML and evolutionary algorithms on the standard Apache Spark framework with and without them, using standard benchmarks for experimentation.
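The trade-off between communication and independent computation can be illustrated with a minimal island-model sketch. This is not the proposed library and it does not use Apache Spark: plain Python lists stand in for partitions, a toy sphere function stands in for a real benchmark, and the names `run_islands`, `migration_interval`, and the message-counting model are assumptions introduced only for illustration. Each island evolves on its own, and every `migration_interval` generations the global best individual is shared with all islands, which is the step that would map to a collect plus a broadcast on a cluster.

```python
import random


def sphere(x):
    """Toy objective to minimize: sum of squares (optimum 0 at the origin)."""
    return sum(v * v for v in x)


def evolve_one_generation(population, rng, sigma=0.1):
    """Local work only: mutate each individual, keep the child if it is better."""
    next_pop = []
    for ind in population:
        child = [v + rng.gauss(0.0, sigma) for v in ind]
        next_pop.append(child if sphere(child) < sphere(ind) else ind)
    return next_pop


def run_islands(num_islands=4, pop_size=10, dims=5,
                generations=40, migration_interval=10, seed=0):
    """Return (best fitness, number of simulated messages).

    migration_interval=0 disables communication entirely (assumed convention).
    """
    rng = random.Random(seed)
    # One population per island; an island plays the role of a partition.
    islands = [
        [[rng.uniform(-5, 5) for _ in range(dims)] for _ in range(pop_size)]
        for _ in range(num_islands)
    ]
    messages = 0
    for gen in range(1, generations + 1):
        islands = [evolve_one_generation(pop, rng) for pop in islands]
        if migration_interval and gen % migration_interval == 0:
            # Communication step: every island reports its best individual,
            # then the global best is sent back to every island.
            local_bests = [min(pop, key=sphere) for pop in islands]
            global_best = min(local_bests, key=sphere)
            messages += 2 * num_islands  # assumed cost: one report + one send per island
            for pop in islands:
                worst = max(range(len(pop)), key=lambda i: sphere(pop[i]))
                pop[worst] = list(global_best)  # replace the worst individual
    best = min(sphere(ind) for pop in islands for ind in pop)
    return best, messages


if __name__ == "__main__":
    # Compare no communication, occasional migration, and migration every generation.
    for interval in (0, 10, 1):
        best, msgs = run_islands(migration_interval=interval)
        print(f"interval={interval:2d}  messages={msgs:3d}  best={best:.4f}")
```

Under this simplified cost model, migrating every generation sends ten times as many messages as migrating every tenth generation. Whether the extra communication pays off in solution quality depends on the algorithm and benchmark, which is the question the proposed experiments are meant to answer.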

Project Objectives

The library aims to achieve speedup by reducing network communication, thereby addressing the demand for better performance on Big Data platforms. It would make Big Data frameworks more suitable for executing algorithms with acceptable performance while maintaining accuracy, and so encourage the scaling of ML algorithms.

Project Implementation Method

Implementation using:

Apache Spark

Scala

Apache Maven

Apache HDFS

Scala Eclipse IDE

Benefits of the Project

Speedup is achieved while network communication is reduced, so the demand for better performance on Big Data platforms can be met more fully. The library would make Big Data frameworks more suitable for executing algorithms with acceptable performance while maintaining accuracy, and so encourage the scaling of ML algorithms.

Technical Details of Final Deliverable

Library in Scala/Spark that would be runnable on Apache Spark clusters

Library Manual

Library Documentation

Final Deliverable of the Project

Software System

Type of Industry

IT

Technologies

Artificial Intelligence (AI), Big Data

Sustainable Development Goals

Decent Work and Economic Growth, Partnerships to achieve the Goal

Required Resources

Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs)
DataBricks-Amazon AWS Server Price Usage Price based on BTU | Equipment | 1 | 58000 | 58000

Total (in Rs): 58000
