Semi Supervised learning based IoT Malware Classification

2025-06-28 16:34:58 - Adil Khan

Project Title

Semi Supervised learning based IoT Malware Classification

Project Area of Specialization

Internet of Things

Project Summary

The threat of malware grows every day. Traditional approaches struggle to detect unknown malware because its signatures vary. Deep learning-based approaches have recently attracted attention for malware detection, but their main obstacle is the scarcity of labeled data needed to train a deep neural network efficiently. To address this, we will use semi-supervised learning approaches to label the unlabeled data. We will also compare different semi-supervised learning approaches to find which is best suited to the malware dataset.

The best approach will be selected to create a semi-supervised deep learning-based framework to label and categorize the unknown malware samples.

Project Objectives

The objectives of our project are as follows:

  1. To collect unlabeled data from sources such as VirusShare and Kaggle datasets.

  2. To research the best ConvNet architectures for our desired output.

  3. To train and validate a deep learning model such as a Convolutional Neural Network.

  4. To learn the labels of the gathered unlabeled datasets using various semi-supervised learning algorithms.

  5. To label the unlabeled dataset.

  6. To detect malware samples in real time, within a short duration and using little computation power.

Project Implementation Method

We will implement the project in the following steps:

  1. Preprocessing:

The data is gathered in ‘.bytes’ format. To apply a Convolutional Neural Network, the data must be in image format, so we will convert the ‘.bytes’ files to images using the best method identified through research.

After conversion, the images must be resized to a fixed size, because the network needs inputs of specific dimensions.

After resizing, the images must be converted to tensors, a data structure that exploits hardware parallelism and matches the input expected by our chosen ConvNet architecture.
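The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's chosen method: it assumes each line of a ‘.bytes’ file holds an address token followed by two-character hex bytes, with `??` marking an unreadable byte, and the function names, the row width, and the nearest-neighbour resize are all hypothetical choices made for the example.

```python
import math


def parse_bytes_text(text):
    """Parse hex-dump text into a flat list of byte values (0-255).

    Assumed layout: an address token, then two-character hex bytes.
    '??' (unreadable byte) is mapped to 0 in this sketch.
    """
    values = []
    for line in text.splitlines():
        tokens = line.split()
        for token in tokens[1:]:  # skip the leading address token
            values.append(0 if token == "??" else int(token, 16))
    return values


def bytes_to_image(values, width=16):
    """Arrange byte values row by row into a 2-D grayscale image."""
    height = math.ceil(len(values) / width)
    padded = values + [0] * (height * width - len(values))
    return [padded[r * width:(r + 1) * width] for r in range(height)]


def resize_nearest(image, out_h, out_w):
    """Resize a 2-D image to out_h x out_w with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def to_unit_range(image):
    """Scale pixel values from 0-255 to floats in [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


# Tiny made-up sample in the assumed layout.
sample = "00401000 56 8D 44 24\n00401004 08 50 ?? FF\n"
pixels = bytes_to_image(parse_bytes_text(sample), width=4)
resized = to_unit_range(resize_nearest(pixels, 4, 4))
```

The resulting nested list of floats could then be passed to the tensor constructor of whichever deep learning framework is selected.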

  2. Deep Learning Model:

After preprocessing, we will research which deep learning model to use for training and validation. We will study different convolutional neural network architectures and analyze which best fits our goal. We will then validate the Convolutional Neural Network using techniques such as K-fold cross-validation and hold-out validation.
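The two validation schemes mentioned above differ only in how sample indices are split. The sketch below shows both in plain Python; the function names, the 20% validation fraction, and the fixed seed are assumptions made for illustration.

```python
import random


def hold_out_split(n_samples, val_fraction=0.2, seed=0):
    """Shuffle indices once and reserve a fixed fraction for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]  # (train, validation)


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield k (train, validation) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train_idx, val_idx = hold_out_split(10, val_fraction=0.2)
splits = list(k_fold_splits(10, k=5))
```

Because the classes in the dataset are imbalanced, a stratified variant that preserves class proportions in every fold may be preferable; this sketch does not do that.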

  3. Semi-Supervised Learning:

After selecting the best deep learning model, we will apply state-of-the-art semi-supervised learning approaches such as Pseudo Labeling and Cluster-then-label to make use of the unlabeled dataset. This should improve the generalization capability of our deep learning model, because the ConvNet will have more data to train on. We will then compare the results of all the implemented semi-supervised approaches to determine which is best suited to our goal.
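As an illustration of the pseudo-labeling idea, the sketch below keeps only the unlabeled samples for which the trained model is confident and assigns them the predicted class; these samples would then be added to the training set for another round of training. The function name, the 0.95 threshold, and the probability values are assumptions for the example, not results from the project.

```python
def select_pseudo_labels(probabilities, threshold=0.95):
    """Return (sample_index, predicted_class) pairs for confident predictions.

    `probabilities` holds one list of class probabilities per unlabeled sample,
    as produced by an already trained classifier.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        best_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_class] >= threshold:
            selected.append((index, best_class))
    return selected


# Made-up model outputs for three unlabeled samples and three classes.
predicted = [
    [0.97, 0.02, 0.01],  # confident: class 0
    [0.40, 0.35, 0.25],  # not confident: left unlabeled
    [0.01, 0.01, 0.98],  # confident: class 2
]
pseudo_labeled = select_pseudo_labels(predicted, threshold=0.95)
```

A high threshold trades coverage for label quality: fewer samples are labeled, but fewer wrong labels are fed back into training.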

Benefits of the Project

The benefits of our project are as follows:

  1. A labeled malware dataset, produced using state-of-the-art semi-supervised learning.

  2. A trained and tested Convolutional Neural Network model that can classify malware accurately.

  3. A complete framework that identifies any ‘.bytes’ file supplied by the user as malware or benign and, if the file is malware, further classifies which class the malware belongs to.

  4. A framework that labels and classifies malware samples in a cost-effective manner.

Technical Details of Final Deliverable
  1. Convolutional Neural Network:

While implementing the ConvNet, we plan to test different architectures such as a naive architecture, VGG-16, VGG-19, and ResNet, and then compare which is best suited to detecting malware.

Research is needed to fine-tune the hyperparameters of the ConvNet model to achieve the best possible accuracy.

We also need to handle issues such as class imbalance: our dataset has 9 classes with an unequal number of instances per class, which may make it harder to distinguish the classes from one another.
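One common way to counter class imbalance is to weight each class inversely to its frequency when computing the training loss. The proposal does not commit to this technique; the sketch below is only an example of how such weights could be derived, with a hypothetical function name and made-up label counts.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up imbalanced label list: 6, 3, and 1 samples for classes 0, 1, 2.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = inverse_frequency_weights(labels)
```

Oversampling rare classes or undersampling frequent ones are alternatives that could be compared against this weighting.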

  2. Semi-Supervised Learning:

The next deliverable of this project will be a semi-supervised learning model. We will implement and analyze different semi-supervised learning approaches such as Pseudo Labeling, Cluster-then-label, and Graph-Based Semi-Supervised Learning, and then compare the resulting models to select the best approach for labeling the malware dataset.
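To make the Cluster-then-label approach concrete, the sketch below shows only its labeling step: it assumes the samples have already been grouped by some clustering algorithm and gives every unlabeled sample the majority label of the labeled samples in its cluster. The function name and the family names are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_then_label(cluster_ids, labels):
    """Fill in missing labels (None) with the majority label of each cluster.

    Samples in clusters that contain no labeled sample stay None.
    """
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        if label is not None:
            votes[cluster][label] += 1

    result = []
    for cluster, label in zip(cluster_ids, labels):
        if label is not None:
            result.append(label)  # keep known labels unchanged
        elif votes[cluster]:
            result.append(votes[cluster].most_common(1)[0][0])
        else:
            result.append(None)  # no labeled neighbour in this cluster
    return result


# Made-up cluster assignments and partial labels.
cluster_ids = [0, 0, 0, 1, 1, 2]
partial = ["family_a", "family_a", None, "family_b", None, None]
completed = cluster_then_label(cluster_ids, partial)
```

The quality of the result depends entirely on how well the clusters align with the true malware classes, which is one of the things the planned comparison would measure.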

  3. A Complete Framework for classifying and labeling malware:

The final deliverable for this project will be a complete framework that lets the user input a ‘.bytes’ file. At the backend, we will convert the file to an image and feed it to our ConvNet model, which will classify whether it is malware or not. If the file is malware, the framework will further classify it to tell the user which malware class it belongs to.
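The two-stage decision described above could be wired together roughly as follows. This is a sketch of the control flow only: `classify_sample` and the two toy models are hypothetical stand-ins, and the real framework would pass the preprocessed image to the trained ConvNet instead.

```python
def classify_sample(image, is_malware, predict_family):
    """Stage 1 decides malware vs benign; stage 2 runs only for malware."""
    if not is_malware(image):
        return {"malware": False, "family": None}
    return {"malware": True, "family": predict_family(image)}


# Toy stand-ins for the trained models (assumptions for illustration only).
def toy_detector(image):
    return sum(map(sum, image)) > 1.0


def toy_family_model(image):
    return 3  # pretend every malware sample belongs to class 3


benign_result = classify_sample([[0.0, 0.1]], toy_detector, toy_family_model)
malware_result = classify_sample([[0.9, 0.8]], toy_detector, toy_family_model)
```

Keeping the two stages as separate callables would let a single multi-class model (with an extra benign class) be swapped in later without changing the calling code.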

Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Security
Core Technology: Artificial Intelligence (AI)
Other Technologies: Internet of Things (IoT)
Sustainable Development Goals: Good Health and Well-Being for People

Required Resources

Item Name         Type       No. of Units  Per Unit Cost (in Rs)  Total (in Rs)
GeForce RTX 2070  Equipment  1             70000                  70000
Total (in Rs)                                                     70000
