DGA detection using XGBoost

This project was part of my IMT 575 course, 'Data Science III: Scaling, Application, and Ethics'. The aim was to build an application that classifies a domain name as either benign or generated by a domain generation algorithm (DGA). First, we generated a master dataset of about 5 million rows. To keep the dataset balanced, we used equal numbers of benign and DGA domains, with the DGA domains produced by 40 different DGAs. We then computed features such as domain length, count of capital letters, count of digits, count of consecutive consonants, and domain entropy. We trained an XGBoost classifier on these features and, after hyperparameter tuning, reached a testing accuracy of 93.16%. The application was built on AWS services including Lambda, API Gateway, and SageMaker.
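
The lexical features listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual code: the function names (`shannon_entropy`, `longest_consonant_run`, `extract_features`) and the feature keys are hypothetical, entropy is assumed to be character-level Shannon entropy, and "count of consecutive consonants" is assumed to mean the longest run of consonants in the domain.

```python
import math
from collections import Counter

# ASCII consonants only; 'y' is treated as a consonant (an assumption).
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_consonant_run(text: str) -> int:
    """Length of the longest run of consecutive consonants (case-insensitive)."""
    longest = current = 0
    for ch in text.lower():
        if ch in CONSONANTS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def extract_features(domain: str) -> dict:
    """Build one feature row for a domain name (hypothetical feature names)."""
    return {
        "length": len(domain),
        "capitals": sum(ch.isupper() for ch in domain),
        "digits": sum(ch.isdigit() for ch in domain),
        "max_consonant_run": longest_consonant_run(domain),
        "entropy": shannon_entropy(domain),
    }


if __name__ == "__main__":
    for name in ("Example123.com", "xkqzvbnrtw.net"):
        print(name, extract_features(name))
```

Rows built this way could be collected into a table and passed to a gradient-boosted classifier such as `xgboost.XGBClassifier`. Algorithmically generated names tend to have higher entropy and longer consonant runs than human-chosen ones, which is presumably why features of this kind help separate the two classes.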

The project files can be found here.