Machine Learning Algorithms Analysis of Synthetic Minority Oversampling Technique (SMOTE): Application to Credit Default Prediction
Abstract
Credit default prediction is an important problem in financial risk management: it aims to estimate the likelihood that borrowers will default on their loan commitments. However, the datasets used to train machine learning models for this kind of data-driven decision support typically suffer from class imbalance, that is, an uneven distribution of class labels in which defaulters form only a small minority. This problem arises in classification tasks whenever the labels in a dataset are far from uniformly distributed. A common remedy is to resample the data, either by adding examples to the minority class or by removing examples from the majority class.
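As an illustration, the following is a minimal sketch of the three resampling strategies considered in this paper, written with the open-source imbalanced-learn toolbox. The synthetic data and the 95/5 class ratio are assumptions made purely for illustration and do not reflect the bank dataset used in the study.

```python
# Minimal sketch (not the authors' code): comparing under-sampling,
# oversampling, and SMOTE on a synthetic, deliberately skewed dataset.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for a credit dataset: roughly 5% defaulters (class 1).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
print("original:", Counter(y))

samplers = {
    "under-sampling": RandomUnderSampler(random_state=42),
    "oversampling": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{name}:", Counter(y_res))
```

Each strategy produces a training set with equal class counts: under-sampling discards majority-class rows, oversampling duplicates minority-class rows, and SMOTE interpolates new synthetic minority examples between existing ones.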
The present study examines the efficacy of classification algorithms under various data balancing approaches. The dataset was collected from a well-known commercial bank in Ghana. To resolve the imbalance, three data balancing approaches were applied: under-sampling, oversampling, and the synthetic minority oversampling technique (SMOTE). With the exception of the SMOTE dataset, XGBoost consistently outperformed the other classifiers in terms of AUC across the remaining datasets. Random forest, decision tree, and logistic regression also performed well and could serve as alternatives to XGBoost for developing credit scoring models. Classifiers trained on the balanced datasets achieved higher sensitivity scores than those trained on the original skewed dataset while retaining their ability to discriminate between defaulters and non-defaulters, which demonstrates the value of data balancing strategies in improving the prediction of minority-class cases. The main finding is that oversampling outperforms under-sampling across classifiers and evaluation measures.
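The comparison described above can be sketched as follows. This is not the authors' code; the synthetic data, the classifier settings, and the 70/30 split are assumptions made solely for illustration. Resampling is applied to the training split only, and each classifier is then scored on the untouched, still-skewed test set using AUC and sensitivity (recall on the defaulter class).

```python
# Minimal, self-contained sketch (assumed synthetic data, not the bank
# dataset used in the study): four classifiers trained on the original
# and on three rebalanced training sets, scored on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, recall_score
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Resampling is applied to the training split only; the test split stays skewed.
training_sets = {"original": (X_train, y_train)}
for name, sampler in [
    ("under-sampling", RandomUnderSampler(random_state=0)),
    ("oversampling", RandomOverSampler(random_state=0)),
    ("SMOTE", SMOTE(random_state=0)),
]:
    training_sets[name] = sampler.fit_resample(X_train, y_train)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "XGBoost": XGBClassifier(random_state=0),
}

for data_name, (X_tr, y_tr) in training_sets.items():
    for clf_name, clf in classifiers.items():
        clf.fit(X_tr, y_tr)
        auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        sens = recall_score(y_test, clf.predict(X_test))  # sensitivity on class 1
        print(f"{data_name:>14} | {clf_name:>19} | AUC={auc:.3f} sensitivity={sens:.3f}")
```

Keeping the test set at its natural class ratio mirrors the deployment setting, so any sensitivity gains from balancing the training data can be measured fairly against the unchanged AUC criterion.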
DOI: http://dx.doi.org/10.23755/rm.v53i0.1601
Copyright (c) 2024 Emmanuel de-Graft Johnson Owusu-Ansah, Richard Doamekpor, Richard Kodzo Avuglah, Yaa Kyere Adwubi

This work is licensed under a Creative Commons Attribution 4.0 International License.
Ratio Mathematica - Journal of Mathematics, Statistics, and Applications. ISSN 1592-7415; e-ISSN 2282-8214.