Evaluation of the C4.5 Decision Tree and Random Forest Classification Algorithms in Predicting Diabetes

Authors

  • Caecar Mikha Krisnanda, Universitas Sanata Dharma
  • Kristiawan Dwi Usmanto, Institut Teknologi Sepuluh Nopember, https://orcid.org/0009-0002-8879-5399
  • Jonathan Jason Kristanto, Southern Taiwan University of Science and Technology

DOI:

https://doi.org/10.25170/jurnalelektro.v19i1.7876

Keywords:

C4.5 Decision Tree, Random Forest, Data Balancing, Hyperparameter Tuning, Imbalanced Data, Diabetes Prediction

Abstract

This study investigates diabetes prediction as a binary classification task using the C4.5 Decision Tree and Random Forest algorithms on the Pima Indians Diabetes dataset. The objective is to compare the performance of both algorithms under three experimental settings: without data balancing, with data balancing, and with hyperparameter tuning but no balancing. The dataset consists of 768 records: 500 non-diabetes cases and 268 diabetes cases. Preprocessing included data cleaning, Box-Cox transformation, min-max normalization, feature selection, and splitting the data into 80% training data and 20% test data. Model performance was evaluated using accuracy, precision, recall, and F1-score with 3-fold, 5-fold, and 9-fold cross-validation. Random Forest outperformed the C4.5 Decision Tree in all reported settings. Without balancing, Random Forest achieved its highest accuracy of 77.82%, while C4.5 achieved 69.65%. With data balancing, both models improved: Random Forest reached the best overall reported accuracy of 84.19%, compared with 75.68% for C4.5. With hyperparameter tuning and no balancing, Random Forest achieved 78.18%, while C4.5 achieved 74.18%. These findings indicate that Random Forest is more robust and effective than the C4.5 Decision Tree for diabetes prediction, and that data balancing improves performance more than hyperparameter tuning alone.
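The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes diabetes is treated as the positive class, which the abstract does not state. The names `ConfusionCounts`, `safe_div`, and `metrics` are hypothetical, and the always-negative baseline is an illustration built only from the class counts given above, not a result from the study.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (assumption: diabetes is the positive class)."""
    tp: int  # diabetes predicted as diabetes
    fp: int  # non-diabetes predicted as diabetes
    fn: int  # diabetes predicted as non-diabetes
    tn: int  # non-diabetes predicted as non-diabetes


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1-score from the counts."""
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    return {
        "accuracy": safe_div(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Class counts stated in the abstract.
NON_DIABETES, DIABETES = 500, 268

# Hypothetical baseline: a model that always predicts "non-diabetes".
majority_baseline = metrics(
    ConfusionCounts(tp=0, fp=0, fn=DIABETES, tn=NON_DIABETES)
)

if __name__ == "__main__":
    for name, value in majority_baseline.items():
        print(f"{name}: {value:.4f}")
```

On the full dataset this hypothetical baseline reaches an accuracy of 500/768 (about 65.1%) with zero recall, which shows why precision, recall, and F1-score are useful alongside accuracy when the classes are imbalanced.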

Published

2026-04-28