Comparative Performance Analysis of BERT and RoBERTa for Email Spam Classification

Purwadi; Hafizh Dzaky Ahya Gemilang

doi:10.59934/jaiea.v5i2.1968

Authors

Purwadi, Universitas Amikom Purwokerto
Hafizh Dzaky Ahya Gemilang, Universitas Amikom Purwokerto

Keywords:

BERT, Email Spam Classification, RoBERTa, Text Classification, Transformer Models

Abstract

The rapid advancement of information technology has increased the use of email as a primary digital communication medium, while also contributing to the growing volume of spam emails that threaten productivity and information security through phishing and malware. An accurate and adaptive email spam classification system is therefore required. This study analyzes and compares the performance of the BERT and RoBERTa transformer models for email spam classification. An experimental research approach was employed using an email dataset consisting of spam and non-spam (ham) classes. The research process includes data collection, text preprocessing, model fine-tuning, and performance evaluation using accuracy, precision, recall, F1-score, and the confusion matrix. The results show that both BERT and RoBERTa achieve high classification performance. However, RoBERTa demonstrates superior results, particularly in terms of spam recall and overall accuracy, indicating a stronger ability to detect spam emails. This advantage is attributed to RoBERTa’s optimized pre-training strategy, which improves contextual semantic understanding of email content. In conclusion, RoBERTa is more effective than BERT for email spam classification and can serve as a reliable model for developing robust transformer-based spam detection systems.
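The evaluation metrics named in the abstract can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python, not the authors' evaluation code: the function names `confusion_counts` and `classification_metrics` are hypothetical, and the toy labels are invented for illustration and are not taken from the study's dataset or results.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # spam correctly flagged
        elif pred == positive:
            fp += 1  # ham wrongly flagged as spam
        elif truth == positive:
            fn += 1  # spam that slipped through
        else:
            tn += 1  # ham correctly passed
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, invented for illustration only (not the study's data).
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "ham", "spam", "ham", "ham", "spam"]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts)
print(metrics)
```

Spam recall, the metric the abstract highlights for RoBERTa, is the share of actual spam messages that the model flags, so a missed spam email (a false negative) lowers recall without affecting precision.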
