Sensitivity of Machine Learning Algorithms to Dataset Drift for the Natural Language Processing Application of Spam Filters

Tonya Fields, Pace University

Abstract

The data used for many machine learning projects drift over time. This dataset drift causes model performance to deteriorate, which can have devastating consequences for critical real-world applications that rely on model accuracy and robustness, such as autonomous vehicles, fraud detection, and medical diagnosis. A model's resilience to dataset drift determines the degree to which, and the ways in which, it suffers in production. The impact of data drift depends on both the selected algorithm and the application domain. Domains involving graphics and pattern recognition, such as traffic-toll systems and image classification, are less likely to suffer from data drift because the training data is static and is expected to have the same distribution that will be seen in production. In contrast, the output of daily business processes that rely on natural language understanding for text classification involving user behavior is subject to change over time. We study the stability of four popular, high-performing machine learning algorithms on the text-classification task of email spam classification. In this work, we compare the resilience to dataset drift of the Random Forest (RF) algorithm, the Convolutional Neural Network (CNN), the Long Short-Term Memory (LSTM) network, and the pre-trained transformer architecture Bidirectional Encoder Representations from Transformers (BERT) developed by Google. To simulate various amounts of data drift, we create four hybrid datasets using the SpamAssassin benchmark dataset combined with varying percentages of emails from the Enron benchmark dataset. Our study found that for the specific natural language understanding task of text classification in spam filters, RF, CNN, and LSTM were less likely to suffer from data drift. These three models suffered only a 1% loss, compared to an 11% loss for BERT, when using a dataset representing 60% data change.
When using a dataset representative of 90% data change, the BERT model suffered a 16% loss, compared to the 7% loss of RF and LSTM and the 6% loss of the CNN. Overall, we found that the RF, CNN, and LSTM algorithms were less likely to be impacted by dataset drift when used for spam filter applications.
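The abstract does not include the code used to build the hybrid test sets. A minimal sketch of how such a set might be assembled, assuming simple random sampling without replacement from each source corpus (the function name, signature, and sampling strategy are illustrative assumptions, not the dissertation's actual method):

```python
import random

def make_hybrid_dataset(base_emails, drift_emails, drift_fraction, seed=42):
    """Build a test set of the same size as `base_emails` in which
    `drift_fraction` of the examples are drawn from the drifted source
    (e.g. Enron) and the remainder from the base source (e.g. SpamAssassin).
    """
    rng = random.Random(seed)          # fixed seed for reproducible mixes
    n = len(base_emails)
    n_drift = int(round(n * drift_fraction))
    # Sample without replacement from each corpus, then shuffle the mix
    mixed = (rng.sample(drift_emails, n_drift)
             + rng.sample(base_emails, n - n_drift))
    rng.shuffle(mixed)
    return mixed
```

With `drift_fraction=0.6` and `drift_fraction=0.9`, this would correspond to the 60% and 90% data-change conditions described above.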

Subject Area

Information science|Computer science|Information Technology

Recommended Citation

Fields, Tonya, "Sensitivity of Machine Learning Algorithms to Dataset Drift for the Natural Language Processing Application of Spam Filters" (2021). ETD Collection for Pace University. AAI28775969.
https://digitalcommons.pace.edu/dissertations/AAI28775969
