As the messages in this corpus (TREC2006) are in chronological order, we run three major simulations over 500, 1000 and 1800 consecutive samples, of which 75% are spam and 25% are ham. For testing, we perform 10-fold cross-validation and measure the average accuracy for each run. The detailed results of this experiment are listed below.
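The evaluation protocol described above can be sketched as follows. This is a minimal illustration and not the tooling used in the experiments: the names `k_fold_indices`, `cross_validate` and `majority_fit` are hypothetical, and the folds are taken as contiguous, unshuffled blocks, which is an assumption (the text does not say whether folds were shuffled or stratified).

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[object], str]


def k_fold_indices(n_samples: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds and return (train, test) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    splits = []
    start = 0
    for fold in range(k):
        # Spread the remainder over the first `extra` folds.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, test))
        start += size
    return splits


def cross_validate(samples: Sequence, labels: Sequence[str],
                   fit: Callable[[list, list], Predictor], k: int = 10) -> float:
    """Average held-out accuracy over k folds.

    fit(train_samples, train_labels) must return a predict(sample) function.
    """
    accuracies = []
    for train, test in k_fold_indices(len(samples), k):
        predict = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


def majority_fit(train_samples: list, train_labels: list) -> Predictor:
    """Baseline: always predict the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda sample: majority
```

With a 75% spam and 25% ham mix, the `majority_fit` baseline in this sketch scores 75% accuracy, which gives a useful floor when reading the accuracy figures.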
In the first experiment, we classify each message (subject plus body text) with Naive Bayes and a Support Vector Machine:
| # of documents | # of features | Accuracy |
| 400 | 3333 (occurrence > 2) | 95.8449 % |
| 400 | 1302 (occurrence > 5) | 95.8449 % |
| 800 | 5445 (occurrence > 2) | 91.9668 % |
| 800 | 2240 (occurrence > 5) | 91.9668 % |
| 1200 | 10533 (occurrence > 2) | 85.0734 % |
| 1200 | 4795 (occurrence > 5) | 85.4305 % |
| 1800 | 11819 (occurrence > 2) | 86.1173 % |
| 1800 | 5281 (occurrence > 5) | 86.1173 % |
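The feature counts in the table come from keeping only tokens whose occurrence exceeds a threshold (2 or 5). A rough sketch of that filtering step, followed by a small Naive Bayes classifier over the retained tokens, might look like the following. Several choices here are assumptions rather than details from the experiments: the tokenizer, the reading of "occurrence" as total corpus count rather than document frequency, the multinomial model with add-one smoothing, and all names (`tokenize`, `build_vocabulary`, `NaiveBayes`).

```python
import math
import re
from collections import Counter, defaultdict
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents: List[str], min_occurrence: int) -> Set[str]:
    """Keep tokens whose total corpus count is strictly greater than min_occurrence."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return {tok for tok, n in counts.items() if n > min_occurrence}


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents: List[str], labels: List[str], vocabulary: Set[str]) -> "NaiveBayes":
        self.vocabulary = vocabulary
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            # Count only tokens that survived the occurrence threshold.
            self.token_counts[label].update(t for t in tokenize(doc) if t in vocabulary)
        return self

    def predict(self, document: str) -> str:
        tokens = [t for t in tokenize(document) if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Raising the threshold shrinks the vocabulary, which matches the drop in feature count between the "occurrence > 2" and "occurrence > 5" rows.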
We adopt WLSVM, a wrapper library for LibSVM, because of its performance (it runs much faster than Weka's SMO) and its flexible support for several SVM methods (e.g. one-class SVM, nu-SVM, and R-SVM).
| # of documents | # of features | Accuracy |
| 400 | 1302 | 86.9806 % |
| 800 | 2240 | 80.3324 % |
| 1200 | 4795 | 82.202 % |
| 1800 | 5281 | 83.1858 % |
Our second experiment focuses on the URLs and email addresses that appear in the message and in the sender field, about 60% of which come from suspicious spam addresses. We then perform classification with Naive Bayes and k-nearest neighbors (KNN):
| # of samples | Accuracy |
| 1629 | 96.6851 % |
| 8124 | 84.6847 % |
| 12300 | 72.2457 % |
| # of samples | Accuracy (k=1) | Accuracy (k=3) | Accuracy (k=5) |
| 1629 | 97.8449 % | 80.7858 % | 78.76 % |
| 8124 | 96.2457 % | 80.6253 % | 77.8449 % |
| 12300 | 96.6623 % | Out of Memory | Out of Memory |
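To make the second experiment more concrete, the sketch below extracts URLs and email addresses from a raw message and classifies a query by majority vote among its k nearest training samples. It is only an illustration under stated assumptions: the regular expressions are simplified, the set-based representation with Jaccard distance is a stand-in chosen for brevity (the text does not say how the addresses were encoded or which distance was used), and the names `extract_addresses`, `jaccard_distance` and `knn_predict` are hypothetical.

```python
import re
from collections import Counter
from typing import FrozenSet, List, Tuple

# Simplified patterns; real-world URL and address syntax is broader than this.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_addresses(message: str) -> FrozenSet[str]:
    """Return the lowercased URLs and email addresses found in a raw message."""
    found = URL_RE.findall(message) + EMAIL_RE.findall(message)
    # Strip trailing punctuation that commonly follows an address in running text.
    return frozenset(item.lower().rstrip(".,;)") for item in found)


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_predict(train: List[Tuple[FrozenSet[str], str]],
                query: FrozenSet[str], k: int = 1) -> str:
    """Majority vote among the k training samples closest to the query."""
    neighbours = sorted(train, key=lambda item: jaccard_distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Because this kind of KNN keeps every training sample and compares the query against all of them, memory and time grow with the number of samples.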