6 Classification Results

As the messages in this corpus (TREC2006) are in chronological order, we run three major simulations over 500, 1000 and 1800 consecutive samples, of which 75% are spam and 25% are ham. For testing, we perform 10-fold cross-validation and measure the average accuracy for each run. The detailed results of this experiment are listed below.
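The evaluation protocol can be sketched as follows. This is a minimal sketch, not the code used for the experiments: it assumes contiguous folds over the chronologically ordered samples, and the names `k_fold_indices`, `cross_validated_accuracy` and `majority_baseline` are hypothetical helpers introduced only for illustration. Any classifier can be plugged in through the `train_and_score` callback.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds and return (train, test) index lists."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i, test in enumerate(folds):
        # Training set = every fold except the held-out one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validated_accuracy(
    samples: Sequence,
    labels: Sequence[str],
    train_and_score: Callable[[list, list, list, list], float],
    k: int = 10,
) -> float:
    """Average the per-fold accuracy returned by train_and_score."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        scores.append(
            train_and_score(
                [samples[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [samples[i] for i in test_idx],
                [labels[i] for i in test_idx],
            )
        )
    return sum(scores) / len(scores)


def majority_baseline(train_x, train_y, test_x, test_y) -> float:
    """Hypothetical stand-in classifier: always predict the majority training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return sum(1 for y in test_y if y == majority) / len(test_y)
```

With a 75% spam and 25% ham split, a baseline such as `majority_baseline` gives a useful floor against which the accuracies reported below can be read.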

6.1 Text Classification (Subject plus Body)

In the first experiment, we classify each message (subject plus body text) with Naive Bayes and a Support Vector Machine (SVM):

6.1.1 Naive Bayes
# of documents   # of features             Accuracy
400              3333 (occurrence > 2)     95.8449 %
400              1302 (occurrence > 5)     95.8449 %
800              5445 (occurrence > 2)     91.9668 %
800              2240 (occurrence > 5)     91.9668 %
1200             10533 (occurrence > 2)    85.0734 %
1200             4795 (occurrence > 5)     85.4305 %
1800             11819 (occurrence > 2)    86.1173 %
1800             5281 (occurrence > 5)     86.1173 %
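
The occurrence thresholds in the table above can be illustrated with a small vocabulary filter. This is a minimal sketch under stated assumptions: tokens are lowercase alphanumeric runs, and "occurrence" is taken to mean the total count of a token over the whole corpus (the experiments may instead have counted documents). `tokenize` and `build_vocabulary` are hypothetical names, not part of the original setup.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumption: a token is a run of lowercase letters, digits or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(documents: Iterable[str], min_occurrence: int) -> List[str]:
    """Keep tokens whose total corpus count is strictly greater than min_occurrence."""
    counts: Counter = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return sorted(tok for tok, c in counts.items() if c > min_occurrence)
```

Raising `min_occurrence` from 2 to 5 shrinks the vocabulary in the same way the feature counts shrink between paired rows of the table.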


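The results above show that accuracy tends to drop as the number of documents, and with it the number of features, grows: from 95.8% at 400 documents to about 85-86% at 1200 and 1800 documents. We attribute this to the growing amount of noise in the data. For a fixed number of documents, raising the occurrence threshold from 2 to 5 leaves the accuracy essentially unchanged.

6.1.2 Support Vector Machine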

We adopt WLSVM, a wrapper library for LibSVM, because of its good performance (it runs much faster than Weka SMO) and its flexible support for several SVM methods (e.g., one-class SVM, nu-SVM, and R-SVM).

# of documents   # of features   Accuracy
400              1302            86.9806 %
800              2240            80.3324 %
1200             4795            82.202 %
1800             5281            83.1858 %


Accuracy stays between roughly 80% and 87% in every run. Combining these two results, we observe that Naive Bayes classifies more accurately than SVM. However, WLSVM shows clear advantages in memory usage and running time, whereas the Naive Bayes classifier takes significantly more time and memory on the same sample set.

6.2 URL Classification (Email address plus URL address)

Our second experiment focuses on the URLs and email addresses that appear in the message and in the sender field, about 60% of which come from suspicious spam addresses. We then perform classification with Naive Bayes and K Nearest Neighbor (KNN):

6.2.1 Naive Bayes
# of samples   Accuracy
1629           96.6851 %
8124           84.6847 %
12300          72.2457 %


6.2.2 K Nearest Neighbor
# of samples   Accuracy (k=1)   Accuracy (k=3)   Accuracy (k=5)
1629           97.8449 %        80.7858 %        78.76 %
8124           96.2457 %        80.6253 %        77.8449 %
12300          96.6623 %        Out of Memory    Out of Memory


We observe that accuracy is highest when k equals 1 and decreases as we increase k. We choose odd values of k so that the majority vote cannot tie in this binary classification experiment.
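
The role of k and of the odd-k choice can be sketched with a small majority-vote classifier. This is a minimal sketch, assuming numeric feature vectors and Euclidean distance; the distance measure and feature encoding used in the experiments are not stated, and `knn_predict` is a hypothetical name. With two classes and an odd k, the vote counts can never be equal.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(
    train: Sequence[Tuple[Sequence[float], str]],
    query: Sequence[float],
    k: int,
) -> str:
    """Return the majority label among the k training points nearest to query."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): one spam point sits inside a ham cluster.
toy_train = [
    ((0.0, 0.0), "ham"),
    ((0.0, 1.0), "ham"),
    ((1.0, 0.0), "spam"),
    ((5.0, 5.0), "spam"),
    ((5.0, 6.0), "spam"),
]
```

On the toy data, a query at `(0.9, 0.0)` is labeled by its single nearest neighbor when k=1, while k=3 lets the two surrounding ham points outvote it, which shows how the prediction can change as k grows.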