One of the proposed ways of stopping spam is to enhance or even substitute the existing standards of email transmission by new, spam-proof variants. The main drawback of the commonly used Simple Mail Transfer Protocol (SMTP) is that it provides no reliable mechanism of checking the identity of the message source. The approach is to use simple tests that allow the system to distinguish human senders from robots by asking the user to answer a moderately easy question before granting him the permission to sent a message 1. Another proposal [3] was to establish a small payment for sending an email message, neglectable for a common user, but big enough to prevent a spammer to broadcast millions of messages. An interesting version of this approach is Zmail protocol 2, where a small fee is paid by the sender to the receiver, so that a common user who sends and receives nearly equal amount of messages gets neither damage no profit from using email, while spamming becomes a costly operation.
The idea of automated methods for filtering such spam from legitimate Emails is becoming necessary. The automatic classification can be done with machine learning techniques. In general, such filter needs a collection of labelled trained data with pre-collected reliable judgements, so that a spam filter can classify a new message to be a spam or non-spam, based on these previous experience. In practice, many tools have been developed to help users organize their mailbox, like Hotmail’s3 ”Junk Mail” folder. Emails in these folders were classified as spam, by built-in classifiers, and used as training sample for future classification.
In 1998 the Naive Bayes classifier was proposed for spam recognition [4, 5]. This
new trend classifier became widely known and used due to Paul Graham’s popular
article ”A Plan for Spam” [1]. This classifier, when applied to text, can be
considered an improved learning-based variant of keyword filtering. From Bayes
theorem, a document d, represented with feature vector
=< x1,…,xn >,
belongs to category c is:
| cNB = | argmax cP(c|x1,…,xn) | |||||
| = | argmax cP(x1,…,xn|c) ⋅ P(c) | (Baye’s Theorem) | ||||
| = | argmax c ∏ 1nP(x i|c) ⋅ P(c) | (independence assumption of NB) | ||||
∀c {spam,leg} |
The Naive Bayesian classifier assumes that all the features are statistically independent, that is, the probability of observing the conjunction x1,…,xn), P(x1,…,xn) is just the product of the probabilities for individual attributes. In practice, P(c) are easy to estimate from the frequencies in which category c occurs in the training corpus. P(xi|c) means given a category c, the probability of feature xi occurs in the document. Typically, this conditional probability is represented with document frequency (or TFIDF), where each word is treated as a feature. However, the drawback of this classifier is that X1,…,Xn are conditionally independent is usually overly simplistic in practice.
Naive Bayesian Classifier is an easy to understand an easy to implement classification techniques. Despite of its impractical independence assumption, this classifier often performs surprisingly effective due to its simplicity, elegance, and robustness. Thus, it is widely used in areas of text classification, such as spam filtering.
The k-Nearest Neighbor (k-NN) classifier was proposed for spam filtering by Androutsopoulos et al. [6]. With this classifies the decision is made as follows: the algorithm assigns to each new unseen instance class from the majority vote among the k training instances that are closest to the unseen instance. Euclidean Distance is typically used to calculate similarity between two given instances. For instance, xi =< xi1,xi2,…,xin > and xj =< xj1,xj2,…,xjn >, their distance is:

However, the choice of k is one of the key issues that affect the performance of this algorithm. If k is too small, then the result can be sensitive to noise points; if k is too large, then the neighborhood may include too many points from other classes. In general, k-NN is very simple to understand and implement. Nevertheless, the drawback is that k-NN is very sensitive to redundant features and noise data, so we might be able to improve the accuracy, if we adopt some noise reduction techniques for k-NN.
Another classifier proposed for spam filtering is Support Vector Machine (SVM), which was proposed by Vapnik [7] in 90’s, and ever since many scientists have been working on related research about SVM on text classification. Given the training samples and a predefined transformation ϕ : Rd → F, which maps the features to a transformed feature space, the classifier separates the samples of the two classes with a hyperplane in the transformed feature space, building a decision rule of the following form:

where K(ϕ(u),ϕ(v)) = ϕ(u) ⋅ ϕ(v) is the kernel function and αi,i = 1,…,n and b maximize the margin of the separating hyperplane. The value -1 corresponds to cleg, 1 corresponds to cspam. In other words, SVM aims to find a maximum margin separating hyper-plane between two classes of data set. That is we are given training data to maximize the nearest distance between a point in one separated hyperplane and a point in the other separated hyperplane.
Several remarkable properties of SVM listed below showed its substantial performance in text classification[8]:
Nevertheless, the original SVM is designed for binary classification and a direct solution to the multi-class problem using SVM is usually avoided, due to the various complexities. The typical approach is to use a combination different classifying methods (usually accompanied with SVM) to solve a multi-class classification problem.
Measure of feature relevance used for ordering features. Each measure applies to a feature, and cspam and cleg are the labels of spam class and legitimate mail class correspondingly. However, the typical way of representing the frequency of a word occurred in the document, might lead to a problem. For frequent terms, such as ”the”, would have a lot higher occurrence than those for rare terms, like ”unsubscribe”. The common words appears everywhere in the document, nevertheless, are rather meaningless than the rare but highly suspicious terms. Therefore, I adopted another similar scoring function to weigh each word in the document:

where wt,d is the weight of term (token) t in the document (message) d; tft,d is the number of occurrences of the term t in the document d; whereas dft is the number of messages in which the term t occurs, and n, as above, is the total number of documents in the training set.