7 Conclusion and Future Work

In our experiments, the traditional Bayesian filter based on text classification usually outperformed the other methods in both training time and classification performance. This helps explain why it is widely used in commercial email filtering software. In addition, URL classification produced interesting results and merits further investigation and experimentation in the future. However, several issues remain unresolved in this project. The most critical is Bayesian poisoning, sometimes called the good word attack. In this attack, the spammer skews the message statistics by padding a spam message with enough "neutral" text to drop its score below the filter's statistical threshold. This has been an effective strategy against statistical filters such as the Bayesian filter, which examine the content of a message and weigh the presence of spam-related words and phrases against the message as a whole. Other spammer tricks, such as image spam or table-based obfuscation, are outside the scope of this project.
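The effect of Bayesian poisoning can be illustrated with a minimal sketch of a naive Bayes score. The word likelihoods, the default value for unseen words, and the 0.5 threshold below are invented for illustration and are not taken from our experiments. In this simplified model, the padding words only lower the score if they are more typical of legitimate mail than of spam.

```python
import math

# Hypothetical per-word likelihoods P(word | spam) and P(word | ham).
# These numbers are made up for illustration only.
P_WORD_GIVEN_SPAM = {"viagra": 0.05, "free": 0.04,
                     "meeting": 0.002, "report": 0.002, "thanks": 0.003}
P_WORD_GIVEN_HAM = {"viagra": 0.0005, "free": 0.005,
                    "meeting": 0.03, "report": 0.02, "thanks": 0.03}
UNSEEN = 0.001       # assumed likelihood for words missing from both tables
THRESHOLD = 0.5      # assumed decision threshold


def spam_probability(words, prior_spam=0.5):
    """Return P(spam | words) under a naive Bayes independence assumption."""
    # Work in log-odds space to avoid multiplying many small numbers.
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for word in words:
        p_spam = P_WORD_GIVEN_SPAM.get(word, UNSEEN)
        p_ham = P_WORD_GIVEN_HAM.get(word, UNSEEN)
        log_odds += math.log(p_spam / p_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))


# The original spam message, and the same message padded with "good" words.
spam_message = ["viagra", "free"]
poisoned_message = spam_message + ["meeting", "report", "thanks"]

for name, message in [("original", spam_message), ("poisoned", poisoned_message)]:
    score = spam_probability(message)
    verdict = "spam" if score >= THRESHOLD else "ham"
    print(f"{name}: score={score:.3f} -> {verdict}")
```

With these assumed numbers, the original message scores well above the threshold, while the padded message falls below it even though the spam-related words are still present.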

There is no doubt that spam represents a significant, global threat to every internet user. The costs associated with the problem are spiraling out of control, as employees lose productivity and companies spend billions each year to process and store spam messages. The good news, however, is that current filtering techniques are highly effective and strongly resistant to most attacks. To conclude, coding and implementing these algorithms in this experiment has deepened our understanding of machine learning algorithms.