5 Experiment
5.1 Data-Set
TREC 2006 Public Spam Corpora
In these experiments, we have also used the Spam Corpus provided for the TREC
2006 Spam Filtering Track.
5.2 Flow of Preprocess Work
Our spam filtering architecture serves as a general framework for spam classification.
Given an input text, we perform the following operations in sequence:
- MIME Normalization
We unpack MIME encodings into a common representation. In
particular, parsing the MIME header along with the body text is extremely
useful. In addition, we map the text to the character set that is most
meaningful to us; for a typical English-speaking user, this is basic ASCII
(the 7-bit subset of Latin-1), where accents are not significant.
- Tokenized Transformation
We use a regular expression to segment the incoming text into tokens (text
strings) and perform the following transformations:
- Unicode Normalization
To handle accented characters, we transform accented text into an
equivalent composed or decomposed form, from which the plain
English letters can be recovered.
- HTML de-obfuscation
In some rare cases, HTML is an essential part of the message, but
HTML also provides an incredibly rich environment for obscuring
content from a machine classifier while retaining it for a human
viewer. In particular, spammers often insert nonsense tags to break
up otherwise recognizable words, and use font and foreground colors
to hide keywords that they hope will make the message look legitimate.
We adopt the HTMLParser package (.jar) to extract the body text.
This package is also extremely useful for fetching URLs and email
addresses for later experiments.
- Spell-Correction
Following the theory described above, P(c) can be estimated by
reading a default dictionary file during training. We then enumerate
all possible corrections {c1,c2,…,} for a misspelled word w by
checking the edit distance between w and each candidate correction.
However, this mechanism was not fully implemented by the time the
project was due, and it is left as a future improvement for this
email filter.
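The normalization steps described above (MIME unpacking, HTML tag removal, accent folding, and regular-expression tokenization) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the project's Java code: the `email` and `html.parser` modules stand in for JavaMail and HTMLParser, and the function names and the token pattern `TOKEN_RE` are assumptions made for this example. The sketch only drops tags; it does not detect text hidden through font or color tricks, and spell correction is omitted.

```python
import email
import re
import unicodedata
from email import policy
from html.parser import HTMLParser

# Assumed token pattern: runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


class _TextExtractor(HTMLParser):
    """Collects character data and ignores every tag, including nonsense ones."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML fragment with all tags removed."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    return "".join(extractor.parts)


def fold_accents(text):
    """Decompose accented characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def unpack_mime(raw_message):
    """Parse a raw message; return the subject and the decoded text parts."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    subject = str(msg.get("Subject", ""))
    bodies = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() undoes the transfer encoding and the charset.
            bodies.append((part.get_content_subtype(), part.get_content()))
    return subject, bodies


def preprocess(raw_message):
    """Run MIME unpacking, HTML stripping, accent folding, and tokenization."""
    subject, bodies = unpack_mime(raw_message)
    texts = [subject]
    for subtype, body in bodies:
        texts.append(strip_html(body) if subtype == "html" else body)
    folded = fold_accents(" ".join(texts))
    return TOKEN_RE.findall(folded.lower())
```

Because tags are removed before tokenization, a word that a spammer splits with an empty tag pair (for example `Fr<b></b>ee`) is rejoined into a single token.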
- Feature Extraction
- The first step in text classification is to convert documents,
represented as strings of characters, into a format suitable for
the classification task. During the conversion, we determine the
key phrases by counting the occurrences of each word in the feature
space (vocabulary). We use the StringToWordVector filter in
WEKA to convert the original data into a new data set that
records the frequency of each word. The more frequently a word
appears in the data set, the more important it is assumed to be
to the document.
- However, the most frequent words are usually common everyday
words (such as 'a' and 'the'). These stop words, for which WEKA
provides built-in support, should be omitted; our experiments
show that stop words occupy one-fourth of the whole feature
space.
- Finally, we filter out numbers, punctuation, and similar tokens to
reduce the degree of obfuscation presented to the classifier.
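The feature-extraction step can be illustrated with a small word-frequency vectorizer. This is a sketch in plain Python of the general idea only; it is not WEKA's StringToWordVector filter and does not reproduce its options. The stop list, the word pattern, and all function names are assumptions chosen for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); real stop lists are far longer.
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "the", "to"}

# Keep alphabetic runs only, so numbers and punctuation never become features.
WORD_RE = re.compile(r"[a-z]+")


def tokenize(document):
    """Lowercase a document and return its alphabetic, non-stop-word tokens."""
    return [w for w in WORD_RE.findall(document.lower()) if w not in STOP_WORDS]


def build_vocabulary(documents):
    """Map every retained word in the corpus to a fixed column index."""
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def to_word_vector(document, vocabulary):
    """Return the word-frequency vector of one document over the vocabulary."""
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words unseen at training time are ignored
            vector[vocabulary[word]] = count
    return vector
```

A typical use is to call `build_vocabulary` once on the training messages and then `to_word_vector` on each message, which yields one fixed-length row of counts per message for the classifier.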
5.3 Required Libraries
The following libraries were adopted for this project:
- HTMLParser (v 1.6)
is a fast Java package used to parse HTML documents. Primary features
include data transformation, data extraction, filters, custom tags, and
easy-to-use JavaBeans.
- WEKA (v 3.5.7)
is a well-developed Java package that supports a collection of machine
learning algorithms for data mining tasks. WEKA supports data pre-processing,
classification, regression, clustering, association rules, and visualization.
- JavaMail
provides a platform-independent and protocol-independent framework
to build mail and messaging applications.
- Snowball
is a popular package for stemming, the process of reducing inflected
(or sometimes derived) words to their stem or base form.
- libSVM
is an integrated software package for support vector classification,
regression, and distribution estimation (one-class SVM). It also supports
multi-class classification.
- WLSVM
short for Weka LibSVM, combines the merits of WEKA and libSVM.
WLSVM can be viewed as an implementation of libSVM running under
the WEKA environment.