3 Experiment

3.1 Data Collection

For this experiment, we adopted 20-Newsgroup dataset 1 selected 5 arbitrary classes of news groups from the collection for our experiments.

3.2 Architecture


pict

Figure 3: The Overview Scheme of the Implementation


We have implemented several program in Java for this project. In the front end, we have a newsgroup parser to extract plain text, filter all relatively non-important keywords, and convert the data set into designated SVMLight input. Then, we use a decision-tree-like structure, implemented with SVMLight, to perform the classification, evaluated with information gain at each level. Finally, we perform K cross validation for testing purposes.

3.2.1 Feature Selection

The first step in text classification is to convert documents, represented by strings of characters, into a suitable format for the classification task. During the conversion, we determine the key phrases by viewing the word occupance in the feature space(vocabulary). We use the StringToWordVector filter in WEKA to convert the original article data into a new data set with word frequency for each word. This filtering idea works as follow. Each distinct word corresponds to a feature, with the number of appearance in the data set (all documents)in the data set (all documents). The more frequent a word appears in the data set, the more important the word represents to the document. However, words with high frequency are commonly the daily words we use everyday (like ’a’, ’the’, etc). Those stop words (also from WEKA) should be omitted, and our experiments shown stop words occupy one-fourth of the whole feature space. Last, we also filter numbers, punctuation, and etc.

3.3 Required Libraries

Here is the list of libraries that we have used in this project,

3.4 Experimental Results

We run our classification program into five categories (talk.politics.misc,soc.religion.christian,rec.sport.baseball, rec.autos,comp.sys.ibm.pc.hardware) with 1000 documents. Then, we performed our classification with 10 cross validation, and with different size of feature space. Furthermore, we compare the result with other SVM classification under Weka: WLSVM and SMO. The corresponding accuracy are as follow:

Classification Method# of document# of featuresAccuracy
SVM + Entropy 1000 8800 50.73 %
SVM + Entropy 1000 2000 60.26 %
Relaxed SVM 1000 8800 86.46 %
Relaxed SVM 1000 2000 86.56 %
WLSVM 1000 8800 22.47 %
WLSVM 1000 2000 37.11 %


Also, we tried to observe the effectiveness of the feature selection to this experiment. We found out that the accuracy is dramatically increase after we increase our features, but this improvement stops when we have sufficient large enough of features. However, we were not be able to specify the precise number of features in StringToWordVector in WEKA, so we only could present this approximated result. Therefore, a more precise experiment on feature selection should be consider in the future tasks.

From the results we could infer that the decision tree based SVM perform well when compared to all other ordinary algorithms. Though there are some algorithms which achieve higher accuracy than this decision tree based SVM, our implementation is in its over simplified form. Improving the algorithm in different ways could improve the results further.