For this experiment, we adopted 20-Newsgroup dataset 1 selected 5 arbitrary classes of news groups from the collection for our experiments.
We have implemented several program in Java for this project. In the front end, we have a newsgroup parser to extract plain text, filter all relatively non-important keywords, and convert the data set into designated SVMLight input. Then, we use a decision-tree-like structure, implemented with SVMLight, to perform the classification, evaluated with information gain at each level. Finally, we perform K cross validation for testing purposes.
The first step in text classification is to convert documents, represented by strings of characters, into a suitable format for the classification task. During the conversion, we determine the key phrases by viewing the word occupance in the feature space(vocabulary). We use the StringToWordVector filter in WEKA to convert the original article data into a new data set with word frequency for each word. This filtering idea works as follow. Each distinct word corresponds to a feature, with the number of appearance in the data set (all documents)in the data set (all documents). The more frequent a word appears in the data set, the more important the word represents to the document. However, words with high frequency are commonly the daily words we use everyday (like ’a’, ’the’, etc). Those stop words (also from WEKA) should be omitted, and our experiments shown stop words occupy one-fourth of the whole feature space. Last, we also filter numbers, punctuation, and etc.
Here is the list of libraries that we have used in this project,
We run our classification program into five categories
(talk.politics.misc,soc.religion.christian,rec.sport.baseball,
rec.autos,comp.sys.ibm.pc.hardware) with 1000 documents. Then, we
performed our classification with 10 cross validation, and with different size
of feature space. Furthermore, we compare the result with other SVM
classification under Weka: WLSVM and SMO. The corresponding accuracy are as
follow:
| Classification Method | # of document | # of features | Accuracy |
| SVM + Entropy | 1000 | 8800 | 50.73 % |
| SVM + Entropy | 1000 | 2000 | 60.26 % |
| Relaxed SVM | 1000 | 8800 | 86.46 % |
| Relaxed SVM | 1000 | 2000 | 86.56 % |
| WLSVM | 1000 | 8800 | 22.47 % |
| WLSVM | 1000 | 2000 | 37.11 % |
From the results we could infer that the decision tree based SVM perform well when compared to all other ordinary algorithms. Though there are some algorithms which achieve higher accuracy than this decision tree based SVM, our implementation is in its over simplified form. Improving the algorithm in different ways could improve the results further.