After pre-processing and transformations, a supervised machine learning algorithm is employed used for learning how to classify documents. The task of supervised machine learning is to learn and generalize an input-output mapping. In case of text categorization the input is the set of documents, the output is their respective categories.
Support Vector Machine (SVM) was proposed by Vapnik [2] in 90’s, and ever since many scientists have been working on related research about SVM on text classification.
Given the training samples, xi, and a predefined transformation ϕ : Rd → F, which maps the features to a transformed feature space, the classifier separates the samples of the two classes ci = {+1,-1} with a hyperplane (whose equation is wT ⋅ x + b in the transformed feature space, building a decision rule of the following form:

This optimal hyperplane can be determined by minimizing the error on an new instance and seen example. In other words, SVM aims to find a maximum margin separating hyper-plane between two classes of data set. That is we are given training data to maximize the nearest distance between a point in one separated hyperplane and a point in the other separated hyperplane.
Second, with Kernel function, we can tell the similarities of two given samples. A simple kernel function, such as K(xi,xj) = xiT ⋅ x j suits well in linearly separable problems. Thus, combined with kernel function, the decision rule can be generalized as:

where K(ϕ(u),ϕ(v)) = ϕ(u) ⋅ ϕ(v) is the kernel function and αi,i = 1,…,n and b maximize the margin of the separating hyperplane, where c = {+1,-1}.
Several remarkable properties of SVM listed below showed its substantial performance in text classification[3]:
Nevertheless, the original SVM is designed for binary classification and a direct solution to the multi-class problem using SVM is usually avoided, due to the various complexities. The typical approach is to use a combination different classifying methods (usually accompanied with SVM) to solve a multi-class classification problem.
Information gain is frequently used in machine learning field, as a criterion of measurement for a word. This function evaluate the information obtained for each category prediction in number of bits, by knowing the presence or absence (or other feature weighing measurement) of a word in a document [4]. Entropy can be viewed as a measurement of how much MORE information you need in order to identify the answers. The higher the entropy is, the less predictable the data is. Thus, we want a maximum information gain, providing the most predictable data. With this idea, we use information gain to be our splitting criteria when categorize the documents.
To achieve better visual demonstration for the viewer, we created a binary tree structure structure, and made the SVM as decisions at each node. At each level of the true, we train m independent SVMs to perform binary classification for each corresponding category, where m is the number of class unused categories at the level. The best SVM is chosen at each level from the m SVM’s trained, based on the information gain.
To determine the best SVM at a given node, the entropy is found for each SVM, based on the number of positive instances classified by each classification.

It is then used to calculated the information gain based on the entropy of the root node.

where A is an attribute within set 𝔸, and Xv is document that satisfies the value
given by the attribute.
In the figure 5 above, there are three categories: A, B, C. At the first level, we train three different SVM, using SVMLight, one for each classes. Assume category A achieves the best one among those three classification (by calculating the information gain and such), we separate the instances at that level by expanding the tree to next level. The left child node contains the instances, which have been classified as positive document by the best SVM (in this case, category A). All the other instances (documents) are passed down to the right node where another decision is performed. We repeat the same process till all the classes (categories) are classified.
To reduce the complexity of the problem, we set the left node to be leaf node, and stop investigating further on the node. Therefore, the left node may contain some false positives case, as SVM may wrongly classified for the best SVM.
Supervised machine learning methods prescribe the input and output format. Mostly the input space is a vector space, the output is a single real number, or boolean (positive/negative). Machine learning algorithms use so-called features (in this case, the word frequency), to correspond to one dimension of the vector space.
First of all, documents need to be pre-processed. This usually means stopword filtering for omitting meaningless words (e.g. a, an, this, that), word stemming for reducing the number of distinct words, lowercase conversion etc.
Then the transformation takes place. The typical way of representing the frequency of a word occurred in the document, might lead to a problem. Take spam filtering as an example, for frequent terms, such as ”the”, would have a lot higher occurrence than those for rare terms, like ”unsubscribe”. The common words appears everywhere in the document, nevertheless, are rather meaningless than the rare but highly suspicious terms. Therefore, We adopted another similar scoring function to weigh each word in the document:
The name Term Frequency-Inverse Document Frequency (TF-IDF) actually applies to a term weighting scheme, which is defined as follows:

where wt,d is the weight of term (token) t in the document (message) d; tft,d is the number of occurrences of the term t in the document d; whereas dft is the number of messages in which the term t occurs, and n, as above, is the total number of documents in the training set.
The transformed documents together form the term-document matrix. It is desirable that documents of different length have the same length in the vector space, which is achieved with the so-called document normalization. However, The dimensionality of the vector space may be very high, which is disadvantageous in machine learning (complexity problems, over-learning)[1].