Section: New Results
Large Scale Classification with Support Vector Machine Algorithms
Since Support Vector Machine (SVM) learning algorithms were first proposed by Vapnik, they have been shown to build accurate models of practical relevance for classification, regression and novelty detection. Successful applications of SVMs have been reported in fields as varied as face recognition, text categorization and bioinformatics. In particular, SVMs based on the idea of kernel substitution have been shown to build good models, and they have become increasingly popular classification tools.
However, in spite of these desirable properties, current SVMs cannot easily deal with very large datasets. A standard SVM algorithm requires solving a quadratic or linear program, so its computational cost is at least O(m²), where m is the number of training datapoints. Moreover, the memory requirements of SVM training frequently make it intractable. There is therefore a need to scale up these learning algorithms to handle massive datasets.
Effective heuristic methods for improving SVM learning time divide the original quadratic program into a series of small subproblems. Incremental learning methods improve memory performance on massive datasets by updating the solution as the training set grows, without needing to load the entire dataset into memory at once. Parallel and distributed algorithms improve learning performance on large datasets by dividing the problem into components that execute on large numbers of networked PCs. Active learning algorithms construct models from interesting subsets of datapoints (active sets) instead of the whole dataset.
We propose methods for boosting incremental least squares SVM (LS-SVM) algorithms to classify very large datasets on standard personal computers. Most of our work is based on the LS-SVM classifiers proposed by Suykens and Vandewalle. They replace the inequality constraints of the standard SVM optimization with equality constraints in a least squares formulation, so the training task only requires solving a system of linear equations instead of a quadratic program. This makes training time very short. We have extended LS-SVM in three ways:
We developed a row-incremental algorithm for classifying massive datasets (billions of points) of dimensionality up to 10⁴.
Using a Tikhonov regularization term and the Sherman-Morrison-Woodbury formula, we developed a column-incremental LS-SVM algorithm for very-high-dimensional datasets with few training datapoints, such as bioinformatics microarray data.
Applying boosting techniques such as AdaBoost and arc-x4 to these incremental LS-SVM algorithms, we developed efficient classifiers for massive, very-high-dimensional datasets.
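The row-incremental idea can be sketched for a simplified linear LS-SVM-style classifier (bias term omitted): because the normal equations depend on the data only through the d×d matrix XᵀX and the d-vector Xᵀy, these statistics can be accumulated chunk by chunk, so only one chunk of rows ever resides in memory. The function name and chunking interface below are illustrative, not from the paper.

```python
import numpy as np

def row_incremental_lssvm(chunks, dim, gamma=1.0):
    """Fit a linear LS-SVM-style classifier by streaming row chunks.

    Accumulates A = I/gamma + sum_k X_k^T X_k and b = sum_k X_k^T y_k,
    then solves A w = b. `chunks` yields (X_chunk, y_chunk) pairs with
    labels in {-1, +1}. Memory use is O(d^2), independent of the number
    of rows, so billions of points can be processed chunk by chunk.
    """
    A = np.eye(dim) / gamma          # Tikhonov regularization term
    b = np.zeros(dim)
    for X, y in chunks:
        A += X.T @ X                 # d x d update per chunk
        b += X.T @ y
    return np.linalg.solve(A, b)     # weight vector w; predict with sign(X @ w)
```

The result is identical (up to floating-point rounding) to solving the same regularized system on the full dataset in one pass.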
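For the very-high-dimensional case (d ≫ m, as with microarrays), the Sherman-Morrison-Woodbury identity gives (I_d/γ + XᵀX)⁻¹ Xᵀy = Xᵀ(I_m/γ + XXᵀ)⁻¹ y, so only an m×m system need be solved instead of a d×d one. The sketch below illustrates this identity under the same simplified bias-free linear model as above; it is not the authors' exact formulation.

```python
import numpy as np

def lssvm_highdim(X, y, gamma=1.0):
    """Solve (I_d/gamma + X^T X) w = X^T y without forming a d x d matrix.

    Uses the Sherman-Morrison-Woodbury identity
        (I_d/gamma + X^T X)^{-1} X^T y = X^T (I_m/gamma + X X^T)^{-1} y,
    so the cost is dominated by an m x m solve -- cheap when m << d.
    """
    m = X.shape[0]
    alpha = np.linalg.solve(np.eye(m) / gamma + X @ X.T, y)  # m x m system
    return X.T @ alpha                                       # w in R^d
```

Correctness of the identity is easy to check: applying (I_d/γ + XᵀX) to Xᵀα yields Xᵀ(I_m/γ + XXᵀ)α = Xᵀy.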
We also applied these ideas to build boosted versions of other efficient SVM algorithms proposed by Mangasarian and colleagues, namely Lagrangian SVM (LSVM), Proximal SVM (PSVM) and Newton SVM (NSVM), which have properties similar to LS-SVM. Boosting based on these algorithms is interesting and useful for classification of very large datasets. Learning time and accuracy were evaluated on the UCI, Forest Cover Type, KDD Cup 1999, Reuters-21578 and RCV1-binary datasets. The results show that our boosted LS-SVM algorithms are usually much faster and/or more accurate for classification tasks than the highly efficient standard SVM implementation LibSVM and than two recent algorithms, SVM-perf and CB-SVM. One example of the effectiveness of the new algorithms is their performance on the KDD Cup 1999 dataset: they performed a binary classification of 5 million datapoints in a 41-dimensional input space within 3 minutes on a standard PC, whereas the fastest competing method (CB-SVM) required 30 minutes and LibSVM ran out of memory.
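The boosting scheme can be sketched as standard AdaBoost over weighted base learners, here a weighted variant of the same simplified linear least-squares classifier used above. All function names are illustrative; the actual implementation and base learners (LS-SVM, LSVM, PSVM, NSVM) differ in detail.

```python
import numpy as np

def fit_weighted_base(X, y, d, gamma=10.0):
    """Weighted linear least-squares classifier (simplified base learner)."""
    A = np.eye(X.shape[1]) / gamma + X.T @ (d[:, None] * X)
    return np.linalg.solve(A, X.T @ (d * y))

def adaboost_lssvm(X, y, rounds=10, gamma=10.0):
    """AdaBoost over weighted least-squares base classifiers (a sketch)."""
    m = X.shape[0]
    d = np.full(m, 1.0 / m)                 # example weights
    models = []
    for _ in range(rounds):
        beta = fit_weighted_base(X, y, d, gamma)
        pred = np.sign(X @ beta)
        err = np.sum(d[pred != y])          # weighted training error
        if err <= 0:                        # perfect base learner
            models.append((1.0, beta))
            break
        if err >= 0.5:                      # no better than chance: stop
            break
        a = 0.5 * np.log((1 - err) / err)   # base-learner vote weight
        models.append((a, beta))
        d *= np.exp(-a * y * pred)          # up-weight misclassified points
        d /= d.sum()
    return models

def predict(models, X):
    """Weighted vote of the base classifiers."""
    return np.sign(sum(a * (X @ b) for a, b in models))
```

Because each base learner only requires a linear solve, each boosting round stays cheap, which is what makes the combination attractive for very large datasets.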