Section: New Results
Keywords: Support Vector Machine, Massive Classification, Incremental Learning, Ensemble Methods.
Large Scale Classification with Support Vector Machine Algorithms
Participants: Thanh-Nghi Do [correspondant], Jean-Daniel Fekete.
Since Support Vector Machine (SVM) learning algorithms were first proposed by Vapnik [56] , they have been shown to build accurate models with practical relevance for classification, regression and novelty detection. Successful applications of SVMs have been reported for such varied fields as facial recognition, text categorization and bioinformatics. In particular, SVMs using the idea of kernel substitution have been shown to build good models, and they have become increasingly popular classification tools.
However, in spite of their desirable properties, current SVMs cannot easily deal with very large datasets. A standard SVM algorithm requires solving a quadratic or linear program, so its computational cost is at least O(m^2), where m is the number of training datapoints. Moreover, the memory requirements of SVM training frequently make large problems intractable. There is a need to scale up these learning algorithms to handle massive datasets.
Effective heuristic methods to improve SVM learning time divide the original quadratic program into a series of small problems [38], [49]. Incremental learning methods [39], [40] improve memory usage on massive datasets by updating the solution as the training set grows, without needing to load the entire dataset into memory at once. Parallel and distributed algorithms [40] improve learning performance on large datasets by dividing the problem into components that execute on large numbers of networked PCs. Active learning algorithms [52] choose interesting subsets of datapoints (active sets) to construct the model, instead of using the whole dataset.
We propose methods that boost incremental LS-SVM algorithms to classify very large datasets on standard personal computers. Most of our work is based on the LS-SVM classifiers proposed by Suykens and Vandewalle [50], which replace the inequality constraints of the standard SVM optimization with equality constraints under a least-squares error criterion; the training task then only requires solving a system of linear equations instead of a quadratic program, which makes training very fast. We have extended LS-SVM in three ways:

We developed a row-incremental algorithm for classifying massive datasets (billions of datapoints) of dimensionality up to 10^4.

Using a Tikhonov regularization term and the Sherman-Morrison-Woodbury formula [44], we developed a column-incremental LS-SVM algorithm for very-high-dimensional datasets with few training datapoints, such as bioinformatics microarrays.

Applying boosting techniques such as AdaBoost [42] and arc-x4 to these incremental LS-SVM algorithms, we developed efficient classifiers for massive, very-high-dimensional datasets.
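The linear-system formulation and the row-incremental update behind the extensions above can be sketched in a few lines of NumPy. This is a simplified illustration, not the actual implementation: the PSVM-style stacking Z = [y_i x_i, y_i] and the regularization constant c are assumptions, and the real algorithms handle billions of points from disk.

```python
import numpy as np

def lssvm_train_incremental(chunks, dim, c=1.0):
    """Row-incremental linear LS-SVM-style training (illustrative sketch).

    Accumulates the (d+1)x(d+1) matrix Z^T Z and the vector Z^T 1
    chunk by chunk, so only one chunk of rows is in memory at a time.
    """
    G = np.zeros((dim + 1, dim + 1))   # running Z^T Z
    v = np.zeros(dim + 1)              # running Z^T 1
    for X, y in chunks:                # X: (n, d), y in {-1, +1}
        Z = y[:, None] * np.hstack([X, np.ones((len(X), 1))])
        G += Z.T @ Z
        v += Z.sum(axis=0)
    # A regularized linear system replaces the quadratic program
    theta = np.linalg.solve(np.eye(dim + 1) / c + G, v)
    return theta[:-1], theta[-1]       # weights w, bias b

def lssvm_predict(w, b, X):
    return np.sign(X @ w + b)
```

Because the pass over the data only accumulates sums, memory usage depends on the dimensionality d, not on the number of rows m, which is what makes the row-incremental variant practical for massive datasets.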
We also applied these ideas to build boosted versions of other efficient SVM algorithms proposed by Mangasarian and colleagues, namely Lagrangian SVM (LSVM), Proximal SVM (PSVM) and Newton SVM (NSVM), since they have properties similar to LS-SVM. Boosting based on these algorithms is interesting and useful for classification of very large datasets. We evaluated learning time and accuracy on the UCI, Forest cover type, KDD Cup 1999, Reuters-21578 and RCV1-binary datasets. The results showed that our boosted LS-SVM algorithms are usually much faster and/or more accurate than the highly efficient standard SVM implementation LibSVM and than two recent algorithms, SVM-perf and CB-SVM. One example of the effectiveness of the new algorithms is their performance on the KDD Cup 1999 dataset: they performed a binary classification of 5 million datapoints in a 41-dimensional input space within 3 minutes on a standard PC, while the fastest competing method (CB-SVM) required 30 minutes and LibSVM ran out of memory.
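To make the boosting step concrete, a minimal AdaBoost loop over such linear base classifiers might look as follows. The weighted solve in `weighted_lssvm` is a hypothetical adaptation (sample weights entering as in weighted ridge regression); the report does not detail how weights enter the actual LS-SVM system.

```python
import numpy as np

def weighted_lssvm(X, y, w_sample, c=10.0):
    # Hypothetical weighted variant of the linear-system solve:
    # weights enter as in weighted ridge regression (assumption).
    M = np.hstack([X, np.ones((len(X), 1))])
    Wd = w_sample[:, None]
    return np.linalg.solve(np.eye(M.shape[1]) / c + M.T @ (Wd * M),
                           M.T @ (w_sample * y))

def adaboost_lssvm(X, y, rounds=10, c=10.0):
    """AdaBoost over linear LS-SVM-style base classifiers (sketch)."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # sample weight distribution
    models, alphas = [], []
    for _ in range(rounds):
        theta = weighted_lssvm(X, y, w, c)
        pred = np.sign(np.hstack([X, np.ones((n, 1))]) @ theta)
        err = w[pred != y].sum()
        err = min(max(err, 1e-10), 0.5 - 1e-10)   # keep alpha finite
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)             # up-weight mistakes
        w /= w.sum()
        models.append(theta)
        alphas.append(alpha)
    return models, alphas

def boosted_predict(models, alphas, X):
    M = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(sum(a * np.sign(M @ t) for a, t in zip(alphas, models)))
```

Each round re-solves a small linear system on re-weighted data, so the ensemble keeps the fast training property of the base classifier while the weight updates focus later rounds on hard examples.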