Overview

We present BudgetedSVM, a C++ toolbox containing highly optimized implementations of three recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines (AMM), Budgeted Stochastic Gradient Descent (BSGD), and Low-rank Linearization SVM (LLSVM). The toolbox also includes Pegasos, a state-of-the-art linear SVM solver, as it is a special case of AMM. To avoid confusion, we note that our toolbox is not related to the Budget-SVM method described in Dekel, O. and Singer, Y., Support Vector Machines on a Budget, Advances in Neural Information Processing Systems (NIPS), 2007.
Some of the main features of the BudgetedSVM package are listed below:

  • We provide efficient implementations of algorithms for highly-scalable non-linear SVM training.
  • The toolbox can handle large-scale, high-dimensional data sets that cannot be loaded into memory.
  • The toolbox requires constant memory to train accurate models that solve highly non-linear problems.
  • We provide command-line and Matlab interfaces to BudgetedSVM.
  • We provide an efficient API with functionality for handling large-scale, high-dimensional data sets. Using the BudgetedSVM API, data sets with millions of data points and/or features are easily handled. For more details, please see the documentation included in the download package.
BudgetedSVM is capable of training non-linear models with accuracy comparable to that of LibSVM in training time comparable to that of LibLinear. Unlike other popular, state-of-the-art machine learning libraries such as Weka, Shogun, Shark, or scikit-learn, which provide very comprehensive sets of machine learning tools, the BudgetedSVM toolbox is focused on solving large-scale, non-linear classification problems when only limited computational resources are available. In particular, through highly optimized C++ implementations of recently proposed large-scale SVM approximators, BudgetedSVM can be used to solve non-linear classification problems with millions of examples and/or features within minutes, even on a regular personal computer.

BudgetedSVM is free, open-source software, published under the LGPL license. If you find the toolbox useful in your own work, please cite our JMLR paper.

Step-by-step command-prompt example

The interface design follows the format of LibSVM and LibLinear, enabling users of these two popular packages to easily adopt BudgetedSVM in their projects. Before moving on, you will need to download the BudgetedSVM toolbox, which can be found at the Download page. After downloading and unzipping the software package, cd to the directory where you unzipped the toolbox and type make to compile it. Then, in order to train and evaluate a classifier on the adult9a data set (no need to look for the data set online, as we included the training and testing .txt files in the .zip file), we can run the following commands to train the non-linear classifiers implemented in BudgetedSVM:

>> # train and test AMM algorithm
>> bin/budgetedsvm-train -A 1 -e 5 -L 0.001 -B 20 -D 123 -v 1 -k 10000 -c 10 a9a_train.txt a9a_model.txt
>> bin/budgetedsvm-predict -v 1 a9a_test.txt a9a_model.txt a9a_predictions.txt

>> # train and test LLSVM algorithm
>> bin/budgetedsvm-train -A 3 -L 0.1 -K 0 -g 0.01 -B 100 -m 1 -D 123 -v 1 a9a_train.txt a9a_model.txt
>> bin/budgetedsvm-predict -v 1 a9a_test.txt a9a_model.txt a9a_predictions.txt

>> # train and test BSGD algorithm
>> bin/budgetedsvm-train -A 4 -g 0.01 -e 5 -L 0.0001 -B 200 -m 1 -D 123 -v 1 a9a_train.txt a9a_model.txt
>> bin/budgetedsvm-predict -v 1 a9a_test.txt a9a_model.txt a9a_predictions.txt
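
Pegasos, included in the toolbox as a special case of AMM, is run through the same interface. Below is a minimal sketch; we assume here that -A 0 selects Pegasos (the exact algorithm codes are listed when budgetedsvm-train is run without arguments):

>> # train and test Pegasos (linear SVM); -A 0 is assumed to select Pegasos
>> bin/budgetedsvm-train -A 0 -e 5 -L 0.001 -D 123 -v 1 a9a_train.txt a9a_model.txt
>> bin/budgetedsvm-predict -v 1 a9a_test.txt a9a_model.txt a9a_predictions.txt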

Note that the train and predict programs were created in the "./bin" folder, which is why we prepended "bin/" to the program calls. If you are using the Windows operating system, be careful to use "\" (back-slash) instead of "/" (forward-slash) when specifying the path to the programs in the command prompt. After running the examples, the algorithms should achieve an error rate of around 15% (the data set is shuffled during training, so the numbers may fluctuate slightly between runs). A detailed explanation of the above code examples follows:

  • AMM example: The first command uses the AMM batch (-A 1) algorithm to train a multi-hyperplane machine for 5 epochs (-e 5), using a regularization parameter lambda of 0.001 (-L 0.001; larger values result in a less complex model, or, in other words, more regularization) and setting the maximum number of weights per class to 20 (-B 20). As the adult9a data set is of dimensionality 123, we also write -D 123, and we choose verbose output (-v 1), which prints detailed steps of the algorithm. Lastly, we specify that pruning of weights is performed every 10,000 iterations (-k 10000; smaller values result in more aggressive pruning), with the pruning parameter set to 10 (-c 10; larger values result in more aggressive pruning). If you do not specify a name for the model file, it is created by appending the suffix .model to the filename of the training file, as illustrated after this list (note that in the above example we did specify the model filename). The second command evaluates the model on the testing data set, prints the accuracy on the testing set, and saves the predictions to a9a_predictions.txt.
  • LLSVM example: The first command uses the LLSVM (-A 3) algorithm to train a classification model, setting the regularization parameter to 0.1 (-L 0.1; larger values result in a less complex model, or, in other words, more regularization), which results in stronger regularization than in the AMM case described above. We use the Gaussian kernel (-K 0) with kernel width 0.01 (-g 0.01). With the -B 100 option we set the budget, specifying that the model will comprise 100 landmark points, which are chosen by running k-means on the loaded training data (-m 1). As the adult9a data set is of dimensionality 123, we also write -D 123, and we choose verbose output (-v 1), which prints detailed steps of the algorithm. If you do not specify a name for the model file, it is created by appending the suffix .model to the filename of the training file. The second command evaluates the model on the testing data set, prints the accuracy on the testing set, and saves the predictions to a9a_predictions.txt.
  • BSGD example: The first command uses the BSGD (-A 4) algorithm to train a classification model for 5 epochs (-e 5), using a regularization parameter lambda of 0.0001 (-L 0.0001; larger values result in a less complex model, or, in other words, more regularization) and a Gaussian-kernel width of 0.01 (-g 0.01). With the -B 200 option we specify that the model will comprise 200 support vectors; in the case of a budget overflow, two support vectors are merged (-m 1) to maintain the budget. As the adult9a data set is of dimensionality 123, we also write -D 123, and we choose verbose output (-v 1), which prints detailed steps of the algorithm. If you do not specify a name for the model file, it is created by appending the suffix .model to the filename of the training file. The second command evaluates the model on the testing data set, prints the accuracy on the testing set, and saves the predictions to a9a_predictions.txt.
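
If the model filename is omitted, the model file is named by appending the suffix .model to the training-file name. A minimal sketch follows; note that the exact default filename shown in the comment is our reading of that rule:

>> # no model filename given; the model should be saved as a9a_train.txt.model
>> bin/budgetedsvm-train -A 1 -e 5 -L 0.001 -B 20 -D 123 -v 1 a9a_train.txt
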
For a list of all available options simply run budgetedsvm-train and budgetedsvm-predict without any arguments, or see the readme.txt file included in the package.
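
For example, running the programs without any arguments prints their usage information and the complete option listings:

>> # print usage information and the full option listings
>> bin/budgetedsvm-train
>> bin/budgetedsvm-predict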

Step-by-step Matlab example

We provide a similar interface for Matlab/Octave as well. Note that the toolbox includes the functions libsvmread and libsvmwrite, written by Rong-En Fan and Kai-Wei Chang from National Taiwan University as a part of the LibSVM package, which are used for loading data sets into Matlab and for saving Matlab matrices to .txt files, respectively. After downloading and unzipping the software package, cd to the matlab folder located in the directory where you unzipped the toolbox, and type make to compile the toolbox. Then, in order to train and evaluate the AMM classifier on the adult9a data set, we can run the following commands:

>> % load the data into Matlab
>> [a9a_label, a9a_inst] = libsvmread('../a9a_train.txt');
>> % train a non-linear model on the training set
>> model = budgetedsvm_train(a9a_label, a9a_inst, '-A 2 -L 0.001 -v 1 -e 5');
>> % evaluate the trained model on the training data
>> [accuracy, predict_label] = budgetedsvm_predict(a9a_label, a9a_inst, model, '-v 1');
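
The libsvmwrite function mentioned above works in the opposite direction, saving Matlab matrices to LibSVM-format .txt files. A minimal sketch is given below, where a9a_copy.txt is just an example output filename; note that the instance matrix passed to libsvmwrite must be sparse, which already holds for matrices returned by libsvmread:

>> % save the loaded matrices to a LibSVM-format text file
>> libsvmwrite('a9a_copy.txt', a9a_label, a9a_inst);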

Unlike in the command-prompt example given above, in this example we use the AMM online (-A 2) algorithm to train a multi-hyperplane machine. Note, however, that the above example assumes the data can be loaded into the Matlab workspace. This might not be possible for large-scale data sets, and for that reason we also include a version where the data is loaded into BudgetedSVM directly from the .txt files, and the learned model is stored in a .txt file instead of in the Matlab workspace:

>> % train a non-linear model on the training set
>> budgetedsvm_train('../a9a_train.txt', '../a9a_model.txt', '-A 2 -L 0.001 -v 1 -e 5 -D 123');
>> % evaluate the trained model on the testing data
>> [accuracy, predict_label] = budgetedsvm_predict('../a9a_test.txt', '../a9a_model.txt', '-v 1');

After running the examples at the Matlab prompt, the algorithm should achieve an error rate of around 15%.
Here we do not provide examples for the other implemented algorithms, as they can be easily trained by using the option strings shown in the above command-prompt examples; one such example is sketched below. For a list of all available options simply run budgetedsvm_train and budgetedsvm_predict without any arguments, or see the matlab/readme.txt file included in the package.
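
For instance, the following sketch trains and evaluates LLSVM from Matlab by reusing the LLSVM option string from the command-prompt example given earlier:

>> % train and evaluate LLSVM, reusing the command-prompt option string
>> budgetedsvm_train('../a9a_train.txt', '../a9a_model.txt', '-A 3 -L 0.1 -K 0 -g 0.01 -B 100 -m 1 -D 123 -v 1');
>> [accuracy, predict_label] = budgetedsvm_predict('../a9a_test.txt', '../a9a_model.txt', '-v 1');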

Empirical evaluation

BudgetedSVM algorithms can learn a model even for data sets with millions of data points and features, with training times orders of magnitude shorter than those of SVM with RBF kernel (RBF-SVM) trained using LibSVM. For illustration, in Table 1 we give a comparison of error rates and training times on several large-scale, high-dimensional binary classification tasks, which can be found here (N is the number of data points, D is the data dimensionality, S is the fraction of non-zero features, B is the size of the trained model). Note that Pegasos is a state-of-the-art linear classifier, and its performance is expected to be comparable to that of other linear classifiers. We can see that on the webspam and rcv1-bin data sets it took LibSVM hours to train RBF-SVM, while BudgetedSVM algorithms with much smaller budgets achieved high accuracies within minutes, and even seconds in the case of AMM. Note that the budget B of the algorithms implemented in BudgetedSVM is several orders of magnitude smaller than the number of support vectors found by RBF-SVM.

Table 1. Error rates (in %) and training times (1) on benchmark data sets

Data set        Pegasos       AMM             LLSVM                 BSGD                  LibLinear (2)   RBF-SVM
                err.   time   err.  B   time  err.   B      time    err.   B      time    err.    time    err.    time
webspam         7.94   0.5s   4.74  9   3s    3.46   500    2.5m    2.04   500    2.0m    7.36    1s      0.77    4.0h
N = 280,000                                   2.60   1,000  6.1m    1.72   1,000  3.9m                    (#SV: 26,447)
D = 254                                       1.99   3,000  0.5h    1.49   3,000  0.2h
S = 41.9%

rcv1-bin        2.73   1.5s   2.39  19  9s    4.97   500    0.2h    3.33   500    0.8h    2.43    4s      2.17    20.2h
N = 677,399                                   4.23   1,000  0.5h    2.92   1,000  1.5h                    (#SV: 50,641)
D = 47,236                                    3.05   3,000  2.2h    2.53   3,000  4.4h
S = 0.2%

mnist8m-bin     22.71  1.1m   3.16  18  4m    6.84   500    1.6h    2.23   500    2.3h    22.65   4.9m    0.43    N/A (3)
N = 8,000,000                                 4.59   1,000  3.8h    1.92   1,000  4.9h
D = 784                                       2.59   3,000  15h     1.62   3,000  16.1h
S = 19.3%

1. We excluded data loading time, and evaluated performance on an Intel E7400 processor (2.80GHz) with 4GB RAM. The scripts used to generate the results in the table can be found here: BudgetedSVM, LibLinear, LibSVM.
2. We used the LibLinear extension for the case when the data set cannot be loaded into memory, found here.
3. The listed accuracy was obtained after 2 days of P-packSVM training on 512 processors; from Zhu, Z. A., Chen, W., Wang, G., Zhu, C., and Chen, Z., P-packSVM: Parallel primal gradient descent kernel SVM, International Conference on Data Mining (ICDM), 2009.