## Implemented algorithms

BudgetedSVM provides highly optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) classifiers. In particular, we provide implementations of Pegasos, Adaptive Multi-hyperplane Machines (AMM), Budgeted Stochastic Gradient Descent (BSGD), and Low-rank Linearization SVM (LLSVM), described in more detail below. By combining efficient implementation with the high accuracy of these non-linear algorithms, BudgetedSVM learns models with accuracies comparable to LibSVM in time comparable to LibLinear, and allows solving highly non-linear problems with millions of high-dimensional examples within minutes on a regular personal computer. An overview of the algorithms implemented in the BudgetedSVM toolbox is given in Table 1.

Table 1. Overview of BudgetedSVM algorithms

| Algorithm | Type of classifier | Multi-class? | Available kernels |
| --- | --- | --- | --- |
| **Pegasos** | Linear | Multi-class | Linear |
| **AMM** | Non-linear | Multi-class | Linear |
| **LLSVM** | Non-linear | Binary | Any |
| **BSGD** | Non-linear | Multi-class | Any for random removal; Gaussian when merging support vectors |

### Primal Estimated sub-GrAdient SOlver for SVM (Pegasos)

Pegasos is a state-of-the-art linear SVM solver that uses Stochastic Gradient Descent to learn a large-scale, multi-class model. Each class is assigned a single hyperplane (weight vector), and Pegasos predicts the class whose weight vector provides the largest prediction value. For more details please see:

*Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A., Pegasos: Primal estimated sub-gradient solver for SVM, Mathematical Programming, 127(1), 3-30, 2011* [pdf]
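The per-example update described above can be sketched in a few lines. This is an illustrative simplification, not the toolbox's C++ implementation: one weight vector per class, SGD on the multi-class hinge loss with the characteristic Pegasos step size 1/(λt), and all function names are hypothetical.

```python
import numpy as np

def pegasos_multiclass(X, y, num_classes, lam=0.01, epochs=5, seed=0):
    """Sketch of multi-class Pegasos: one hyperplane (weight vector)
    per class, updated by SGD with step size eta = 1 / (lam * t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((num_classes, d))
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)
            x, c = X[i], y[i]
            scores = W @ x
            # find the most competitive wrong class
            scores_wrong = scores.copy()
            scores_wrong[c] = -np.inf
            r = int(np.argmax(scores_wrong))
            W *= (1.0 - eta * lam)           # shrink: regularization gradient
            if scores[c] - scores[r] < 1.0:  # margin violated: hinge subgradient
                W[c] += eta * x
                W[r] -= eta * x
    return W

def pegasos_predict(W, X):
    # predict the class of the weight with the largest prediction value
    return np.argmax(X @ W.T, axis=1)
```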

### Adaptive Multi-hyperplane Machines (AMM)

The AMM model consists of a set of hyperplanes (weights), each assigned to one of the classes, and predicts the class whose weights provide the largest prediction value. The number of weights is determined automatically through an iterative procedure based on stochastic gradient descent (SGD), which is guaranteed to converge to a local optimum. Since the generalization bound decreases with the number of weights, a weight-pruning mechanism that periodically removes hyperplanes is proposed and analyzed. For more details please see:

*Wang, Z., Djuric, N., Crammer, K., Vucetic, S., Trading Representability for Scalability: Adaptive Multi-Hyperplane Machine for Nonlinear Classification, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2011* [pdf]
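The AMM prediction rule follows directly from the description above: each class owns a set of hyperplanes, a class score is the maximum over its hyperplanes' linear outputs, and the predicted class is the argmax of those scores. A minimal sketch, with a hypothetical `weights` layout (a dict mapping class label to a 2-D array of hyperplanes):

```python
import numpy as np

def amm_predict(weights, x):
    """Sketch of AMM prediction: class score is the max over that
    class's hyperplanes; predict the class with the largest score."""
    scores = {c: float(np.max(W @ x)) for c, W in weights.items()}
    return max(scores, key=scores.get)
```

Because each class may hold several hyperplanes, the resulting decision boundary is piecewise linear, which is what lets AMM capture non-linear problems while keeping linear-time prediction per hyperplane.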

### Low-rank Linearization SVM (LLSVM)

To scale up kernel SVM on limited resources, a low-rank linearization approach is proposed that transforms a non-linear SVM into a linear one via an approximate empirical kernel map, computed from an efficient low-rank approximation of the kernel matrix. The algorithm inherits the high efficiency of linear SVMs and the rich representability of kernel classifiers. For more details please see:

*Zhang, K., Lan, L., Wang, Z., Moerchen, F., Scaling up Kernel SVM on Limited Resources: A Low-rank Linearization Approach, International Conference on Artificial Intelligence and Statistics (AISTATS), 2012* [pdf]
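The low-rank linearization idea can be illustrated with a Nyström-style approximate kernel map: pick B landmark points, eigendecompose their B×B kernel matrix, and map every example to explicit B-dimensional features on which any linear SVM solver can then be trained. This is a sketch of the general technique under that assumption, not the exact procedure in the toolbox; function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # pairwise Gaussian RBF kernel between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_map(X, landmarks, gamma=0.5, eps=1e-10):
    """Sketch of low-rank linearization: build an approximate empirical
    kernel map z(x) = K(x, landmarks) @ V @ diag(1/sqrt(eigvals)), so
    z(x) . z(x') approximates K(x, x')."""
    K_ll = rbf_kernel(landmarks, landmarks, gamma)  # B x B landmark kernel
    eigval, eigvec = np.linalg.eigh(K_ll)
    eigval = np.maximum(eigval, eps)                # guard tiny eigenvalues
    M = eigvec / np.sqrt(eigval)                    # B x B linearization map
    K_nl = rbf_kernel(X, landmarks, gamma)          # N x B cross-kernel
    return K_nl @ M                                 # N x B explicit features
```

On the landmarks themselves the map is exact: the inner products of the mapped features reproduce the landmark kernel matrix, which is why a linear SVM on these features approximates the kernel SVM.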

### Budgeted Stochastic Gradient Descent (BSGD)

Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm, and it is an attractive choice for online SVM training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the "curse of kernelization": unbounded linear growth of model size and update time with data size, which may render SGD inapplicable to large data sets. BSGD keeps the number of support vectors bounded during SGD training through several greedy budget maintenance strategies based on removal and merging of support vectors. Note that if the budget is maintained using the merging strategy, only the Gaussian RBF kernel can be used; with the random removal strategy, any kernel function can be used. For more details please see:

*Wang, Z., Crammer, K., Vucetic, S., Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training, Journal of Machine Learning Research (JMLR), 13:3103-3131, 2012* [pdf]
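The merging budget maintenance step can be sketched as follows. This is a deliberate simplification: the actual BSGD algorithm chooses the mixing weight h by a line search that minimizes the weight degradation of the merge, whereas the sketch below uses the coefficient-proportional first guess h = a_i / (a_i + a_j) (valid when both coefficients share a sign); the merged point lies on the segment between the two support vectors, which is why the Gaussian kernel is required.

```python
import numpy as np

def gaussian(x, z, gamma=0.5):
    # Gaussian RBF kernel value between two points
    return np.exp(-gamma * np.sum((x - z) ** 2))

def merge_two_svs(sv_x, sv_a, i, j, gamma=0.5):
    """Simplified sketch of BSGD merging: replace support vectors i and j
    (rows of sv_x, coefficients sv_a) with a single new support vector on
    the segment between them, keeping the budget fixed."""
    a_i, a_j = sv_a[i], sv_a[j]
    h = a_i / (a_i + a_j)                  # simplified mixing weight
    z = h * sv_x[i] + (1 - h) * sv_x[j]    # merged support vector
    # new coefficient preserves the merged pair's contribution at z
    a_z = a_i * gaussian(sv_x[i], z, gamma) + a_j * gaussian(sv_x[j], z, gamma)
    keep = [k for k in range(len(sv_a)) if k not in (i, j)]
    return np.vstack([sv_x[keep], z]), np.append(sv_a[keep], a_z)
```

Random removal, by contrast, simply drops one support vector when the budget overflows, which is why it places no restriction on the kernel.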

## Complexity analysis of the algorithms

The time and memory complexity analysis is given below in Table 2, where *N* is the number of training examples, *C* is the number of classes, *D* is the data dimensionality, sparsity *S* is the average number of non-zero features per example, and *B* is the model size (budget) for AMM, BSGD, and LLSVM. Parameter *I* for SVM with RBF kernel (RBF-SVM) denotes the number of training iterations, empirically shown to be super-linear in *N*.

It can be seen that the implemented algorithms have a very favorable linear dependence on the data set size and a *constant* memory footprint, making our software suitable for large-scale problems with millions of data points. Note also that the running time does not depend on the full dimensionality, but only on the average number of non-zero elements in the training and test data. For that reason, BudgetedSVM can easily handle sparse data sets even with millions of features.

Table 2. Time and space complexities of the algorithms

| Algorithm | Training time | Prediction time | Model size |
| --- | --- | --- | --- |
| **Pegasos** | *O(N·C·S)* | *O(C·S)* | *O(C·D)* |
| **AMM** | *O(N·S·B)* | *O(S·B)* | *O(D·B)* |
| **LLSVM** | *O(N·S·B^2 + N·S·B)* | *O(S·B^2 + S·B)* | *O(D·B + B^2)* |
| **BSGD** | *O(N·(C + S)·B)* | *O((C + S)·B)* | *O((C + D)·B)* |
| **RBF-SVM** | *O(I·N·C·S)* | *O(N·C·S)* | *O(N·C·S)* |