Abstract

We present a new classification model for data mining and machine learning based on combinatorial optimization, namely solving the optimization problem of "normalized cut prime" (NC'). NC' is closely related to the NP-hard normalized cut problem, yet is solvable in polynomial time. NC' is shown to be effective in image segmentation and in approximating the objective function of normalized cut, as compared to the spectral technique. Its adaptation as a supervised classification technique for data mining is called Supervised Normalized Cut (SNC). In a comparative study with the most commonly used data mining and machine learning methods, including support vector machines (SVM), neural networks, PCA, and logistic regression, SNC was shown to deliver highly accurate results within competitive run times.

In scaling SNC to large data sets, its use of pairwise similarities poses a challenge, since the matrix of pairwise comparisons grows quadratically in the size of the data set. We describe a new approach, called sparse computation, that generates only the significant weights without ever generating the entire matrix of pairwise comparisons. Sparse computation runs in time linear in the number of non-zeros in the output; in that it contrasts with known "sparsification" approaches, which require generating the full set of pairwise comparisons in advance and thus take at least quadratic time. Sparse computation is applicable in any setting where pairwise similarities are employed, and can be used to scale the spectral method and k-nearest neighbors as well. The efficacy of sparse computation for SNC is manifested in its retention of accuracy, compared to the use of the fully dense matrix, while achieving a dramatic reduction in matrix density and hence in run times.
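The core idea behind sparse computation can be illustrated with a minimal sketch: project the data onto a low-dimensional space, partition that space into grid cells, and generate candidate pairs only for points falling in the same or neighboring cells, so the full n(n-1)/2 comparisons are never enumerated. The function name, the use of a random 1-D projection (standing in for an approximate principal component), and the grid resolution below are illustrative assumptions, not the paper's actual procedure or API.

```python
import numpy as np

def sparse_similarity_pairs(X, resolution=10, seed=0):
    """Sketch of the sparse-computation idea (illustrative, not the
    paper's algorithm): bucket points into grid cells along a cheap
    1-D projection and emit candidate pairs only for points in the
    same or adjacent cells, never forming the full pairwise matrix."""
    rng = np.random.default_rng(seed)
    # Random unit direction as a stand-in for an approximate principal component.
    direction = rng.standard_normal(X.shape[1])
    direction /= np.linalg.norm(direction)
    proj = X @ direction
    # Assign each point to one of `resolution` grid cells along the projection.
    lo, hi = proj.min(), proj.max()
    cells = np.minimum(((proj - lo) / (hi - lo + 1e-12) * resolution).astype(int),
                       resolution - 1)
    buckets = {}
    for i, c in enumerate(cells):
        buckets.setdefault(int(c), []).append(i)
    # Candidate pairs: same cell, or cell c paired with cell c+1.
    pairs = set()
    for c, members in buckets.items():
        candidates = members + buckets.get(c + 1, [])
        for i in members:
            for j in candidates:
                if j > i:
                    pairs.add((i, j))
    return pairs
```

Only the similarities for these candidate pairs would then be computed, which is what keeps the work proportional to the number of non-zeros in the output rather than to the size of the full matrix.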