Boosting Simple Learners Smoothly

Abstract

Boosting is a powerful method in machine learning where a weak learner that produces hypotheses with a very small correlation to a function can be amplified into a strong learner that predicts the function almost everywhere. Boosting is fundamental to both the theory and application of machine learning (see Shapire, Freund and Shapire). Smoothed boosting introduced by Servedio is a restricted form of boosting where the reweightings are bounded in order not to concentrate the weak learner on a very narrow subset of the inputs.

In addition to being more robust (e.g., to noise in the data), smoothed boosting also is used in the proofs of hardcore distribution lemmas in complexity theory. Hardcore distribution lemmas (Impagliazzo, Holenstein) are used in average-case complexity, cryptography, and pseudo-randomness to decompose a somewhat hard on average Boolean function into ''easy'' and ''pseudo-random'' regions, where the pseudo-random region cannot be predicted efficiently much more than a fair coin.

It is well-known that the optimal number of iterations for boosting is $$O(log 1/\epsilon/ gamma^2)$$, where $$\epsilon$$ is the final error and $$\gamma$$ is the correlation guarantee for the weak learner. A smoothed boosting algorithm with this optimal number of iterations was found by Barak, Hardt and Kale. However, recently, Alon, Gonen, Hazan and Moran beat the optimal for the important special case when the weak learner produces concepts from a class of small VC dimension. Since small VC dimension is also important for generalization, this is a very important special case, and they show that the number of iterations for this special case can be inverse linear in $$\gamma$$. Their boosting algorithm is not smoothed.

We give a simplified presentation of their result, and then a variety of related smoothed boosting algorithms, with different trade-offs in iterations vs. smoothness. We give a perfectly smoothed algorithm that has iteration complexity $$O(1/\epsilon \gamma)$$; while the dependence on $$\gamma$$ matches AGHM, the dependence on $$\epsilon$$ is exponentially worse. We give another that is only $$\epsilon /\log \gamma$$ smoothed, losing a log factor, but has the same number of iterations as AGHM up to logarithmic factors. We show some consequences for hardcore distributions, multicalibration and regularity for small dimensional classes.