Abstract

Occam's razor has a natural expression in PAC learning bounds, where the description length of a hypothesis appears directly as the complexity term. For the transformer models underlying modern AI agents, a different form of Occam's razor appears to hold: transformers that learn a simple state generalize, and even extrapolate, better. The empirical evidence is stark: a factor of 3(!) reduction in extrapolation error. But what are the principles behind this? Standard PAC theory appears inapplicable.

Video Recording