Abstract

We consider sequence-to-sequence models that iterate the same function, drawn from some base function class, at every step to obtain the next token. For example, transformers use the same weights, and thus the same mapping, at each step to obtain the next token from the previous ones. We discuss the computational/representational power of such models even with very simple base classes, and the sample and computational complexity of learning them either end-to-end or with access to the entire "chain of thought".
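As a minimal sketch of this setup (not taken from the talk), the snippet below iterates a single fixed function `f`, standing in for an arbitrary member of the base class, to autoregressively extend a sequence; the helper names `generate`, `prompt`, and `num_steps` are illustrative.

```python
def generate(f, prompt, num_steps):
    """Extend `prompt` by applying the same base function f at every step.

    f maps the sequence of tokens so far to the next token; the same f
    (e.g. a transformer with fixed weights) is reused at each step, and
    the intermediate tokens form the "chain of thought".
    """
    seq = list(prompt)
    for _ in range(num_steps):
        seq.append(f(seq))  # same mapping at every step
    return seq

# Toy base function: next token = sum of the last two tokens mod 10.
print(generate(lambda s: (s[-1] + s[-2]) % 10, [1, 1], 8))
```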
