Abstract

The compute, memory, and energy requirements for training and inference with large language models (LLMs) are reaching trillion-scale levels per input. Long-context attention and KV-cache blowup appear to be hitting a limit. To realize the full promise of LLMs, it is imperative to break these linear resource barriers. Sub-linear algorithms are not an option but a necessity for taking LLMs beyond their current capabilities. We revisit several emerging ideas and successful trends in the use of sub-linear algorithms for future LLMs.