
Abstract
Recent work (e.g., DeepSeek-R1) shows that long chain-of-thought (CoT) reasoning greatly improves the reasoning capability of Large Language Models (LLMs). However, it also poses significant challenges for memory efficiency (and, consequently, time efficiency) during inference, even for problems solvable with small space. This limitation stems from the non-erasable nature of standard CoT, which makes the space complexity (context length) equal to the time complexity (CoT length).
In this talk, we will introduce a new method, PENCIL, which improves the efficiency of CoT by incorporating a reduction mechanism into the autoregressive generation process. PENCIL enables the model to actively discard obsolete tokens by emitting special tokens that trigger the reduction mechanism. We show that PENCIL can perform universal space-efficient computation; that is, PENCIL can simulate Turing machines with maximal context length matching their space complexity and total number of generated tokens matching their time complexity. By effectively reducing the maximal context length, PENCIL also decreases per-token generation time, enabling improved scalability compared to standard CoT. This efficiency gain translates into enhanced performance on complex reasoning tasks, including 97% accuracy on the challenging 5×5 Einstein's puzzle, using a 25M-parameter transformer with a 2048-token context length.
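
To make the reduction idea concrete, here is a minimal sketch of an autoregressive loop that erases obsolete tokens. The special tokens [CALL], [SEP], [RETURN] and the rule that keeps only the answer segment are illustrative assumptions, not necessarily PENCIL's exact formulation; the point is only that the live context stays short while the total number of generated tokens still accumulates as the time cost.

```python
# Illustrative PENCIL-style generation loop (a sketch, not the paper's exact code).
# Hypothetical assumption: the model emits special tokens [CALL], [SEP], [RETURN],
# and the tail "... [CALL] scratch-work [SEP] answer [RETURN]" reduces to "... answer".

CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_tail(tokens: list[str]) -> list[str]:
    """If the sequence ends with [RETURN], erase the obsolete scratch-work,
    keeping only the answer segment between the last [SEP] and [RETURN]."""
    if not tokens or tokens[-1] != RETURN or SEP not in tokens:
        return tokens
    sep = len(tokens) - 1 - tokens[::-1].index(SEP)       # position of last [SEP]
    if CALL not in tokens[:sep]:
        return tokens
    call = sep - 1 - tokens[sep - 1::-1].index(CALL)      # matching [CALL] before it
    return tokens[:call] + tokens[sep + 1:-1]             # drop scratch-work and markers

def generate(next_token, prompt: list[str], max_steps: int = 10_000):
    """Autoregressive loop: the live context shrinks on each reduction,
    while the total number of generated tokens (time cost) keeps growing."""
    context, generated, peak = list(prompt), 0, len(prompt)
    for _ in range(max_steps):
        tok = next_token(context)           # any next-token predictor (stub)
        context.append(tok)
        generated += 1
        peak = max(peak, len(context))
        if tok == RETURN:
            context = reduce_tail(context)  # discard obsolete tokens
        elif tok == "[EOS]":
            break
    return context, generated, peak         # peak can be far smaller than generated
```

In this sketch, peak tracks the maximal context length (the space cost) and generated tracks the CoT length (the time cost); with standard, non-erasable CoT the two coincide, whereas reductions can keep the peak close to the space actually required by the underlying computation.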