Abstract

Large language models (LLMs) are increasingly deployed on complex tasks that require multi-step decision-making, making it crucial to understand their algorithmic reasoning abilities. However, existing benchmarks do not provide fine-grained diagnostics for evaluating these capabilities. We propose to use data structures as a principled lens: as fundamental building blocks of algorithms, they naturally probe structural reasoning—the ability to understand and manipulate relationships such as order, hierarchy, connectivity, and composition. We introduce the Data Structure Reasoning Benchmark (DSR-Bench), which spans 20 data structures, 35 operations, and 4,140 problem instances. DSR-Bench supports fully automated generation and evaluation, and enables fine-grained diagnosis of where structural reasoning breaks down. Evaluating 13 state-of-the-art LLMs reveals critical limitations: failures emerge under compositional and multi-hop reasoning, length scaling, user-specified constraints, spatial distribution shift, and natural-language framing.

This talk describes joint work with Yu He, Yingxi Li, and Colin White.

Video Recording