Abstract

Variable binding—the ability to temporarily associate variables with values while maintaining their independent identity—is considered a fundamental requirement for symbolic computation and higher cognitive functions. Classical theories argue that this capability requires specific architectural features, particularly an addressable read/write memory system supporting indirect addressing. This has led to skepticism about whether neural networks, which lack such explicit architectural support, can implement genuine variable binding mechanisms.

In this study, we investigate whether and how a Transformer-based neural network can learn to solve a variable binding and dereferencing task. We train the model on synthetic programs containing chains of variable assignments, where success requires tracking multiple variable bindings and resolving reference chains of varying depths. The task is specifically designed to require systematic variable binding rather than simple pattern matching, as the network must maintain and traverse complex graphs of variable references while ignoring irrelevant distractor chains.

The Transformer architecture's residual stream provides a high-dimensional vector space that could in principle support variable binding through learned partitioning into functional subspaces. Using mechanistic interpretability techniques, we investigate how the trained network implements this symbolic computation. Our analysis reveals that the network initially learns shallow heuristics and then undergoes a rapid transition to a general algorithm that solves the task by tracking variable assignments.
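For readers unfamiliar with this style of analysis, the sketch below shows one standard interpretability technique, a linear probe on residual-stream activations, which tests whether a candidate functional subspace (here, the identity of the queried variable) is linearly decodable; the dimensions and data are synthetic, and this is not necessarily the specific method used in our analysis:

```python
# Minimal linear-probe sketch on synthetic "residual-stream" activations.
# All sizes and the planted structure are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_samples, n_vars = 64, 500, 4

# Pretend these are residual-stream vectors collected at the query position,
# labelled by which variable the prompt asks about.
labels = rng.integers(0, n_vars, size=n_samples)
directions = rng.normal(size=(n_vars, d_model))            # planted structure
acts = directions[labels] + 0.5 * rng.normal(size=(n_samples, d_model))

# One-hot targets and a least-squares linear probe.
targets = np.eye(n_vars)[labels]
probe, *_ = np.linalg.lstsq(acts, targets, rcond=None)

preds = (acts @ probe).argmax(axis=1)
print(f"probe accuracy: {(preds == labels).mean():.2f}")    # high if linearly decodable
```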

This work contributes to our understanding of how neural architectures might support symbolic computation, suggesting that capabilities traditionally thought to require specific architectural features can emerge through learning. We conclude by reflecting on the implications of these findings for cognitive science and artificial intelligence.