Abstract

A trained Large Language Model (LLM) encodes much of human knowledge. Remarkably, many concepts can be recovered from the internal activations of neural networks via linear "probes", which are, mathematically, single index models. I will discuss how such probes can be constructed and used with Recursive Feature Machines, a feature-learning kernel method originally designed for extracting relevant features from tabular data.
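
To make the "single index" claim concrete: such a model predicts a label from activations only through a single learned direction, f(x) = g(w·x) for a scalar link function g, and a logistic-regression probe is the simplest instance. The sketch below is a minimal illustration under stated assumptions: the "activations" are synthetic Gaussian data with a planted concept direction, and scikit-learn's logistic regression stands in for the probe. It is not the RFM-based construction discussed in the talk, only the linear-probe baseline it builds on.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations: n examples, d dimensions.
# (Assumption for illustration; real probes use a model's hidden states.)
n, d = 1000, 64
X = rng.standard_normal((n, d))

# Hypothetical "concept direction" w_star: the label depends on the
# activations only through the projection w_star . x (single index structure).
w_star = rng.standard_normal(d)
y = (X @ w_star > 0).astype(int)

# A linear probe. Logistic regression is a single index model:
# f(x) = sigmoid(w . x + b).
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# The learned weights should recover the concept direction up to scale.
w_hat = probe.coef_.ravel()
cos = w_hat @ w_star / (np.linalg.norm(w_hat) * np.linalg.norm(w_star))
print("cosine similarity with planted direction:", round(float(cos), 3))

When the concept truly has single index structure, the probe's weight vector aligns with the planted direction (cosine similarity near 1), which is the sense in which a concept is "recovered" from activations.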