Humans and (almost certainly) many animals possess extremely rich low-to-mid-level visual scene understanding concepts, including the ability to estimate contours and border ownership, optical flow and self-induced motion, a "2.5D sketch" of monocular depth and surface normals, various kinds of segmentation, 3D shape, and materials. But where do these concepts come from, given that real biological organisms receive no detailed supervision for these rich visual properties? I will present a working theory of how they arise, based on recent work from my lab on Counterfactual World Models (CWMs). Specifically, I will describe a form of masked prediction that enables the training of large-scale predictive models that organically possess high-quality, causally informative tokens. I will then show how a wide variety of mid-level visual concepts arise from performing simple, generic interventions on these tokens and computing counterfactual effects therefrom. The resulting model takes a step toward a generic unsupervised algorithm for visual scene understanding in machines, and helps us formulate novel and intriguing hypotheses about the origins of biological vision.
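To make the intervention idea concrete, here is a minimal, hedged sketch (not the actual CWM implementation): a model trained with masked prediction is probed by perturbing one visible patch of its input and differencing the two predictions. The function names (`counterfactual_effect`, `toy_predict`), the patch size, and the stand-in blur "predictor" are all illustrative assumptions, not details from the talk.

```python
import numpy as np

def counterfactual_effect(predict, frame, patch_yx, delta, mask_ratio=0.9, seed=0):
    """Probe a predictive model with a simple generic intervention:
    mask most of the frame, perturb one visible patch, and measure
    how the prediction changes (the counterfactual effect)."""
    rng = np.random.default_rng(seed)
    h, w = frame.shape
    # Hide most of the input, as in masked-prediction training.
    mask = rng.random((h, w)) < mask_ratio
    visible = np.where(mask, 0.0, frame)
    baseline = predict(visible)
    # Intervene: change the intensity of one small 4x4 patch.
    y, x = patch_yx
    intervened = visible.copy()
    intervened[y:y + 4, x:x + 4] += delta
    counterfactual = predict(intervened)
    # The counterfactual effect is the difference between predictions.
    return counterfactual - baseline

# Stand-in "predictor" (an assumption for the demo): a local average,
# so an intervention's effect spreads only to neighboring pixels.
def toy_predict(img):
    out = img.copy()
    out[1:-1, 1:-1] = (img[:-2, 1:-1] + img[2:, 1:-1] +
                       img[1:-1, :-2] + img[1:-1, 2:] +
                       img[1:-1, 1:-1]) / 5.0
    return out

frame = np.ones((16, 16))
effect = counterfactual_effect(toy_predict, frame, patch_yx=(6, 6), delta=1.0)
# The effect is localized around the intervened patch; in a CWM-style model,
# the spatial structure of such effects would reveal segments, flow, etc.
```

In a real model the perturbation would be applied to learned tokens rather than raw pixels, and reading off the spatial pattern of the effect is what yields mid-level structure.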