Abstract
The human visual system does not passively view the world but actively moves its sensor array through eye, head, and body movements. Here we explore the consequences of this active perception setting for learning efficient visual representations. This work focuses on two specific questions: 1) what is the optimal spatial layout of the image sampling lattice for visual search via eye movements? and 2) how should information be assimilated across multiple fixations to form a holistic scene representation that allows for visual reasoning about the compositional structure of the scene? We address these questions through the framework of end-to-end learning in structured neural networks trained to perform search and reasoning tasks. The derived models provide new insight into the neural representations necessary for an efficient, functional active perception system.