Abstract
Lesson 1 from the classic paper "The Development of Embodied Cognition: Six Lessons from Babies" is "Be Multimodal". This talk explores how recent work in the computer vision literature on audio-visual self-supervised learning addresses this challenge. The aim is to learn audio and visual representations and capabilities directly from the audio-visual stream of a video, without any manual supervision of the data, much as an infant could learn from the correspondence and synchronization between what they see and hear. It is shown that a neural network that simply learns to synchronize the audio and visual streams is able to localize the faces that are speaking (active speaker detection) and the objects that sound.
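One common way to train such a network, sketched below, is a contrastive (InfoNCE-style) objective: temporally aligned (video clip, audio clip) pairs are treated as positives and misaligned pairs within a batch as negatives. This is a minimal NumPy illustration of the general idea, not the specific architecture or loss used in the work described in the talk; the function name, temperature value, and embedding shapes are all assumptions for the sake of the example.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each embedding vector to unit length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def sync_contrastive_loss(vis, aud, temperature=0.1):
    """InfoNCE-style synchronization loss (illustrative sketch).

    vis, aud: (B, D) arrays of visual and audio clip embeddings,
    where row i of each comes from the same moment in the video.
    Aligned pairs (the diagonal) are positives; all other pairings
    in the batch serve as negatives.
    """
    v = l2_normalize(vis)
    a = l2_normalize(aud)
    logits = v @ a.T / temperature                    # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # positives on diagonal

# Toy check: perfectly aligned embeddings give a lower loss than
# the same embeddings paired in the wrong temporal order.
rng = np.random.default_rng(0)
vis = rng.normal(size=(8, 16))
print(sync_contrastive_loss(vis, vis))        # aligned: low loss
print(sync_contrastive_loss(vis, vis[::-1]))  # misaligned: higher loss
```

Once trained this way, spatial localization (which face is speaking, which object is sounding) typically falls out of inspecting where in the frame the visual features agree most strongly with the audio.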