by Anil Ananthaswamy (Simons Institute Science Communicator in Residence)
In August 2014, a significant advance in computing made the cover of the journal Science. It was IBM’s 5.4 billion-transistor chip that had a million hardware neurons and 256 million synapses. Algorithms running on this “neuromorphic” chip, when fed a video stream, could identify multiple objects, such as people, bicycles, trucks, and buses. Crucially, the hardware neural network consumed a mere 63 milliwatts, about 176,000 times less energy per synaptic event than the same network simulated on a general-purpose microprocessor.
The secret behind the low energy consumption was the type of hardware neurons on the chip. Unlike artificial neurons in modern deep neural networks, these were “spiking” neurons. Much like their biological counterparts, these neurons communicated via electrical spikes. Researchers have been studying spiking neural networks (SNNs) for decades in hopes of emulating the brain and, more recently, to build better, energy-efficient neural networks.
Computational neuroscientists think that spiking neural networks will bring them closer to understanding the brain than is possible with current deep neural networks. “My main motivation to think about spiking neural networks is because it's the language the brain speaks,” said computational neuroscientist Friedemann Zenke of the Friedrich Miescher Institute for Biomedical Research (FMI) in Basel, Switzerland. “So if you want to understand the signals that we can measure from the brain, we need to learn the language.”
Artificial intelligence researchers, on the other hand, would like to build deep neural networks that have both the brain’s remarkable abilities and its extraordinary energy efficiency. The brain consumes only about 20 watts of power. If the brain achieves its ends partly because of spiking neurons, some think that energy-efficient deep artificial neural networks (ANNs) would also need to follow suit.
But spiking neural networks have been hamstrung. The very thing that made them attractive — communicating via spikes — also made them extremely difficult to train. The algorithms that ran on IBM’s chip, for instance, had to be trained offline at considerable computational cost.
That’s set to change. Researchers have developed new algorithms to train spiking neural networks that overcome some of the earlier limitations. And at least for networks of tens of thousands of neurons, these SNNs perform as well as regular ANNs. Such networks would likely be better at processing data that has a temporal dimension (such as speech and videos) compared with standard ANNs. Also, when implemented on neuromorphic chips, SNNs promise to open up a new era of low-energy, always-on devices that can operate without access to services in the cloud.
One of the key differences between a standard ANN and a spiking neural network is the model of the neuron itself. Any artificial neuron is a simplified computational model of biological neurons. A biological neuron receives inputs into the cell body via its dendrites; and based on some internal computation, the neuron may generate an output in the form of a spike on its axon, which then serves as an input to other neurons. Standard ANNs use a model of the neuron in which the information is encoded in the firing rate of the neuron. So the function that transforms the inputs into an output is often a continuous valued function that represents the spiking rate. This is achieved by first taking a weighted sum of all the inputs and then passing the sum through an activation function. For example, a sigmoid turns the weighted sum into a real value between 0 and 1.
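The rate-coded neuron described above can be sketched in a few lines. This is a minimal illustration; the weights and inputs are arbitrary values of my choosing:

```python
import math

def sigmoid(x):
    """Squash a real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def rate_neuron(inputs, weights, bias=0.0):
    """Rate-coded artificial neuron: a weighted sum of the inputs
    followed by a sigmoid activation. The output can be read as a
    firing rate normalized to (0, 1)."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(total + bias)

# Strong net excitation -> rate near 1; strong inhibition -> rate near 0.
high = rate_neuron([1.0, 1.0], [2.0, 3.0])    # sigmoid(5) ≈ 0.993
low = rate_neuron([1.0, 1.0], [-2.0, -3.0])   # sigmoid(-5) ≈ 0.007
```

The continuous output is what makes this model compatible with gradient-based training, as discussed below.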
In a spiking artificial neuron, on the other hand, the information is encoded in both the timing of the output spike and the spiking rate. The most commonly used model of such a spiking neuron in artificial neural networks is called the leaky integrate-and-fire (LIF) neuron. Input spikes cause the neuron’s membrane potential — the electrical charge across the neuron’s cell wall — to build up. There are also processes that cause this charge to leak; in the absence of input spikes, any built-up membrane potential starts to decay. But if enough input spikes come within a certain time window, then the membrane potential crosses a threshold, at which point the neuron fires an output spike. The membrane potential resets to its base value. Variations on this theme of an LIF neuron form the basic computational units of spiking neural networks.
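A minimal LIF neuron along these lines might look as follows. The time constant, threshold, and input weight here are illustrative values of my choosing, not taken from any particular model:

```python
import math

def simulate_lif(input_spikes, dt=1.0, tau=10.0, threshold=1.0,
                 weight=0.3, v_reset=0.0):
    """Minimal leaky integrate-and-fire neuron.

    input_spikes: list of 0/1 values, one per time-step.
    Each input spike bumps the membrane potential up by `weight`;
    between spikes the potential leaks toward zero with time
    constant `tau`. Crossing `threshold` emits an output spike
    and resets the potential.
    """
    decay = math.exp(-dt / tau)
    v = v_reset
    output = []
    for s in input_spikes:
        v = v * decay + weight * s   # leak, then integrate the input
        if v >= threshold:           # threshold crossing -> fire
            output.append(1)
            v = v_reset              # reset after the spike
        else:
            output.append(0)
    return output

# Four closely spaced input spikes push the potential over threshold,
# but the same four spikes spread far apart leak away and never fire.
burst = simulate_lif([1, 1, 1, 1] + [0] * 6)    # one output spike
sparse = simulate_lif([1, 0, 0, 0, 0] * 4)      # no output spikes
```

The timing-sensitivity is visible here: the neuron's output depends not just on how many input spikes arrive, but on when.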
In 1997, Wolfgang Maass of the Institute of Theoretical Computer Science at Technische Universität Graz, Austria, showed that such SNNs are computationally more powerful, in terms of the number of neurons needed for a given task, than ANNs with rate-coding neurons that use a sigmoid activation function. He also showed that SNNs and ANNs are equivalent in their ability to compute functions (an important equivalence, since an ANN’s claim to fame is that it is a universal function approximator: given some input, an ANN can be trained to approximate any function transforming that input into a desired output).
Maass’s work inspired some to look anew at SNNs. This was still the late 1990s, and ANNs hadn’t become as ubiquitous and powerful as they are now. “People were looking for new paradigms and new ways of computing with neurons that could be more promising than what at that point was being done with artificial neural networks,” said Sander Bohte, a computational neuroscientist at Centrum Wiskunde & Informatica (CWI), the Dutch national research institute for mathematics and computer science in Amsterdam.
But despite their computational power, SNNs posed a major challenge: they were extremely hard to train, as compared with standard ANNs, which are trained using the backpropagation algorithm. The algorithm updates the strength of the connections, or weights, between neurons during training, proceeding backward from the network’s output to the input layer. Doing this iteratively over large samples of training data allows the ANN to arrive at some optimal value for its weights that can then transform a given input into the desired output. The algorithm, however, requires that the activation function used for the ANN’s neurons be differentiable. That’s because the algorithm starts with a loss function, which quantifies the error the ANN makes during training. It then uses differentiation to calculate the gradient of the loss function with respect to the weights, and minimizes the loss via a method called gradient descent. The algorithm therefore works only for differentiable activation functions and cannot be used for spiking neural networks: the output of a spiking neuron is neither continuous nor differentiable everywhere (when the neuron is not spiking, its gradient is zero, and it is infinite at the moment the spike occurs).
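The problem can be seen numerically: a sigmoid has a usable gradient at every point, while a hard spike threshold (a Heaviside step) has zero gradient almost everywhere, leaving gradient descent with nothing to follow. A small sketch, with an arbitrary evaluation point of my choosing:

```python
import math

def numeric_grad(f, x, eps=1e-4):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
heaviside = lambda x: 1.0 if x >= 0 else 0.0  # hard spike threshold

# The sigmoid provides a nonzero gradient everywhere ...
g_sig = numeric_grad(sigmoid, 0.5)     # ≈ 0.235
# ... but the step function's gradient vanishes away from the
# threshold, so backprop gets no signal to update the weights.
g_step = numeric_grad(heaviside, 0.5)  # 0.0
```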
So, in the early 2000s, inspired by Maass’s paper, Bohte began working on making backprop work for spiking neural networks. Training a network involves two parts: the forward pass, when the network is transforming an input into an output, and the backward pass, in which the blame for the overall loss is apportioned to each individual weight in the network (via backprop), and the weights are appropriately modified. Bohte’s insight was to let the neurons spike during the forward pass, but use an approximation for the neuron’s output during the backward pass. “You have to find a solution for approximating the gradient when the neuron spikes,” said Bohte. “How do the weights and the input influence the spike time of your output? That's the relation that you're trying to get.” He made the assumption that the relation is roughly linear around the time the neuron’s membrane potential crosses the threshold, causing the neuron to spike. So, if x is the output of a neuron and t is time, then δx/δt varies linearly, instead of being zero or infinite, over some suitably small interval δt straddling the instant when the neuron spikes. With some other strategies (such as the choice of initialization of weights before training begins), Bohte was able to avoid the discontinuity at the time of spiking, and thus to use his version of backpropagation to train the network. It was the first demonstration of such an algorithm for SNNs.
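This kind of linearization can be checked on a toy membrane. Assume (my example, not the construction in Bohte's paper) a potential that rises as u(t) = w(1 − e^(−t/τ)) toward a threshold θ, spiking when u = θ. Treating u as locally linear in t at the crossing gives dt*/dw ≈ −(∂u/∂w)/(∂u/∂t), which agrees with a finite-difference estimate of how the spike time shifts with the weight:

```python
import math

TAU, THETA = 10.0, 1.0   # membrane time constant and threshold

def spike_time(w):
    """First time at which u(t) = w * (1 - exp(-t/TAU)) reaches THETA
    (the neuron only fires if w > THETA)."""
    return -TAU * math.log(1.0 - THETA / w)

def linearized_dt_dw(w):
    """Local linearization at the threshold crossing:
    dt*/dw = -(du/dw) / (du/dt), both evaluated at t = t*."""
    t_star = spike_time(w)
    du_dw = 1.0 - math.exp(-t_star / TAU)        # = THETA / w at t*
    du_dt = (w / TAU) * math.exp(-t_star / TAU)  # membrane slope at t*
    return -du_dw / du_dt

w, eps = 2.0, 1e-5
numeric = (spike_time(w + eps) - spike_time(w - eps)) / (2 * eps)
analytic = linearized_dt_dw(w)
# Both come out to about -5.0 for these values: increasing the
# weight makes the neuron fire earlier.
```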
But his hand-crafted backprop for SNNs wasn’t scalable. ANNs, on the other hand, turned a corner in 2012, when a deep neural network named AlexNet won an annual image recognition contest. It had 60 million parameters (or weights), and researchers used backpropagation and GPUs to train the network. It was entirely unclear how spiking neural networks could ever scale such heights.
That began changing around 2015. Tim Lillicrap, who was then a postdoctoral researcher at the University of Oxford and is now at Google DeepMind in London, published with colleagues a paper showing that the matrix of weights used by a neural network can be different for the forward pass and the backward pass, and it could still learn. First, you do a forward pass and calculate the loss. The loss is then used to update the weights. However, during the backward pass, the backprop algorithm always uses a randomly initialized weight matrix that never changes during training. It’s akin to having a random gradient for the loss function, which results in some update to the weight matrix used for the forward pass in each iteration of the training. Despite this seemingly untenable setup, backpropagation manages to minimize the loss function using gradient descent. Lillicrap and colleagues called their method “feedback alignment.”
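Feedback alignment is easy to sketch on a toy linear network. This is my own construction, not the experiments in the paper: the forward pass uses weight matrices W1 and W2 as usual, but the error is sent back through a fixed random matrix B instead of the transpose of W2, and the training loss still falls:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer linear network trained with feedback alignment:
# the backward pass routes the error through a fixed random matrix B
# instead of the transpose of the forward weight matrix W2.
n_in, n_hid, n_out, n_samples = 4, 8, 2, 64
X = rng.normal(size=(n_samples, n_in))
T = X @ rng.normal(size=(n_in, n_out))       # random linear targets

W1 = 0.1 * rng.normal(size=(n_in, n_hid))    # forward weights
W2 = 0.1 * rng.normal(size=(n_hid, n_out))
B = rng.normal(size=(n_out, n_hid))          # fixed feedback matrix

def loss():
    return float(np.mean((X @ W1 @ W2 - T) ** 2))

loss_start = loss()
lr = 0.01
for _ in range(500):
    H = X @ W1                               # forward pass
    E = H @ W2 - T                           # output error
    W2 -= lr * H.T @ E / n_samples           # true gradient for W2
    W1 -= lr * X.T @ (E @ B) / n_samples     # error routed through B
loss_end = loss()                            # lower than loss_start
```

In Lillicrap and colleagues' experiments, the effect held in deeper, nonlinear networks too: during training, the forward weights drift into alignment with the fixed feedback matrix, which is what gives the method its name.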
This was a huge moment for computational neuroscientists, who had mostly been skeptical of backpropagation as a biologically plausible learning mechanism for the brain. Feedback alignment showed that the weights for the forward and backward passes need not be symmetric, an important advance, since symmetry implies propagating a signal through the same axon in the backward direction, a physical impossibility in a biological neuron. “You can mangle the gradient signals massively and still get substantial learning in complicated, hierarchical networks,” said Zenke. That suggested to him that they could “come up with approximate gradients for spiking neural networks,” and still be able to use industry-grade backpropagation to train SNNs.
This realization led to a surge in attempts to train spiking neural networks by replacing the unusable gradients of spiking neurons at the instant of spiking with some suitable approximation. Multiple teams arrived at the idea independently, including Steven Esser and colleagues at IBM; Maass and his doctoral student Guillaume Bellec and colleagues; and Emre Neftci of the University of California, Irvine, together with Hesham Mostafa of Intel and Zenke.
Neftci, Mostafa, and Zenke called it the method of “surrogate gradients” — and the name stuck. The basic idea is this: replace the Heaviside function used for thresholding the output of the spiking neuron with a continuous, differentiable function (such as a sigmoid), but only for the backward pass. Their innovation was in developing a method that could leverage the power of modern deep learning frameworks to do backpropagation. First, they showed how an SNN can be treated as a recurrent neural network (a network in which information doesn’t just flow in one direction, say, from neuron A to B, as it does in standard feedforward ANNs, but can flow backward too, from B to A). The recurrence in SNNs can be explicit, with actual recurrent connections, or implicit, in that the recurrence is an outcome of the spiking behavior of the neurons. A recurrent neural network (RNN) is architecturally equivalent to a sequence of feedforward networks, with one such network per time-step. Once an RNN is flattened out in this manner, one can use the backpropagation algorithm to train the RNN, with the errors propagating backward through the entire sequence of feedforward networks. The method is called backpropagation through time (BPTT).
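The core of a surrogate gradient can be sketched in NumPy. This is a hand-rolled illustration: in practice the forward/backward pair is written as a custom autograd function in a framework such as PyTorch, and the surrogate's steepness (BETA below) is a value of my choosing:

```python
import numpy as np

BETA = 5.0  # steepness of the surrogate sigmoid (a hyperparameter)

def spike_forward(u, threshold=1.0):
    """Forward pass: hard Heaviside threshold on the membrane potential."""
    return (u >= threshold).astype(float)

def spike_backward(u, threshold=1.0):
    """Backward pass: the derivative of a sigmoid centered on the
    threshold stands in for the Heaviside's unusable gradient
    (zero almost everywhere, infinite exactly at the threshold)."""
    s = 1.0 / (1.0 + np.exp(-BETA * (u - threshold)))
    return BETA * s * (1.0 - s)

u = np.array([0.2, 0.95, 1.0, 1.6])   # membrane potentials
spikes = spike_forward(u)             # -> [0., 0., 1., 1.]
grads = spike_backward(u)             # finite and nonzero everywhere,
                                      # largest near the threshold
```

Only the backward pass is changed; the network still communicates in hard spikes during the forward pass.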
In 2019, Zenke and colleagues showed how surrogate gradients and BPTT could be used to train much larger spiking neural networks than had been possible before. This year, Zenke and Tim Vogels of the University of Oxford showed that the choice of the type of surrogate gradient — which becomes a hyperparameter for the network — can influence whether an SNN can be trained effectively. While changing the shape of the surrogate gradient doesn’t impact learning, its scale does. Gradients with values normalized to 1 perform well, but surrogate gradients that have large values can negatively impact learning.
The method has caught on. “It's now used quite widely for optimizing spiking neural networks,” said Zenke. “And this has now opened the door to the same breakthroughs that we had in deep learning.” Spiking neural networks trained using such surrogate gradients and BPTT are matching the performance of standard ANNs for some of the smaller tasks, such as recognizing digits in the MNIST data set.
Dan Goodman, of Imperial College London, thinks that this technique for training SNNs is “the most promising direction at the moment.” But he acknowledges that it still has technical problems that need solving. For example, the method is a memory hog. Because of BPTT, if your spiking neural network has to be sensitive to spikes at a resolution of one millisecond, you essentially have to store a copy of the entire state of the network at every one-millisecond time-step. If your network has a thousand neurons and runs for just one second (a thousand time-steps), that amounts to a thousand copies of its state, or roughly a million stored neuron states. “It all has to be in memory when you compute the gradient backwards,” said Goodman. “You can see that very quickly, you run out of [memory].”
A new technique for training spiking neural networks may overcome this limitation. In June of this year, Timo Wunderlich and Christian Pehle, who were then at the Kirchhoff-Institute for Physics, Heidelberg University, Germany, published an algorithm that can compute exact gradients for spiking neural networks. Their algorithm draws on insights from physics. When a physical system’s parameters must be tuned by minimizing some cost function, physicists often define, alongside the equations that govern the system’s dynamics, additional so-called adjoint differential equations. Solving these yields values for adjoint variables, which can then be used to compute exact gradients of the original cost function.
In their work, Wunderlich and Pehle started with dynamical equations that describe a network of spiking LIF neurons. Their goal was to find parameters, or synaptic weights, that minimize a loss function that depends on state variables (such as the membrane potentials of the neurons). They then defined the adjoint equations, which can be solved and evaluated backward in time to arrive at the relevant gradients of the loss function. These gradients are well defined everywhere except at critical points in the parameter space, where spikes are added or removed. Insights from studies of hybrid systems (continuous dynamical systems with occasional jumps in their evolution) show that the partial derivatives of the state variables with respect to the synaptic weights jump at such a discontinuity, and that this jump can be calculated. This is possible because of the implicit function theorem, which also allows one to compute the total derivative of the spike time with respect to the synaptic weights in terms of partial derivatives of the state variables, without analytical knowledge of the spike times. The adjoint variables also jump at this precise moment in time. Combining the backward evolution of the adjoint variables with the jumps when spikes occur allows their algorithm to evaluate the exact gradients of the original cost function.
The crucial property of the algorithm, which they call EventProp, is its potentially reduced memory requirement compared with backpropagation through time, which requires the entire state of the network to be stored at each time-step. “At least for LIF neurons, you only need to maintain the state at spike times,” said Pehle. “So the rest of it, you can just forget. And, in practice, that has the potential to make a huge difference.”
The team has used EventProp to train SNNs on small data sets, such as MNIST. “So far, we perform as well or slightly better than [the surrogate gradient] method,” said Pehle. But, “the surrogate gradient methods have been scaled to larger problems. I'm hopeful that [EventProp] can scale to larger networks.”
Meanwhile, Goodman and colleagues have been training SNNs with different techniques and have settled on using the surrogate gradient method for now. “[It] was the only one that we could robustly get to work,” said Goodman.
While examining the various training methods, Goodman and colleagues noticed something: for any given SNN, researchers always used neurons of exactly the same type. In other words, the parameters that defined a neuron did not change between populations of neurons in a network. This is unlike the brain. “If you look at the brain, those neuron parameters vary a lot from neuron to neuron,” said Goodman. What if you allowed these parameters to change during training?
Goodman’s team decided to let the training algorithms tune not just the strengths of the connections between neurons, but also each neuron’s time constant, which dictates the rate at which the membrane potential decays. They found that such a “heterogeneous” SNN was no better than a “homogeneous” SNN, in which training updates only the weights, when the task had no time-varying, or temporal, component, such as classifying digits in the MNIST data set.
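The role of per-neuron time constants is easy to see in a sketch (with illustrative values of my choosing): each neuron gets its own decay factor exp(−dt/τ), so a heterogeneous population spans several timescales at once. In a trainable network, the τ values would be optimized alongside the weights.

```python
import numpy as np

dt = 1.0                                # time-step, in ms
taus = np.array([2.0, 10.0, 50.0])      # heterogeneous time constants
alphas = np.exp(-dt / taus)             # per-neuron decay factors

# One input spike at t = 0, then pure leak for 10 ms. Neurons with
# longer time constants retain the input for longer, so a mixed
# population covers several timescales at once.
v = np.ones(3)                          # potentials just after the spike
for _ in range(10):
    v = alphas * v

# v ≈ [0.007, 0.368, 0.819]: the fast neuron has already forgotten
# the spike, while the slow one still remembers it.
```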
But when the task had a strong temporal structure, the improvement was substantial. Last year, Zenke and colleagues published such a data set, the Spiking Heidelberg Digits (SHD) data set, a spiking audio analogue of the MNIST data set. They used a model of an artificial cochlea to turn 10,000 audio recordings of spoken digits (from 12 speakers) into spikes in 700 channels, imitating the kinds of inputs received by biological neurons. These inputs were then fed into SNNs, whose task was to classify the digits encoded by these spikes. Goodman’s team found that a heterogeneous SNN did much better than a homogeneous one. “We found that heterogeneity in time constants had a profound impact on performance on those training datasets where information was encoded in the precise timing of input spikes,” the authors wrote in a Nature Communications paper. “On the most temporally complex auditory tasks, accuracy improved by a factor of around 15–20%.”
Heterogeneity may also be a metabolically efficient solution for the brain, in that far fewer heterogeneous neurons are needed to accomplish the same task than homogeneous ones. In one comparison performed by Goodman and colleagues, an SNN without heterogeneity required 1,024 neurons to achieve an accuracy of 83.2% on the SHD data set, whereas a heterogeneous SNN achieved 82.7% accuracy using a mere 128 neurons. “The improvement was really substantial,” said Goodman.
All this bodes well for the day when spiking neural networks can be implemented on the numerous neuromorphic chips that are in development. The hope is that such networks can be both trained and deployed using dedicated hardware that sips rather than sucks energy.