Metrics for Evaluating Network Performance

How do we evaluate whether a spiking neural network is performing 'well' or 'badly' on a task? Up until now, the methods used have been a mix of formal definitions and ad-hoc measures. The result is a large number of metrics that are difficult to compare, difficult to validate, and notoriously difficult to reproduce. Often it is not clear whether a measure is even an accurate representation of the statistics of the behaviour. In this discussion group we will explore ideas for adding some statistical robustness and formal rigour to measuring neural performance. Many of these ideas derive from Information Theory, and we will introduce the fundamental concepts of Information Theory needed to approach the problem systematically. Discussion and opinions are welcome on all aspects of measuring performance and its application to different tasks: for example, a classification task that involves identifying different classes of road sign is manifestly different from a robotic task that asks the robot to navigate a series of roads (possibly with signs!) to reach its destination safely and efficiently.


Day Time Location
Thu, 05.05.2016 14:00 - 15:00 Sala Panorama

The group identified three distinct benchmarking classes.

1) Behavioural benchmarks for whole networks
2) Dynamical reproduction benchmarks for specific neuron and synapse models
3) Benchmarks for comparing against biological datasets

With respect to the first, it was quickly acknowledged that any metrics developed
will be highly task-specific. Most people expressed a general dissatisfaction
with present benchmarks, which tend to use ad-hoc metrics open to mathematical
criticism. Two examples of more formally justifiable metrics were presented. The
first considers a synfire chain. The dynamics of the synfire chain can be
expressed in terms of initial spike activity vs. synchrony, in which there is
a clear separatrix between networks whose long-term dynamics will dissipate
(i.e. the synfire waves will disappear) and networks whose long-term dynamics
remain stable and convergent (i.e. the synfire waves propagate indefinitely with
increasing synchronisation). Since the network can be characterised in this way
it was proposed that the degree to which the implemented network reproduces this
theoretical separatrix could be a benchmark. (The 'degree of match' remains
something that must be formally defined.) The second example considers object
tracking. If one has a ball following a ballistic trajectory and neglects air
resistance (as can be achieved, e.g., by simulating a ball on a computer
screen), the object tracking performance can be measured by the degree of
match in both time and position to the trajectory. Again the physics of the
problem allows the analytic solution to be described and thus the degree of
match to be evaluated against it.

One member considered the idea that separate benchmark figures on a variety of
tasks could be combined into a Quality of Service figure. However, most of the
participants noted that since there was no way to normalise individual metrics
given the radically different nature of the tasks and the measurements, such an
approach must be considered dubious. In the end the group indicated that in
essence reasonable metrics will be in large part a matter of good experimental
design. It must be considered essential that the expected behaviour can be
defined using some closed-form mathematical expression so that the actual network
can be compared relative to an absolute reference.
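
For the ballistic tracking example above, a comparison against a closed-form reference might be sketched as follows. This is only an illustration under stated assumptions: the function names, the SI units, and the choice of root-mean-square position error as the 'degree of match' are hypothetical and were not agreed by the group. The sketch scores position error at given sample times only; it does not attempt to score timing error separately.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed units)


def ballistic_position(t, x0, y0, vx0, vy0, g=G):
    """Closed-form position of a drag-free projectile at time t."""
    return (x0 + vx0 * t, y0 + vy0 * t - 0.5 * g * t * t)


def rms_tracking_error(times, tracked, x0, y0, vx0, vy0, g=G):
    """Root-mean-square Euclidean distance between the tracked positions
    and the analytic trajectory, evaluated at the same sample times."""
    if not times or len(times) != len(tracked):
        raise ValueError("times and tracked must be non-empty and equal in length")
    total = 0.0
    for t, (x, y) in zip(times, tracked):
        rx, ry = ballistic_position(t, x0, y0, vx0, vy0, g)
        total += (x - rx) ** 2 + (y - ry) ** 2
    return math.sqrt(total / len(times))


if __name__ == "__main__":
    times = [0.0, 0.1, 0.2, 0.3]
    # Hypothetical tracker output: the true trajectory shifted by a fixed offset.
    tracked = [
        (x + 0.03, y + 0.04)
        for x, y in (ballistic_position(t, 0.0, 0.0, 1.0, 2.0) for t in times)
    ]
    print(rms_tracking_error(times, tracked, 0.0, 0.0, 1.0, 2.0))
```

Because the reference is analytic, the error is absolute rather than relative to another implementation, which is the property the group considered essential.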

The second type of benchmark was felt to be the easiest, because hardware and
other systems are necessarily reproducing a model that can be defined
mathematically. Platforms can then be compared relative to a reference simulator
which is considered to give the definitive solution. There was some question as
to what the reference simulator should be - certain participants had worked with
Mathematica to produce high-quality results but the exact nature of the tests for
similarity needed to be confirmed. In general, however, the group agreed
that model matching could in principle be benchmarked in this way provided
representational precision was high enough in the reference simulator.
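
As a rough sketch of comparing a platform against a reference, consider a leaky integrator driven towards a constant level, for which an exact solution is known. Everything here is an assumption made for illustration: the closed-form solution stands in for the reference simulator, a forward-Euler loop stands in for the platform under test, and maximum absolute deviation is just one candidate similarity test, since the group left the exact tests open.

```python
import math


def reference_trace(v0, v_inf, tau, dt, steps):
    """Exact solution of dv/dt = (v_inf - v) / tau sampled every dt."""
    return [v_inf + (v0 - v_inf) * math.exp(-(k * dt) / tau)
            for k in range(steps + 1)]


def euler_trace(v0, v_inf, tau, dt, steps):
    """Forward-Euler integration of the same equation (the 'platform')."""
    v = v0
    trace = [v]
    for _ in range(steps):
        v += dt * (v_inf - v) / tau
        trace.append(v)
    return trace


def max_abs_deviation(trace_a, trace_b):
    """Largest pointwise difference between two equally sampled traces."""
    if len(trace_a) != len(trace_b):
        raise ValueError("traces must have the same number of samples")
    return max(abs(a - b) for a, b in zip(trace_a, trace_b))


if __name__ == "__main__":
    # Hypothetical parameters: decay from 1.0 towards 0.0, tau = 10 time units.
    for dt, steps in ((1.0, 50), (0.1, 500)):
        ref = reference_trace(1.0, 0.0, 10.0, dt, steps)
        sim = euler_trace(1.0, 0.0, 10.0, dt, steps)
        print(dt, max_abs_deviation(ref, sim))
```

The deviation shrinks as the step size is reduced, which illustrates why the precision of the reference itself has to be well above that of the platforms being compared.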

With the third type of metric the group found more difficulty. There are issues
related to the fact that there is no absolute reference for comparison - data is
simply data. Furthermore, spike comparison raises the issue of spike matching:
over an interval in which the modelled network and the data produce exactly
the same number of spikes, some sort of pairing could possibly be done, but
once the number of spikes diverges, identifying a given spike with a given
expected spike is much more problematic. Various sorts of
sliding-window comparisons could be made but this introduces a significant ad-hoc
component in the size and shape of the sliding window - e.g. all spikes could be
convolved with a Gaussian kernel but what should the width and gain of the kernel
be? The outlook here was less definite. Various metrics were proposed, with the
overall idea that there ought to be some sort of distance metric between the
dataset and the model, but what this distance metric ought to be remained the
subject of further work.
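
The Gaussian-kernel comparison mentioned above can be sketched as follows. This is one illustrative choice, not a metric the group settled on: each spike is replaced by a Gaussian bump, and the distance is the L2 norm of the difference between the two smoothed trains, evaluated in closed form. The function names and the parameters `sigma` and `gain` are hypothetical. No spike-to-spike pairing is needed, so the two trains may contain different numbers of spikes.

```python
import math


def kernel_overlap(delta, sigma, gain=1.0):
    """Integral over time of the product of two Gaussian bumps
    gain * exp(-t^2 / (2 sigma^2)) whose centres are delta apart."""
    return (gain * gain * sigma * math.sqrt(math.pi)
            * math.exp(-delta * delta / (4.0 * sigma * sigma)))


def smoothed_spike_distance(spikes_a, spikes_b, sigma, gain=1.0):
    """L2 distance between two spike trains after Gaussian smoothing."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")

    def cross(xs, ys):
        # Sum of kernel overlaps over every pair of spikes from xs and ys.
        return sum(kernel_overlap(x - y, sigma, gain) for x in xs for y in ys)

    squared = (cross(spikes_a, spikes_a) + cross(spikes_b, spikes_b)
               - 2.0 * cross(spikes_a, spikes_b))
    return math.sqrt(max(squared, 0.0))  # guard against rounding below zero


if __name__ == "__main__":
    data = [10.0, 20.0, 30.0]   # hypothetical recorded spike times (ms)
    model = [12.0, 21.0]        # hypothetical model spike times (ms)
    for sigma in (1.0, 5.0, 50.0):
        print(sigma, smoothed_spike_distance(data, model, sigma))
```

Running the usage example with different values of `sigma` gives different distances for the same pair of trains, which is exactly the ad-hoc dependence on kernel width and gain that the group identified as a problem.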


Alexander Rast


Enrico Calabrese
Lukas Cavigelli
Gabriel Andres Fonseca Guerra
Richard George
Michael Hopkins
Giacomo Indiveri
Aleksandar Kodzhabashev
Laura Kriener
Gengting Liu
Shih-Chii Liu
Christian Mayr
Manu Nair
Felix Neumärker
Melika Payvand
Mihai Alexandru Petrovici
Alexander Rast
Alan Stokes
Evangelos Stromatias
Nikolaos Vasileiadis
Bernhard Vogginger
Qi Xu