Cortical representations of acoustic stimuli: new concepts from the study of natural sounds

We first looked at the perception of pure tones, then at that of communicative sounds: vocalizations, birdsong and speech. The efficient coding hypothesis (Barlow H.B., 1961) holds that sensory systems derive their signal-processing principles from adaptation to the natural environment and, consequently, use its statistical structure to reduce the redundancy that characterizes it. This hypothesis has been at the origin of many advances in our understanding of the visual system. Over the last ten years, it has inspired research into the perception of natural acoustic environments, with the aim of identifying which characteristics of sound sequences the auditory system extracts and understanding how they are analyzed. Natural sounds, like natural images, exhibit regularities, manifested for example by temporal correlations such as amplitude covariations across different frequency bands.

The work of Josh H. McDermott and Eero Simoncelli (McDermott J.H., Simoncelli E.P., Neuron, 2011) on the perception of textures in natural sounds (rain, crackling fire, etc.) caught our attention. It was inspired by work on visual textures carried out since the late 1980s. Visual textures are defined as the superposition of a large number of patterns that repeat in an image in a more or less regular and more or less complex way. That work introduced the notion of the statistical characteristics of images. Sound textures, correspondingly, are sounds produced by the superposition of a large number of simple sound patterns, which collectively give rise to statistical properties. These statistical characteristics allow a reduced amount of information to be neurally encoded, which can then be transferred quickly and efficiently. Eero Simoncelli and his colleagues developed a "bioinspired" model of sound processing, which enabled them to identify the statistical characteristics that are most informative for distinguishing between different sound textures. These are mainly the marginal moments and the correlations of the amplitude modulations of sound signals in various frequency bands and sub-bands. The validity of the model and of the selected statistical parameters was established by recognition tests on synthetic sounds: white noise on which the selected parameters are imposed, with their values adjusted to those that the model extracts from the corresponding natural sound texture. Overall, the results show that the auditory system can recognize sound textures using statistical information alone. It is easy to see why such a parsimonious representation is useful for memorizing sound sequences. Moreover, these more abstract and compressed brain representations should facilitate sensory recognition processes and their integration into multisensory representations.
Alongside a conception of acoustic signal processing that attributes a central role in recognition and memorization to the fine spectrotemporal structure of the signal, another is emerging that assigns this role to its statistical regularities (here analyzed by 7 parameters). The same authors later substantiated and generalized their findings (McDermott J.H., Schemitsch M. and Simoncelli E.P., Nature Neuroscience, 2013). Two samples of the same synthetic texture, generated by applying the model to a natural sound texture, can be discriminated when listening is limited to around fifty milliseconds, but this ability disappears when listening is prolonged. Conversely, discrimination of two different synthetic sound textures improves with listening time. These results are consistent with the idea that the auditory system averages over time. They provide an explanation for the well-established ability to discriminate between two white noises only when listening time is limited to around 100 milliseconds. When the density of acoustic signals is high and their duration long, the listener loses access to the details of their spectrotemporal structure and becomes dependent on their statistical representation. The benefit of the information loss associated with the statistical processing of sound sequences was discussed.
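The idea of a statistical summary that stabilizes with listening time can be illustrated with a small numerical sketch. It is not the model of McDermott and Simoncelli: the cochlear filter bank is replaced here by crude FFT masking, the band edges are arbitrary, and the summary keeps only a few marginal moments and cross-band correlations of the amplitude envelopes. The function names (`texture_statistics`, `mean_statistic_distance`) and all parameter values are assumptions made for illustration.

```python
import numpy as np

def analytic_envelope(x):
    """Amplitude envelope via the analytic signal (FFT-based Hilbert transform)."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * gain))

def band_envelopes(x, fs, edges):
    """Split x into frequency bands by FFT masking; return one envelope per band."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        band = np.fft.irfft(spectrum * mask, n=len(x))
        envelopes.append(analytic_envelope(band))
    return np.array(envelopes)

def texture_statistics(x, fs, edges):
    """Toy summary: envelope mean, coefficient of variation, skewness per band,
    plus the correlations between envelopes of different bands."""
    env = band_envelopes(x, fs, edges)
    mean = env.mean(axis=1)
    std = env.std(axis=1)
    skew = (((env - mean[:, None]) / std[:, None]) ** 3).mean(axis=1)
    corr = np.corrcoef(env)[np.triu_indices(len(env), k=1)]
    return np.concatenate([mean, std / mean, skew, corr])

def mean_statistic_distance(duration_s, fs=8000, n_pairs=20, seed=0):
    """Average distance between the summaries of two independent excerpts
    drawn from the same noise process, for a given excerpt duration."""
    rng = np.random.default_rng(seed)
    edges = [200, 800, 1600, 3200]  # Hz, arbitrary band edges (assumption)
    n = int(duration_s * fs)
    distances = []
    for _ in range(n_pairs):
        a = texture_statistics(rng.standard_normal(n), fs, edges)
        b = texture_statistics(rng.standard_normal(n), fs, edges)
        distances.append(np.linalg.norm(a - b))
    return float(np.mean(distances))

short = mean_statistic_distance(0.05)  # 50 ms excerpts
long_ = mean_statistic_distance(2.0)   # 2 s excerpts
print(f"mean summary distance, 50 ms excerpts: {short:.3f}")
print(f"mean summary distance, 2 s excerpts:   {long_:.3f}")
```

In this toy setting, the summaries of two excerpts of the same noise differ noticeably at 50 ms and become nearly identical at 2 s. A listener relying only on such time-averaged statistics would therefore find long excerpts of the same texture hard to tell apart, which is the direction of the effect described above.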

The concept of the acoustic object also caught our attention. Auditory objects correspond to sensory experiences: they are perceptual entities that depend on the brain mechanisms available to represent and analyze sensory information. As a result, the concepts of object and object analysis are inseparable. Griffiths and Warren have proposed four conditions required to identify an auditory object (Griffiths T.D., Warren J.D., Nature Reviews Neuroscience, 2004). Object analysis implies:

that it involves elements of the sensory world;
that the neural representation related to this object is separate from that related to other objects;
that the representation of this object is such that it can be generalized and shared (for example, we need to be able to recognize a face from any angle and in any light);
that the acoustic representation of the object can be shared with the representations that emanate from its analysis by other sensory modalities.

In this way, for example, an individual's face and voice can be linked. We discussed a study that tested the idea of acoustic object representation, taking as its object the speech stream of one individual in a cocktail party situation (Ding N. and Simon J.Z., PNAS, 2012). In the chosen situation, the speech of two speakers is mixed. The cortical neuronal response is analyzed by magnetoencephalography, which offers excellent spatiotemporal resolution (around 1 ms and 2-3 mm). Representing an acoustic object requires grouping each speaker's speech into a distinct acoustic stream. The independence of the auditory object is tested by a manipulation that applies to the speech of only one of the two speakers (varying the attention paid to a given speaker's speech, or varying the intensity of one speaker's speech). The hypothesis is that, in certain auditory cortical regions, the representation of the auditory object gives rise to neuronal activity that correlates temporally with the sound envelope of the speech of only one of the two speakers: neuronal activity is paced by, and follows, the rhythm of speech (phase locking). In other cortical regions, on the other hand, the neuronal response should correlate with the overall sound stimulation, i.e. the mixture of the two sound envelopes. The results show that the speech of each of the two speakers is indeed coded separately in the auditory cortex, and that attention to one of the speakers enhances the corresponding coding. The experiments also suggest that the neural representation of a speech stream is stable irrespective of changes in speech intensity, whether the stream is attended or ignored.
The neural representations in the auditory cortex are therefore not simply representations of the physical stimulus (which would be that of the two superimposed utterances), but representations of the sound envelope of each speech stream. These results provide a powerful argument for the notion of an auditory object.
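The logic of the test, comparing a response with the envelope of one speaker, of the other speaker, and of their mixture, can be sketched on simulated data. This is not the analysis pipeline of Ding and Simon: the envelopes are low-passed noise rather than speech, the "response" is a hypothetical signal that follows speaker A with a 100 ms delay plus noise, and the measure is a plain lagged Pearson correlation. All names and parameter values are assumptions made for illustration.

```python
import numpy as np

def slow_envelope(n, fs, rng, cutoff_hz=8.0):
    """Toy speech-like envelope: noise low-passed below cutoff_hz, shifted to be non-negative."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0
    env = np.fft.irfft(spectrum, n=n)
    return env - env.min()

def peak_lagged_correlation(response, envelope, max_lag):
    """Highest Pearson correlation between response and envelope,
    with the response lagging the envelope by 0..max_lag samples."""
    best = -1.0
    for lag in range(max_lag + 1):
        r = np.corrcoef(response[lag:], envelope[:len(envelope) - lag])[0, 1]
        best = max(best, r)
    return best

rng = np.random.default_rng(1)
fs = 100           # Hz, sampling rate of the simulated envelopes and response
n = 60 * fs        # one minute of simulated signal
env_a = slow_envelope(n, fs, rng)  # envelope of speaker A
env_b = slow_envelope(n, fs, rng)  # envelope of speaker B

delay = 10  # samples (100 ms): hypothetical neural latency
# Simulated response of an "object-based" region: it follows speaker A only.
response = np.roll(env_a, delay) + 0.5 * env_a.std() * rng.standard_normal(n)

r_a = peak_lagged_correlation(response, env_a, max_lag=30)
r_b = peak_lagged_correlation(response, env_b, max_lag=30)
r_mix = peak_lagged_correlation(response, env_a + env_b, max_lag=30)
print(f"speaker A: {r_a:.2f}  speaker B: {r_b:.2f}  mixture: {r_mix:.2f}")
```

A response that tracked the physical stimulus would instead correlate best with the mixture envelope. In this simulation the response correlates most strongly with speaker A alone, the pattern that the object-based hypothesis predicts for the regions concerned.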