We first looked at the perception of pure sounds, then at that of communicative sounds: vocalizations, birdsong and speech. The efficient coding hypothesis (Barlow H.B., 1961) holds that sensory systems derive their processing principles from adaptation to the natural environment, and consequently reduce, through statistical analysis, the redundancy that characterizes it. This prediction has driven many advances in our understanding of the visual system. Over the last ten years, it has also inspired research into the perception of natural acoustic environments, aimed at identifying the characteristics that the auditory system extracts from sound sequences and at understanding how they are analyzed. Natural sounds, like natural images, exhibit regularities, manifested for example by temporal correlations such as amplitude covariations across different frequency bands.
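The cross-band amplitude covariation mentioned above can be illustrated with a toy signal. The sketch below (plain NumPy; the carrier frequencies, the 3 Hz modulation rate and the rectify-and-smooth envelope are all illustrative choices, not part of the cited studies) shows that two frequency bands sharing a slow amplitude modulation have highly correlated envelopes:

```python
import numpy as np

fs, dur = 16000, 2.0
t = np.arange(int(fs * dur)) / fs

# Shared slow amplitude modulation, as in many natural textures (toy example)
am = 1.0 + 0.8 * np.sin(2 * np.pi * 3 * t)
band1 = am * np.sin(2 * np.pi * 500 * t)    # carrier at 500 Hz
band2 = am * np.sin(2 * np.pi * 2000 * t)   # carrier at 2 kHz

def envelope(x, win=400):
    """Crude amplitude envelope: rectify, then 25 ms moving average."""
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

# Co-modulated bands yield a near-perfect envelope correlation
r = np.corrcoef(envelope(band1), envelope(band2))[0, 1]
print(f"cross-band envelope correlation: {r:.2f}")
```

A real analysis would use a cochlear filterbank rather than pure-tone carriers, but the statistical regularity being exploited is the same.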
The work of Josh H. McDermott and Eero Simoncelli (McDermott J.H., Simoncelli E.P., Neuron, 2011) on the perception of textures in natural sounds (rain, crackling fire, etc.) caught our attention. It draws on studies of visual textures conducted since the late 1980s. Visual textures are defined as the superposition of a large number of patterns repeated across an image in a more or less regular and more or less complex way; this work introduced the notion of the statistical characteristics of images. Sound textures, analogously, are sounds produced by the superposition of a large number of simple sound patterns, which combine to collectively generate statistical properties. These statistical characteristics enable the neural coding of a reduced amount of information, which can then be transferred quickly and efficiently. Eero Simoncelli and his colleagues developed a "bioinspired" model of sound processing, which enabled them to identify the statistical characteristics of sound textures that are most informative for distinguishing between textures. These are mainly the marginal moments and correlations of the amplitude modulations of sound signals in various frequency bands and sub-bands. The validity of the model and of the selected statistical parameters was established through recognition tests on synthetic sounds: white noise whose selected statistical parameters are adjusted to the values that the model extracts from the corresponding natural sound texture. Overall, the results show that the auditory system can recognize sound textures using only statistical information. It is easy to see why such a parsimonious representation is useful for memorizing sound sequences. Moreover, these more abstract and compressed brain representations should facilitate sensory recognition processes and their integration into multisensory representations.
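The synthesis logic can be sketched in miniature. This is not McDermott and Simoncelli's actual algorithm, which iteratively imposes a much richer statistic set through a cochlear filterbank; it is a single-band toy in which white noise is given a random slow modulator rescaled so that the synthetic envelope reproduces two marginal moments (mean and variance) measured on a hypothetical "natural" texture:

```python
import numpy as np

fs, n = 16000, 32000
t = np.arange(n) / fs
rng = np.random.default_rng(1)

def envelope(x, win=320):
    """Rectify-and-smooth amplitude envelope (20 ms moving average)."""
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

# Hypothetical "natural texture": noise carrying a slow 2 Hz amplitude modulation
target = (1 + 0.9 * np.sin(2 * np.pi * 2 * t)) * rng.standard_normal(n)
te = envelope(target)

# Synthesis sketch: build a slow random modulator, then rescale it so the
# synthetic envelope matches the target envelope's mean and variance
# (two of the model's marginal moments)
slow = np.convolve(rng.standard_normal(n), np.ones(4000) / 4000, mode="same")
c = np.sqrt(2 / np.pi)  # mean rectified amplitude of unit Gaussian noise
mod = (slow - slow.mean()) / slow.std() * (te.std() / c) + te.mean() / c
synthetic = np.clip(mod, 0, None) * rng.standard_normal(n)

se = envelope(synthetic)
print(f"envelope mean ratio: {se.mean() / te.mean():.2f}, "
      f"std ratio: {se.std() / te.std():.2f}")
```

The synthetic sound shares the imposed statistics with the target while having an entirely different fine structure, which is exactly what makes such stimuli useful for testing statistical recognition.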
Alongside a conception of acoustic signal processing that attributes the central role in recognition and memorization to the signal's fine spectrotemporal structure, another thus emerges that confers this role on its statistical regularities (here captured by seven parameters). The same authors later substantiated and generalized their findings (McDermott J.H., Schemitsch M. and Simoncelli E.P., Nature Neuroscience, 2013). While discrimination of two samples of the same synthetic texture, generated by applying the model to a natural sound texture, is possible when listening is limited to around fifty milliseconds, it disappears when listening is prolonged. Conversely, discrimination of two different synthetic sound textures improves with listening time. These results are consistent with the idea of averaging by the auditory system. They also explain the well-established ability to discriminate between two white noises, provided that listening time is limited to around 100 milliseconds. When acoustic signals are dense and long, listeners lose access to the details of their spectrotemporal structure and become dependent on their statistical representation. We discussed the benefit of the loss of information that accompanies the statistical processing of sound sequences.
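The averaging account can be mimicked with a toy simulation (plain NumPy; the "textures" are amplitude-modulated noises differing only in modulation depth, and a single summary statistic stands in for the model's full parameter set): for two samples of the same texture, the statistic gap is large at 50 ms and collapses over long excerpts, whereas two different textures remain separated however long one listens.

```python
import numpy as np

fs = 16000

def excerpt(depth, dur, seed):
    """Toy texture sample: Gaussian noise with 2 Hz amplitude modulation
    of the given depth; a random phase plays the role of a fresh sample."""
    r = np.random.default_rng(seed)
    t = np.arange(int(fs * dur)) / fs
    am = 1 + depth * np.sin(2 * np.pi * 2 * t + r.uniform(0, 2 * np.pi))
    return am * r.standard_normal(len(t))

def stat(x):
    # One summary statistic: std of the rectified signal (a crude
    # stand-in for the model's envelope moments)
    return np.abs(x).std()

def mean_gap(depth_a, depth_b, dur, pairs=20):
    """Average statistic gap between pairs of independent samples."""
    gaps = [abs(stat(excerpt(depth_a, dur, k)) -
                stat(excerpt(depth_b, dur, 100 + k))) for k in range(pairs)]
    return float(np.mean(gaps))

same_short = mean_gap(0.9, 0.9, 0.05)  # same texture, 50 ms excerpts
same_long = mean_gap(0.9, 0.9, 2.0)    # same texture, 2 s: statistics converge
diff_long = mean_gap(0.9, 0.2, 2.0)    # different textures stay apart
print(same_short, same_long, diff_long)
```

With longer excerpts the statistic estimates of same-texture samples converge (the basis for the loss of discrimination), while different textures keep distinct statistics, mirroring the psychophysical pattern described above.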