P42, Session 2 (Friday 12 January 2024, 09:00-11:30)
A phoneme-scale evaluation of multichannel speech enhancement algorithms
Hearing loss significantly impairs speech comprehension, especially when speech is mixed with competing sounds. Signal alterations such as spectral spreading and masking can make it difficult to distinguish phoneme formants whose frequencies overlap. In this context, speech enhancement is a promising means of mitigating the adverse effects of ambient noise on the intelligibility and clarity of spoken language. Recent advances driven by deep learning empirically support the effectiveness of enhancement models in improving speech intelligibility in complex acoustic environments.
Such algorithms are typically evaluated on their ability to restore intelligibility and speech quality for normal-hearing listeners. Consequently, these assessment strategies may not offer relevant insights for individuals with hearing impairments. In particular, models are commonly evaluated at the utterance level, which aggregates errors across diverse phonemes and may therefore overlook phonemic categories that are of particular importance for impaired listeners. Indeed, the influence of noise and its mitigation differ among phonemes, owing to signal-level attributes that stem from their specific production mechanisms in the vocal tract. For impaired individuals, these differences are particularly salient due to reduced spectral and temporal resolution and the occurrence of masking effects.
To address this issue, in this study we perform a comprehensive assessment of speech enhancement algorithms at the phoneme level. We categorize phonemes according to their articulatory classes (e.g., plosives, fricatives, nasals), as this creates groups with similar signal characteristics. We consider four state-of-the-art multichannel speech enhancement models. Using publicly available datasets of clean speech and real-life noise, we simulate noisy mixtures that encompass various spatial conditions. In addition to the commonly used utterance scale, we evaluate the models' performance in terms of distortion, artifacts, and interference reduction at the proposed fine-grained phonemic scale. This highlights the algorithms' effectiveness in reducing noise interference as a function of phoneme-level features.
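The phoneme-scale evaluation described above could be sketched as follows. This is a minimal, hypothetical illustration, not the study's actual pipeline: the phoneme-to-class mapping, the forced-alignment input format, and the simple segment-level SNR metric are all assumptions standing in for the alignment tooling and the distortion/artifact/interference metrics used in the paper.

```python
# Hypothetical sketch: group time-aligned phoneme segments by articulatory
# class and average a segment-level metric within each class. The class
# inventory and the snr_db metric are illustrative assumptions.
import math
from collections import defaultdict

# Assumed articulatory grouping (the study's exact inventory may differ).
PHONEME_CLASS = {
    "p": "plosive", "t": "plosive", "k": "plosive",
    "f": "fricative", "s": "fricative", "sh": "fricative",
    "m": "nasal", "n": "nasal",
}

def snr_db(reference, estimate):
    """Segment-level SNR (dB) between clean reference and enhanced estimate."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise == 0:
        return float("inf")
    return 10 * math.log10(signal / noise)

def phoneme_scale_scores(reference, estimate, alignment):
    """alignment: list of (phoneme, start, end) sample indices,
    e.g. as produced by a forced aligner on the clean utterance."""
    per_class = defaultdict(list)
    for phoneme, start, end in alignment:
        cls = PHONEME_CLASS.get(phoneme)
        if cls is None:
            continue  # skip phonemes outside the tracked classes
        per_class[cls].append(snr_db(reference[start:end], estimate[start:end]))
    # Average the segment scores within each articulatory class.
    return {cls: sum(v) / len(v) for cls, v in per_class.items()}
```

In a full evaluation, `snr_db` would be replaced by the per-segment distortion, artifact, and interference measures, and the per-class averages would be compared across the four enhancement models.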
To summarize, this study tackles the challenges posed by hearing disorders, particularly the intricacies of phonemic decoding processes within the cochlea and the brain. It marks an initial step towards advancing speech enhancement models, as it enables the identification of specific speech components that require further emphasis.