P20, Session 2 (Friday 12 January 2024, 09:00-11:30)
Reconstruction of speech from a few channels using a single-speaker neural vocoder
Background: Speech has been shown to remain intelligible when it is noise-vocoded with only three or four channels. In theory, neural vocoders should therefore be able to generate high-quality speech from such small spectral representations when phonetic content is the only variable that changes during training, while other attributes such as speaker identity and speaking style are held constant.
Method: In the present study, WaveGlow, a neural vocoder based on normalizing flows, was trained on the LJ-Speech corpus using Mel spectrograms with two, three, four, eight, or the typically used eighty channels. In an online experiment, twenty naive participants evaluated the sounds produced by WaveGlow and sounds processed by a noise vocoder with the same numbers of channels. In a first run, they rated speech quality on a five-category scale (excellent, good, fair, bad, poor); in a second run, they reported the percentage of words they recognized. Each condition was presented eight times in each run. The assignment of speech segments to conditions was randomized between participants, and no speech segment was repeated within a participant.
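To give a sense of how coarse these reduced representations are, the following minimal sketch computes the edge frequencies of a Mel filterbank for each channel count used in the study. It is not the authors' code: the HTK-style Mel formula, the 0-8000 Hz frequency range, and the names `hz_to_mel`, `mel_to_hz`, and `mel_band_edges` are assumptions chosen for illustration.

```python
import math

def hz_to_mel(f_hz):
    """Convert Hz to Mel (HTK-style formula; one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_channels, f_min=0.0, f_max=8000.0):
    """Return n_channels + 2 edge frequencies (Hz), equally spaced on the Mel scale.

    Triangular filter i would span edges[i] .. edges[i + 2].
    The frequency range is an assumption, not a value from the abstract.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_channels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_channels + 2)]

# Channel counts taken from the abstract.
for n in (2, 3, 4, 8, 80):
    edges = mel_band_edges(n)
    widest = max(b - a for a, b in zip(edges, edges[2:]))
    print(f"{n:2d} channels: widest band spans {widest:.0f} Hz")
```

With only two or three channels, each band covers a large part of the spectrum, so the vocoder has to infer most of the spectral detail from what it learned in training.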
Results: WaveGlow produced sounds with considerably higher sound quality and intelligibility than noise vocoders with the same number of channels, in both subjective and objective ratings. The participants rated the sound quality of the sounds produced by WaveGlow one to two categories higher than those produced by the noise vocoder. They reported recognizing 48 % of the words for two channels and 81 % for three channels with WaveGlow, but only 74 % with an eight-channel noise vocoder. The Short-Term Objective Intelligibility (STOI) metric showed a pattern similar to that of the mean opinion scores for sound quality.
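A mean opinion score is the arithmetic mean of category ratings after they are mapped to numbers. The sketch below shows that calculation under one assumption: the categories are scored 5 to 1 in the order the abstract lists them. The ratings in the example are made up for illustration and are not data from the study; `SCORES` and `mean_opinion_score` are hypothetical names.

```python
from statistics import mean

# Assumed mapping: categories scored 5..1 in the order listed in the abstract.
SCORES = {"excellent": 5, "good": 4, "fair": 3, "bad": 2, "poor": 1}

def mean_opinion_score(ratings):
    """Average the numeric scores of a list of category labels."""
    return mean(SCORES[r] for r in ratings)

# Made-up ratings for illustration only; not data from the study.
example = ["good", "fair", "fair", "good"]
print(mean_opinion_score(example))  # 3.5
```

On this scale, a difference of "one to two categories" corresponds to a gap of 1 to 2 points between the mean scores of two conditions.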
Conclusions: Altogether, these results show that the neural vocoder successfully learned some general features of speech that are useful for naive listeners. However, the sounds produced by WaveGlow from eight channels were only of "fair" quality, despite near-perfect intelligibility and twice the training time needed for eighty channels. The reduced representation may therefore not be sufficient for use in text-to-speech systems.