Speech technology in everyday situations – is multimodality the answer?
Speech technology, such as automatic speech recognition (ASR), has seen impressive gains in performance over the past decade. Yet ASR systems remain far less robust than human listeners in the noisy, everyday scenarios that barely disrupt our conversations. Speech is not only something we hear; it is also something we see. In conversation, we seamlessly signal and monitor a multitude of visual cues, including eye gaze, facial expression, hand gestures and head nods, while in parallel interpreting linguistic information and prosody. In a noisy room, we attend to lip movements. In this talk, I examine how we can integrate multimodality to develop more robust speech technology. In particular, I will focus on my group’s recent and ongoing work on audio-visual speech recognition and conversational analysis. I will also discuss how, by truly understanding the nature of speech interaction and by working in multidisciplinary teams, we can build neural architectures better suited to this challenging task.