
Toward Robust Multimodal Speech Recognition for Embodied Agents
Allen Chang *, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn
* Project Lead
In Vision-and-Language Navigation (VLN) tasks, an embodied agent must navigate a 3D environment by combining natural language instructions given by an oracle with visual observations of its surroundings. Because of the difficulty of the task, VLN agents are typically trained under the assumption that the oracle provides well-formed instructions; for text-based instructions, this means commands contain few content errors and follow standard grammar. Adding speech introduces another layer of complexity: because speech input varies widely between oracles, VLN agents can struggle to decode meaning from spoken instructions, which makes training a VLN agent on speech particularly challenging. However, a VLN agent also has access to visual observations of its environment, which can help it infer plausible meaning when an instruction is ambiguous. Our solution builds on this intuition: we develop a robust Automatic Speech Recognition (ASR) model that uses visual context to recover semantic meaning from corrupted commands. Ultimately, our project aims to make VLN agents more robust to non-standard speech instructions.
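
As a rough illustration only (not the project's actual architecture), the sketch below shows one way an ASR model could condition on visual context: speech frames cross-attend to visual features from the agent's current observation before per-frame token logits are decoded. All module names, feature dimensions, and the fusion strategy here are illustrative assumptions.

import torch
import torch.nn as nn

class VisuallyGroundedASR(nn.Module):
    # Toy model: speech frames cross-attend to visual features so that
    # corrupted or ambiguous audio can be disambiguated by what the agent sees.
    def __init__(self, audio_dim=80, visual_dim=512, hidden_dim=256, vocab_size=1000):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.output = nn.Linear(hidden_dim, vocab_size)  # per-frame logits (e.g., for a CTC loss)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, frames, audio_dim), e.g., log-mel speech features
        # visual_feats: (batch, regions, visual_dim), e.g., image-region embeddings
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        fused, _ = self.cross_attn(query=a, key=v, value=v)  # audio queries visual context
        return self.output(self.encoder(a + fused))

if __name__ == "__main__":
    model = VisuallyGroundedASR()
    audio = torch.randn(2, 120, 80)   # 2 utterances, 120 frames of 80-dim features
    visual = torch.randn(2, 49, 512)  # 2 observations, 49 image regions each
    print(model(audio, visual).shape)  # torch.Size([2, 120, 1000])

In practice, the random tensors would be replaced by real speech features and visual embeddings taken from the navigation environment; the fusion mechanism shown is one plausible choice among many.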