
Post-generation ASR Hypothesis Reranking Utilizing Visual Contexts (3.0)
Marcus Au * , Tommy Shu * , Yirui Song , Zitong Huang , Catherine Lu
* Project Lead
Speech recognition models can produce multiple candidate transcripts for a single utterance. Often the best transcript is not the one the model ranks highest, but a different hypothesis among the top-k candidates it generates. Typical approaches re-score these hypotheses with a language model to maximize Automatic Speech Recognition (ASR) performance. This project aims to improve the re-ranking process using an additional modality: visual input, specifically in the embodied agent domain. It extends prior work by CAIS++ on speech recognition for embodied agents, namely the ASR pipeline proposed in 'Multimodal Speech Recognition for Language-Guided Embodied Agents' (Chang et al.).
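The core idea of post-generation reranking can be sketched as follows. This is a minimal illustrative example, not the project's actual implementation: the `context_scorer`, the interpolation weight `alpha`, and the toy data are all hypothetical stand-ins for a real language-model or visual-context score.

```python
def rerank(hypotheses, context_scorer, alpha=0.7):
    """Rerank top-k ASR hypotheses by a weighted combination of the
    ASR model's own log-probability and an external score (e.g. one
    derived from visual context).

    hypotheses: list of (text, asr_log_prob) pairs.
    context_scorer: callable mapping a transcript to a log-score.
    alpha: interpolation weight between ASR and context scores.
    """
    rescored = [
        (text, alpha * asr_lp + (1 - alpha) * context_scorer(text))
        for text, asr_lp in hypotheses
    ]
    # Highest combined score first.
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored


if __name__ == "__main__":
    # Toy top-k list: the ASR model slightly prefers the wrong homophone.
    topk = [
        ("pick up the pear", -1.2),
        ("pick up the pair", -1.0),
    ]
    # Toy visual scorer: reward transcripts mentioning a visible object.
    visible_objects = {"pear"}
    scorer = (
        lambda t: 0.0 if any(w in visible_objects for w in t.split()) else -5.0
    )
    print(rerank(topk, scorer)[0][0])  # the visual context flips the ranking
```

Here the ASR model alone would output "pick up the pair", but the visual evidence that a pear is in view lets the reranker recover the correct transcript.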