SPRING 2023

Multimodal Speech Recognition for Language-Guided Embodied Agents (2.0)

Allen Chang * , Xiaoyuan Zhu , Aarav Monga , Seoho Ahn , Tejas Srinivasan * , Jesse Thomason * 
* Project Lead    * Project Advisor

Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context. We train our model on a dataset of spoken instructions, synthesized from the ALFRED task completion dataset, where we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that a text-trained embodied agent successfully completes tasks more often by following transcribed instructions from multimodal ASR models.
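For readers curious how this kind of word-level masking can be simulated, the sketch below zeroes out randomly chosen word-aligned spans of a synthesized waveform. The word boundaries, mask rate, and audio array are illustrative assumptions, not the exact ALFRED-based pipeline described above.

```python
# Minimal sketch: simulate acoustic noise by masking word-aligned spans.
import numpy as np

def mask_spoken_words(waveform, word_spans, mask_rate=0.3, seed=0):
    """Silence randomly chosen word-aligned spans of the waveform.

    waveform   -- 1-D numpy array of audio samples
    word_spans -- list of (start_sample, end_sample) tuples, one per word
    mask_rate  -- fraction of words to mask (illustrative value)
    """
    rng = np.random.default_rng(seed)
    masked = waveform.copy()
    masked_indices = []
    for i, (start, end) in enumerate(word_spans):
        if rng.random() < mask_rate:
            masked[start:end] = 0.0          # zero out the word's samples
            masked_indices.append(i)
    return masked, masked_indices

# Example: a 1-second, 16 kHz clip with three word-aligned spans.
audio = np.random.randn(16000).astype(np.float32)
spans = [(0, 4000), (4000, 9000), (9000, 16000)]
noisy_audio, dropped = mask_spoken_words(audio, spans, mask_rate=0.3)
```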

Computer Vision for Quality Assurance

Jaiv Doshi * , Irika Katiyar * , Jarret Spino , Seena Pourzand , Hilari Fan , Sanya Verma
* Project Lead   

We are partnering with Wintec Industries to introduce an automated system for quality assurance on their manufactured computer modules, including PCBs, SSDs, and other hardware components. Currently, quality assurance is performed mainly through manual inspection, which has a few key limitations: 1) throughput is slow; 2) reliability is variable, especially when accounting for worker fatigue; 3) manual, repetitive labor can be costly. We want to introduce a computer vision system to automatically detect damage to these modules. The project has two main components: 1) consulting with Wintec to advise on the optimal hardware for data collection; 2) using the collected data to construct a deep learning model that recognizes damage and defects.
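As a rough illustration of the modeling component, the sketch below frames defect detection as binary image classification by fine-tuning a pretrained backbone. The class count, input size, and architecture are assumptions for illustration, not Wintec-specific choices.

```python
# Minimal sketch: a defective-vs-OK image classifier from a pretrained backbone.
import torch
import torch.nn as nn
from torchvision import models

def build_defect_classifier(num_classes: int = 2) -> nn.Module:
    # Start from an ImageNet-pretrained backbone and replace the final layer.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_defect_classifier()
dummy_batch = torch.randn(4, 3, 224, 224)   # four RGB images at 224x224
logits = model(dummy_batch)                  # shape: (4, 2)
```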

Indigenous Language Translation with Sparse Data (2.0)

Aryan Gulati * , Leslie Moreno * , Abhinav Gupta , Nathan Huh , Aditya Kumar , Sana Jayaswal , Jonathan May * 
* Project Lead    * Project Advisor

Imperialism has led to the loss of many indigenous cultures and, with them, their languages. Based on the NeurIPS 2022 competition “Second AmericasNLP Competition: Speech-to-Text Translation for Indigenous Languages of the Americas,” this project aims to use machine translation (MT) and automatic speech recognition (ASR) approaches to develop a translator for endangered or extinct indigenous languages. Because data on these languages is sparse, this will involve finding and/or building an appropriately sized corpus and using it to train the MT and ASR models.
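One common approach under data sparsity is to fine-tune a pretrained translation model on whatever parallel text exists. The sketch below shows that loop with a placeholder Spanish-English checkpoint and toy sentence pairs; the checkpoint, data, and hyperparameters are illustrative assumptions, not the competition setup.

```python
# Minimal sketch: fine-tune a pretrained seq2seq model on a tiny parallel corpus.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Helsinki-NLP/opus-mt-es-en"        # placeholder language pair
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy illustrative (source, target) pairs; a real corpus would pair an
# indigenous language with Spanish or English.
pairs = [("hola mundo", "hello world"), ("buenos días", "good morning")]
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for src, tgt in pairs:
    batch = tokenizer(src, text_target=tgt, return_tensors="pt")
    loss = model(**batch).loss    # cross-entropy over target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```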

Zero-Shot Robot Navigation

Leo Zhuang * , Nathaniel Johnson , Pratyush Jaishanker , Rajakrishnan Somou , Matthew Rodriguez , Jonathan Ong , Kaustav Chakraborty * , Somil Bansal * 
* Project Lead    * Project Advisor

Robots face many challenges when navigating unknown environments, such as moving obstacles and uneven terrain. Currently, robots can leverage computer vision to generate optimal waypoints to follow toward their goal. However, this approach fails when robots must navigate around uncommon objects or through close-proximity areas. This project aims to incorporate multimodal sensor input to make navigation in unknown environments more robust.
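A minimal sketch of what multimodal sensor input could mean in practice: a late-fusion network that combines camera and LiDAR features to predict the next waypoint. The encoders, feature dimensions, and output format are illustrative assumptions, not the project's actual architecture.

```python
# Minimal sketch: late fusion of camera and LiDAR features for waypoint prediction.
import torch
import torch.nn as nn

class FusionWaypointNet(nn.Module):
    def __init__(self, img_dim=512, lidar_dim=128):
        super().__init__()
        self.img_encoder = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU())
        self.lidar_encoder = nn.Sequential(nn.Linear(lidar_dim, 64), nn.ReLU())
        self.head = nn.Linear(256 + 64, 2)       # predicts an (x, y) waypoint offset

    def forward(self, img_feat, lidar_feat):
        fused = torch.cat([self.img_encoder(img_feat),
                           self.lidar_encoder(lidar_feat)], dim=-1)
        return self.head(fused)

net = FusionWaypointNet()
waypoint = net(torch.randn(1, 512), torch.randn(1, 128))   # shape: (1, 2)
```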

Predicting Californian Wildfires

Advik Unni * , Sonia Zhang , George Danzelaud , Guillermo Basterra , Brice Patchou
* Project Lead   

As temperatures rise, extreme weather patterns have become increasingly pervasive. Every year, thousands of wildfires are sparked in the state of California, burning millions of acres and causing billions of dollars in property damage. Effective surveillance and prediction can help public service workers prevent and mitigate wildfires. This project aims to use historical fire trends, coupled with air quality metrics and meteorological data, to forecast wildfires in the state of California. This will help lawmakers implement policy to deter fires and allocate resources to high-risk areas.
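As a rough sketch of one way to frame the forecasting task, the example below trains a classifier on tabular features drawn from weather, air quality, and fire-history signals. The feature names and synthetic data are placeholders, not the project's dataset.

```python
# Minimal sketch: wildfire occurrence as binary classification on tabular features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features: [temperature, humidity, wind speed, AQI, days since rain]
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)   # 1 = a wildfire occurred in the region/week

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```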

Emote Recommendation in Live Stream Chats

Jessica Fu * , Joseph Gozon , Lucia Zhang
* Project Lead   

Recommender systems have a large impact on human communication (e.g., autocomplete), shopping (e.g., targeted advertisements), news coverage (e.g., CNN vs. Fox News), and entertainment (e.g., Netflix recommendations). This project aims to teach and implement the mechanisms of recommender systems in the context of emote autocompletion in live stream chats.
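For illustration, the sketch below implements one very simple recommendation mechanism: ranking emotes by how often they co-occur with the words typed so far. The toy chat history and emote names are made up; a real system would use richer models.

```python
# Minimal sketch: co-occurrence-based emote recommendation for a partial message.
from collections import Counter, defaultdict

chat_history = [
    ("that play was insane", "PogChamp"),
    ("insane clutch", "PogChamp"),
    ("so sad he lost", "BibleThump"),
    ("sad ending", "BibleThump"),
]

# Count word -> emote co-occurrences from past messages.
cooccur = defaultdict(Counter)
for text, emote in chat_history:
    for word in text.split():
        cooccur[word][emote] += 1

def recommend_emotes(partial_message, k=2):
    scores = Counter()
    for word in partial_message.split():
        scores.update(cooccur[word])
    return [emote for emote, _ in scores.most_common(k)]

print(recommend_emotes("that was insane"))   # -> ['PogChamp']
```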

Slide Generation for Presentations

Aryan Gulati * , Jayne Bottarini , Claude Yoo , Naina Panjwani , Mia Angelucci
* Project Lead   

We are working on using AI to make the process of creating presentations, a large time sink in team and project communication, more efficient. Our goal is to develop an AI system that generates professional and visually appealing slides from textual input. This system will build on advances in natural language processing and computer vision, augmenting generative models to produce slides that communicate the message effectively, saving time on manual slide creation and allowing users to focus on delivering their message.
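As a sketch of the rendering half of such a pipeline, the example below turns a structured outline (the kind a language model might generate from the user's text) into a deck with python-pptx. The outline content here is a placeholder.

```python
# Minimal sketch: render a structured outline into slides with python-pptx.
from pptx import Presentation

outline = [
    {"title": "Project Overview", "bullets": ["Problem statement", "Goals"]},
    {"title": "Results", "bullets": ["Key findings", "Next steps"]},
]

prs = Presentation()
for item in outline:
    slide = prs.slides.add_slide(prs.slide_layouts[1])   # "Title and Content" layout
    slide.shapes.title.text = item["title"]
    body = slide.placeholders[1].text_frame
    body.text = item["bullets"][0]
    for bullet in item["bullets"][1:]:
        body.add_paragraph().text = bullet

prs.save("draft_deck.pptx")
```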

FALL 2022

Multimodal Speech Recognition for Language-Guided Embodied Agents (1.0)

Allen Chang * , Xiaoyuan Zhu , Aarav Monga , Seoho Ahn , Tejas Srinivasan * , Jesse Thomason * 
* Project Lead    * Project Advisor

Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context. We train our model on a dataset of spoken instructions, synthesized from the ALFRED task completion dataset, where we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that a text-trained embodied agent successfully completes tasks more often by following transcribed instructions from multimodal ASR models.

ProjectX: Stress Recognition for Health Workers using Human Activity Recognition

Jordan Cahoon * , Josheta Srinivasan , Armando Chirinos , Jonathan Qin , Luis Garcia * 
* Project Lead    * Project Advisor

ProjectX is the world’s largest undergraduate machine learning research competition, with teams competing from top universities around the world. The winning team in each of the three subtopic focus areas will be awarded a cash prize of CAD $25,000, and all participants will be invited to attend the annual UofT AI Conference in January 2023, where the ProjectX award ceremony will take place. Last year we had ~800 participants, and our keynote speaker was Geoffrey Hinton.

Computer Vision for Quality Assurance

Eric Cheng * , Jaiv Doshi * , Jarret Spino , Seena Pourzand , Irika Katiyar , Leo Zhuang
* Project Lead   

We are partnering with Wintec Industries to introduce an automated system for quality assurance on their manufactured computer modules, including PCBs, SSDs, and other hardware components. Currently, quality assurance is performed mainly through manual inspection, which has a few key limitations: 1) throughput is slow; 2) reliability is variable, especially when accounting for worker fatigue; 3) manual, repetitive labor can be costly. We want to introduce a computer vision system to automatically detect damage to these modules. The project has two main components: 1) consulting with Wintec to advise on the optimal hardware for data collection; 2) using the collected data to construct a deep learning model that recognizes damage and defects.

Indigenous Language Translation with Sparse Data (1.0)

Aryan Gulati * , Leslie Moreno * , Abhinav Gupta , Nathan Huh , Zaid Abdulrehman
* Project Lead   

Imperialism has led to the loss of many indigenous cultures and, with them, their languages. Based on the NeurIPS 2022 competition “Second AmericasNLP Competition: Speech-to-Text Translation for Indigenous Languages of the Americas,” this project aims to use machine translation (MT) and automatic speech recognition (ASR) approaches to develop a translator for endangered or extinct indigenous languages. Because data on these languages is sparse, this will involve finding and/or building an appropriately sized corpus and using it to train the MT and ASR models.