
Listen to the Speaker in Your Gaze

  • Hongli Yang
  • Xinyi Chen
  • Junjie Li
  • Hao Huang
  • Siqi Cai*
  • Haizhou Li
  • *Corresponding author for this work
  • Shenzhen Research Institute of Big Data
  • Xinjiang University
  • National University of Singapore
  • The Chinese University of Hong Kong, Shenzhen

Research output: Contribution to journal; Conference article; peer-reviewed

Abstract

Attending to a single speaker's voice at a cocktail party is challenging, particularly for individuals with hearing impairments. This paper proposes an eye-controlled target speaker extraction system consisting of an eye-tracker, a face detection model, an Active Speaker Detection (ASD) model, and a Target Speaker Extraction (TSE) model. The eye-tracker captures real-time video together with the listener's gaze. The gaze data then allows the face detection model to locate and isolate the target speaker's face in the video, frame by frame. Using that face as the reference cue, the system separates the target speaker's speech from a multi-talker mixture. Experiments show that the system effectively extracts the target speaker's speech in complex auditory environments while running in real time and maintaining accuracy. A demonstration of the system is available on our website.
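
The abstract does not say how the gaze point is matched to a detected face, so the following is only a minimal sketch of one plausible per-frame rule, under stated assumptions: faces arrive as pixel bounding boxes, the gaze is a single pixel coordinate in the same video frame, and the face whose box lies closest to the gaze point (distance zero when the gaze falls inside the box) is taken as the target. The function name `select_gazed_face` and the `max_dist` tolerance are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def select_gazed_face(
    gaze: Tuple[float, float],
    boxes: Sequence[Box],
    max_dist: float = 100.0,
) -> Optional[int]:
    """Return the index of the face box nearest to the gaze point.

    The distance is 0 when the gaze lies inside a box. Boxes farther
    than max_dist pixels are ignored; None means no face is selected.
    """
    gx, gy = gaze
    best_idx: Optional[int] = None
    best_dist = 0.0
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        # Per-axis distance from the gaze point to the rectangle.
        dx = max(x0 - gx, 0.0, gx - x1)
        dy = max(y0 - gy, 0.0, gy - y1)
        dist = (dx * dx + dy * dy) ** 0.5
        if dist <= max_dist and (best_idx is None or dist < best_dist):
            best_idx, best_dist = i, dist
    return best_idx


# Two faces side by side; the listener looks at the right-hand one.
faces = [(0.0, 0.0, 100.0, 100.0), (200.0, 0.0, 300.0, 100.0)]
target = select_gazed_face((250.0, 50.0), faces)
```

In a pipeline like the one described, the selected index would pick the face crop passed on as the visual reference cue; the `max_dist` tolerance is one assumed way to absorb small gaze-estimation errors near a face boundary.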

Keywords

  • Cocktail Party
  • Eye-Tracker
  • Multi-modal
  • Target Speaker Extraction
