MeMo: Attentional Momentum for Real-time Audio-visual Speaker Extraction under Impaired Visual Conditions

Zexu Pan
Kong Aik Lee
Helen Meng
Haizhou Li
*: Equal contribution.
+: Corresponding author.
The Hong Kong Polytechnic University, Hong Kong SAR
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)
The Chinese University of Hong Kong, Hong Kong SAR

MeMo is a generalizable concept that can be integrated into various model architectures. It consistently keeps attention on the target speaker even when visual information is unavailable.


Abstract

Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate a target speaker's voice from multi-speaker environments by leveraging visual cues as guidance. However, the performance of AV-TSE systems relies heavily on the quality of these visual cues. In extreme scenarios where visual cues are missing or severely degraded, the system may fail to extract the target speaker accurately. In contrast, humans can maintain attention on a target speaker even in the absence of explicit auxiliary information. Motivated by this human cognitive ability, we propose a novel framework called MeMo, which incorporates two adaptive memory banks to store attention-related information. MeMo is specifically designed for real-time scenarios: once initial attention is established, the system maintains attentional momentum over time, even when visual cues become unavailable. Comprehensive experiments verify the effectiveness of MeMo: the proposed framework achieves SI-SNR improvements of at least 2 dB over the corresponding baseline.

Fig. 1: Overview of the MeMo framework. MeMo consists of two types of adaptive memory banks: a speaker bank and a contextual bank.
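To make the idea concrete, below is a minimal, hypothetical sketch of such an adaptive memory bank; the class and method names (AdaptiveMemoryBank, update, read) and the FIFO eviction policy are our illustrative assumptions, not the released implementation. The bank caches recent target-speaker embeddings and answers queries with an attention-weighted readout that can stand in for a missing visual cue.

```python
# Hypothetical sketch of an adaptive memory bank; not the authors' API.
import torch
import torch.nn.functional as F

class AdaptiveMemoryBank:
    """Caches recent attention-related embeddings and serves an
    attention-weighted readout when queried."""

    def __init__(self, dim: int, capacity: int = 16):
        self.capacity = capacity
        self.slots = torch.empty(0, dim)  # (num_stored, dim)

    def update(self, embedding: torch.Tensor) -> None:
        # Append the newest embedding; evict the oldest when full (FIFO).
        self.slots = torch.cat([self.slots, embedding.unsqueeze(0)])[-self.capacity:]

    def read(self, query: torch.Tensor) -> torch.Tensor:
        # Scaled dot-product attention over the stored slots.
        scores = F.softmax(self.slots @ query / query.numel() ** 0.5, dim=0)
        return scores @ self.slots  # attention-weighted summary, shape (dim,)

# Usage: refresh the bank with the embedding of the latest extracted
# speech, then query it when the visual cue is impaired.
bank = AdaptiveMemoryBank(dim=256)
bank.update(torch.randn(256))
guidance = bank.read(torch.randn(256))
```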

Fig. 2: The overall pipeline of Attentional Momentum at the inference stage. In the first stage, only visual cues are used as guidance; in subsequent stages, the speech estimated in the previous stage can also serve as guidance. Hence, even when visual cues are missing, MeMo is still able to maintain attention on the target speaker.
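The staged inference above can be summarized as a chunk-wise loop. The sketch below assumes online, chunk-by-chunk processing; all callables (encode_visual, extract, embed) are illustrative stand-ins for the model's actual components, not the released code.

```python
# Hedged sketch of the staged inference loop in Fig. 2.
from typing import Callable, List, Optional

def attentional_momentum_loop(
    mixture_chunks: List,                   # streaming chunks of the mixture
    visual_chunks: List[Optional[object]],  # None marks a missing visual cue
    encode_visual: Callable,                # visual frames -> guidance cue
    extract: Callable,                      # (mixture chunk, cue) -> estimate
    embed: Callable,                        # estimate -> speaker embedding
    bank,                                   # e.g. the AdaptiveMemoryBank above
) -> List:
    estimates, cue = [], None
    for mix, vis in zip(mixture_chunks, visual_chunks):
        if vis is not None:
            cue = encode_visual(vis)               # visual guidance available
        elif estimates:
            cue = bank.read(embed(estimates[-1]))  # self-enrolled momentum
        est = extract(mix, cue)                    # one online extraction step
        bank.update(embed(est))                    # keep the bank up to date
        estimates.append(est)
    return estimates
```

In the first stage the visual cue seeds the attention; from then on, the embedding of the previous estimate keeps the bank fresh, so extraction can continue through visual dropouts.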

Data Simulation

Fig. 3: Examples of visual impairments.
Fig. 4: An example of target-speaker switching during a conversation. Attention shifts from one target speaker to another at a certain point, while interfering speech or background noise persists.
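As a rough illustration of how such conditions might be simulated (our assumption of the recipe, not the paper's released script): a contiguous span of lip frames is zeroed out to mimic occlusion or a lost face track, and the reference target switches speakers mid-conversation while the interference persists.

```python
# Hypothetical data-simulation sketch; function names are ours.
import numpy as np
from typing import Optional

def impair_visuals(frames: np.ndarray, impaired_ratio: float,
                   rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Zero out one contiguous span covering `impaired_ratio` of the
    lip-frame sequence, mimicking occlusion or a lost face track."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(frames)
    span = int(n * impaired_ratio)
    start = int(rng.integers(0, n - span + 1))
    out = frames.copy()
    out[start:start + span] = 0.0
    return out

def make_switching_target(speech_a: np.ndarray, speech_b: np.ndarray) -> np.ndarray:
    """Reference target that switches from speaker A to speaker B at the
    switch point; the interfering speech in the mixture persists throughout."""
    return np.concatenate([speech_a, speech_b])
```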

Results on Online Speech Separation

Comparative results between the contextual bank and the speaker bank.
Performance of different systems under different impaired ratios in online mode.
Performance comparison between MeMo and existing separation models.
Performance under different visual impairments.
Performance on the speaker-switching dataset.

Speech Separation Demo

Normal Video

Example 1:
Mixture
Ground Truth 1
Ground Truth 2
Estimate 1  Estimate 2

Example 2:
Mixture
Ground Truth 1
Ground Truth 2
Estimate 1  Estimate 2

Impaired Video

Example 1:
Mixture
Ground Truth 1
Ground Truth 2
Estimate 1  Estimate 2

Example 2:
Mixture
Ground Truth 1
Ground Truth 2
Estimate 1  Estimate 2

Switched Video

Mixture
Ground Truth 1
Estimate 1 (MeMo)  Estimate 1 (TDSE)