IEEE TASLP  ·  2026

MeMo: Attentional Momentum for Real-time Audio-visual Speaker Extraction under Impaired Visual Conditions

Look once to hear — keep attention on the target speaker even when visual cues vanish.

Junjie Li* Wenxuan Wu* Shuai Wang Zexu Pan Kong Aik Lee Helen Meng Haizhou Li
The Hong Kong Polytechnic University  ·  The Chinese University of Hong Kong, Shenzhen  ·  The Chinese University of Hong Kong
* Equal contribution    † Corresponding author

MeMo is a generalizable concept that plugs into various model architectures — it consistently keeps attention on the target speaker even without visual information.

Overview

Abstract

Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate a target speaker's voice from multi-speaker environments by leveraging visual cues as guidance. However, the performance of AV-TSE systems heavily relies on the quality of these visual cues. In extreme scenarios where visual cues are missing or severely degraded, the system may fail to accurately extract the target speaker. In contrast, humans can maintain attention on a target speaker even in the absence of explicit auxiliary information. Motivated by such human cognitive ability, we propose a novel framework called MeMo, which incorporates two adaptive memory banks to store attention-related information. MeMo is specifically designed for real-time scenarios: once initial attention is established, the system maintains attentional momentum over time, even when visual cues become unavailable. Comprehensive experiments show that our framework achieves SI-SNR improvements of at least 2 dB over the corresponding baseline.
How it works

Method

Two adaptive memory banks — a speaker bank and a contextual bank — turn the model's own past outputs into a persistent attention signal.

Fig. 1: Overview of the MeMo framework. MeMo consists of two types of adaptive memory banks: the speaker bank and the contextual bank.
Fig. 2: The overall pipeline of attentional momentum during inference. The first stage uses only visual cues; later stages also reuse the estimated speech from previous stages — so even when visual cues are missing, MeMo keeps attention on the target speaker.
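The feedback loop described in Fig. 2 can be sketched as a streaming loop in which each stage's estimate is written back into the two banks. This is a minimal sketch under stated assumptions, not the authors' implementation: `MemoryBank`, `toy_extractor`, and `run_streaming` are hypothetical names, the FIFO update rule is assumed, the two-number speaker summary is a stand-in for a learned speaker representation, and the extractor is a placeholder that returns its input unchanged.

```python
from collections import deque
from typing import Optional

import numpy as np


class MemoryBank:
    """Fixed-capacity FIFO store (hypothetical stand-in for an adaptive memory bank)."""

    def __init__(self, capacity: int) -> None:
        self.items: deque = deque(maxlen=capacity)

    def write(self, item: np.ndarray) -> None:
        self.items.append(item)

    def read(self) -> Optional[np.ndarray]:
        # Return None while empty so the caller can tell "no memory yet".
        if not self.items:
            return None
        return np.mean(np.stack(list(self.items)), axis=0)


def toy_extractor(chunk, visual, speaker_mem, context_mem):
    """Placeholder for the extraction network: checks cues, returns the chunk unchanged."""
    if visual is None and speaker_mem is None and context_mem is None:
        raise ValueError("no attention cue available for this chunk")
    return chunk.copy()


def run_streaming(chunks, visual_cues, extractor, speaker_bank, context_bank):
    """Process chunks in order, feeding each estimate back into both banks."""
    outputs, cue_log = [], []
    for chunk, visual in zip(chunks, visual_cues):
        cues = {
            "visual": visual,                     # None when the visual cue is missing
            "speaker_bank": speaker_bank.read(),  # None at the first stage
            "context_bank": context_bank.read(),  # None at the first stage
        }
        cue_log.append([name for name, value in cues.items() if value is not None])
        estimate = extractor(chunk, cues["visual"], cues["speaker_bank"], cues["context_bank"])
        # Assumed update rule: store a crude summary and the raw estimate.
        speaker_bank.write(np.array([estimate.mean(), estimate.std()]))
        context_bank.write(estimate)
        outputs.append(estimate)
    return outputs, cue_log
```

With visual cues given as `[v, None, None]`, the first stage relies on the visual cue alone and the later stages fall back on the two banks, which is the behaviour the caption describes.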
Benchmark

Data Simulation

We simulate realistic visual impairments and dynamic speaker-switching conversations to stress-test robustness.

Fig. 3: Examples of visual impairments: missing visual frames, lip concealment, and low resolution (Gaussian noise, blurring, or down-sampling).
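The impairment types listed in Fig. 3 can be approximated with a few NumPy operations on a stack of grayscale frames. This is an illustrative sketch only: the function names, the `(T, H, W)` layout with values in `[0, 1]`, the choice of the lower half of the frame as the concealed lip region, the box filter used for blurring, and all default parameters are assumptions, not the settings used to build the benchmark.

```python
import numpy as np


def visual_missing(frames: np.ndarray) -> np.ndarray:
    """Replace every frame with zeros (no visual information at all)."""
    return np.zeros_like(frames)


def lip_concealment(frames: np.ndarray) -> np.ndarray:
    """Black out the lower half of each frame (assumed location of the lips)."""
    out = frames.copy()
    out[:, frames.shape[1] // 2:, :] = 0.0
    return out


def add_gaussian_noise(frames: np.ndarray, sigma: float = 0.1, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise and clip back to [0, 1]."""
    rng = np.random.default_rng(seed)
    return np.clip(frames + rng.normal(0.0, sigma, frames.shape), 0.0, 1.0)


def box_blur(frames: np.ndarray, k: int = 3) -> np.ndarray:
    """Blur with a k x k mean filter (k odd); a simple stand-in for any blur kernel."""
    pad = k // 2
    padded = np.pad(frames, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    height, width = frames.shape[1:]
    out = np.zeros(frames.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + height, dx:dx + width]
    return out / (k * k)


def downsample(frames: np.ndarray, factor: int = 4) -> np.ndarray:
    """Keep every `factor`-th pixel, then repeat pixels back to the original size."""
    low = frames[:, ::factor, ::factor]
    up = np.repeat(np.repeat(low, factor, axis=1), factor, axis=2)
    return up[:, :frames.shape[1], :frames.shape[2]]
```

Every function returns an array with the same shape as its input, so an impaired clip can be passed to a visual front end in place of the clean one.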
Fig. 4: An example of target-speaker switching during a conversation. Attention shifts from one target speaker to another at a certain point, while interfering speech and background noise persist.
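One way to read Fig. 4 is that the reference signal follows the first target speaker up to a switch point and the second target speaker afterwards, while interfering speech and noise run through the whole mixture. The sketch below builds such an example under that reading; `simulate_switch`, its arguments, and the simple additive mixing are assumptions for illustration and may differ from how the actual dataset was simulated.

```python
import numpy as np


def simulate_switch(spk_a, spk_b, interferer, noise, switch_sample):
    """Build a mixture whose target changes from spk_a to spk_b at switch_sample.

    All inputs are 1-D arrays of equal length. Returns (mixture, target).
    """
    length = len(spk_a)
    if not (len(spk_b) == len(interferer) == len(noise) == length):
        raise ValueError("all signals must have the same length")
    if not 0 <= switch_sample <= length:
        raise ValueError("switch_sample must lie within the signal")
    # The target follows speaker A before the switch and speaker B after it.
    target = np.concatenate([spk_a[:switch_sample], spk_b[switch_sample:]])
    # Interfering speech and background noise persist over the whole clip.
    mixture = target + interferer + noise
    return mixture, target
```

For a 16 kHz clip, `switch_sample=32000` would place the switch two seconds in.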
Experiments

Results on Online Speech Separation

MeMo delivers consistent gains across impairment types, ratios, backbones, and speaker-switching scenarios.
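The abstract reports gains as SI-SNR improvement. For reference, the commonly used definition of scale-invariant SNR can be sketched as follows; the function names and the `eps` stabilizer are choices made for this example, and the evaluation scripts behind the reported numbers may differ in detail.

```python
import numpy as np


def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio in dB for 1-D signals."""
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return float(10.0 * np.log10(ratio))


def si_snr_improvement(estimate: np.ndarray, target: np.ndarray, mixture: np.ndarray) -> float:
    """SI-SNR of the estimate minus SI-SNR of the unprocessed mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate by any non-zero constant leaves the score essentially unchanged.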

Contextual bank vs speaker bank
Comparative results between the contextual bank and the speaker bank.
Comparison with other separation models
Performance comparison of MeMo against existing separation models.
Different impaired ratios
Performance of different systems across impaired ratios (online mode).
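An impaired ratio can be understood as the fraction of video frames that carry a degraded visual cue. The sketch below zeroes out one contiguous span of frames covering that fraction; `impair_span`, the contiguous placement, the random start position, and the use of zeroed frames as the impairment are all assumptions for illustration, since the page does not specify how impaired frames are positioned.

```python
import numpy as np


def impair_span(frames: np.ndarray, ratio: float, seed: int = 0):
    """Zero out a contiguous span covering `ratio` of the frames.

    frames has shape (T, H, W). Returns (impaired_frames, mask), where mask
    is a boolean array of length T marking the impaired frames.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    total = frames.shape[0]
    count = int(round(ratio * total))
    rng = np.random.default_rng(seed)
    # Pick a start index so the span fits entirely inside the clip.
    start = int(rng.integers(0, total - count + 1))
    mask = np.zeros(total, dtype=bool)
    mask[start:start + count] = True
    out = frames.copy()
    out[mask] = 0.0
    return out, mask
```

Sweeping `ratio` over a grid of values and scoring each setting would give one curve per system.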
Different visual impairments
Performance under different visual impairments.
Switching dataset
Performance on the speaker-switching dataset.
Listen

Speech Separation Demo

Watch the mixture, then compare the extracted target speakers. Best with headphones.

Normal Video (two examples)

Each example: Mixture, Ground Truth 1, Ground Truth 2, Estimate 1 (MeMo), Estimate 2 (MeMo)

Impaired Video (two examples)

Each example: Mixture, Ground Truth 1, Ground Truth 2, Estimate 1 (MeMo), Estimate 2 (MeMo)

Speaker Switching

Mixture, Ground Truth, Estimate (MeMo), Estimate (TDSE baseline)