
Segmental Attention Decoding With Long Form Acoustic Encodings

Published: December 16, 2025 | arXiv ID: 2512.14652v1

By: Pawel Swietojanski, Xinwei Li, Mingbin Xu, and more

BigTech Affiliations: Apple

Potential Business Impact:

Lets AI understand long speech without breaking it up.

Business Areas:
Audio Media and Entertainment, Music and Audio

We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting the limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish: the model loses the ability to order acoustic encodings because the keys and values in cross-attention are permutation-invariant. We propose four modifications:

(1) injecting explicit absolute positional encodings into cross-attention for each decoded segment;
(2) long-form training with extended acoustic context, to eliminate implicit absolute position encoding;
(3) segment concatenation, to cover the diverse segmentations needed during training; and
(4) semantic segmentation, to align AED-decoded segments with training segments.

We show these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder.
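The permutation-invariance problem the abstract describes, and the effect of fix (1), can be illustrated with a minimal sketch. This is not the paper's implementation; the attention function, sinusoidal encoding, and the choice to add the encodings to both keys and values are generic textbook assumptions for illustration only. Without positional information, shuffling the acoustic frames leaves the cross-attention output unchanged; after injecting absolute positional encodings, frame order matters again.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_pe(n, d):
    # Standard sinusoidal absolute positional encodings (an assumption;
    # the paper may use a different encoding scheme).
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def cross_attention(q, k, v, pe=None):
    # Single-head dot-product cross-attention; optionally inject
    # absolute positional encodings into the keys and values.
    if pe is not None:
        k = k + pe
        v = v + pe
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
T, d = 8, 16
q = rng.normal(size=(1, d))   # one decoder query
k = rng.normal(size=(T, d))   # acoustic encodings as keys
v = rng.normal(size=(T, d))   # acoustic encodings as values
perm = rng.permutation(T)

# Without positional encodings, permuting the frames changes nothing:
out_plain = cross_attention(q, k, v)
out_perm  = cross_attention(q, k[perm], v[perm])
assert np.allclose(out_plain, out_perm)

# With absolute positional encodings injected, frame order matters:
pe = sinusoidal_pe(T, d)
out_pe      = cross_attention(q, k, v, pe=pe)
out_pe_perm = cross_attention(q, k[perm], v[perm], pe=pe)
assert not np.allclose(out_pe, out_pe_perm)
```

The first assertion holds because permuting the key/value rows only reorders the terms of the attention-weighted sum; the second fails to be close because the positional encodings stay in their original order while the frames move, which is exactly the ordering signal fix (1) restores.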

Country of Origin
🇺🇸 United States

Page Count
5 pages

Category
Electrical Engineering and Systems Science:
Audio and Speech Processing