
Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis

Published: January 20, 2026 | arXiv ID: 2601.14227v1

By: Theodore Aptekarev, Vladimir Sokolovsky, Gregory Furman

Potential Business Impact:

Helps doctors hear lung problems better with AI.

Business Areas:

Speech Recognition Data and Analytics, Software

Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201, which demonstrated high sensitivity in classifying respiratory sounds. In this work, we (i) adapt the Audio Spectrogram Transformer (AST) for respiratory sound analysis and (ii) evaluate a multimodal Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata. AST is initialized from publicly available weights and fine-tuned on a medical dataset containing hundreds of recordings per diagnosis. The VLM experiment uses a compact Moondream-type model that processes spectrogram images alongside a structured text prompt (sex, age, recording site) to output a JSON-formatted diagnosis. Results indicate that AST achieves approximately 97% accuracy with an F1-score around 97% and ROC AUC of 0.98 for asthma detection, significantly outperforming both the internal CNN baseline and typical external benchmarks. The VLM reaches 86-87% accuracy, performing comparably to the CNN baseline while demonstrating the capability to integrate clinical context into the inference process. These results confirm the effectiveness of self-attention for acoustic screening and highlight the potential of multimodal architectures for holistic diagnostic tools.
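The VLM step described above pairs a structured text prompt (sex, age, recording site) with a JSON-formatted diagnosis. The abstract gives neither the prompt wording nor the JSON schema, so the following is only a minimal sketch of how that input and output handling might look. The function names `build_prompt` and `parse_diagnosis`, the prompt text, the `"diagnosis"` key, and the label set are all assumptions made for illustration, not the authors' implementation, and the model call itself is left out.

```python
import json
from typing import Optional, Set


def build_prompt(sex: str, age: int, site: str) -> str:
    """Compose a structured text prompt from patient metadata.

    The wording is hypothetical; the source only states that sex, age,
    and recording site are supplied alongside the spectrogram image.
    """
    return (
        f"Patient sex: {sex}. Age: {age}. Recording site: {site}. "
        'Return the diagnosis as JSON: {"diagnosis": "<label>"}.'
    )


def parse_diagnosis(raw: str, allowed: Set[str]) -> Optional[str]:
    """Extract a diagnosis label from raw model output.

    Returns None when no JSON object is found, the JSON is malformed,
    or the label is not in the allowed set (assumed schema).
    """
    # Take the outermost {...} span so surrounding chatter is ignored.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    label = payload.get("diagnosis") if isinstance(payload, dict) else None
    if not isinstance(label, str):
        return None
    label = label.strip().lower()
    return label if label in allowed else None


if __name__ == "__main__":
    labels = {"asthma", "healthy"}  # hypothetical label set
    print(build_prompt("female", 12, "trachea"))
    print(parse_diagnosis('Answer: {"diagnosis": "Asthma"}', labels))
```

Validating the parsed label against a closed set is one way to keep free-form model output from being counted as a prediction; outputs that fail to parse can then be scored as errors or retried, depending on the evaluation protocol.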

Explainable Multi-Modal Deep Learning for Automatic Detection of Lung Diseases from Respiratory Audio Signals

Sound

Helps doctors find lung problems using breathing sounds.

29 Nov 2025 0

91%

A Multi-Stage Hybrid CNN-Transformer Network for Automated Pediatric Lung Sound Classification

Signal Processing

Helps doctors find sick kids' lung problems.

27 Jul 2025 0

91%

Geometry-Aware Optimization for Respiratory Sound Classification: Enhancing Sensitivity with SAM-Optimized Audio Spectrogram Transformers

Audio and Speech Processing

Helps doctors hear lung problems better.

27 Dec 2025 1


Country of Origin

🇮🇱 Israel

Page Count

7 pages
