Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning
By: Zubia Naz, Farhan Asghar, Muhammad Ishfaq Hussain, and more
Potential Business Impact:
Helps doctors describe medical pictures faster.
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on the ROCO dataset, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean $\pm$ std over three seeds and include $95\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and a token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, no_repeat_ngram_size $=3$, and maximum length $=128$. The proposed design yields accurate, clinically phrased captions with transparent regional attributions, supporting safe research use with a human in the loop.
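The abstract describes the regional attention module only at a high level. Below is a minimal PyTorch sketch of one plausible realization, assuming a learned per-patch saliency score that gates the Swin encoder tokens before they reach the decoder's cross-attention; the class name, the multiplicative gating form, and the hidden size are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class RegionalAttention(nn.Module):
    """Hypothetical regional attention gate: scores each Swin patch token
    and amplifies diagnostically salient regions before cross-attention."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tokens: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # tokens: (batch, num_patches, dim) patch embeddings from the Swin encoder
        weights = torch.sigmoid(self.score(tokens))   # (B, N, 1) saliency in [0, 1]
        gated = tokens * (1.0 + weights)              # amplify salient patches, leave the rest intact
        return gated, weights.squeeze(-1)             # the weights can also drive the qualitative heatmaps
```

Under this reading, the same per-patch weights that reweight the encoder output double as the regional attributions visualized in the paper's heatmaps.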
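The mean $\pm$ std and $95\%$ confidence intervals over three seeds correspond to a standard t-interval; a small sketch follows, using obviously placeholder per-seed scores since the paper's individual seed results are not given here.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed metric values for illustration only (not from the paper).
scores = np.array([0.60, 0.61, 0.59])

mean = scores.mean()
sd = scores.std(ddof=1)                 # sample std, as in a mean +/- std report
n = len(scores)

# 95% CI from the t-distribution (n = 3 seeds -> 2 degrees of freedom)
half_width = stats.t.ppf(0.975, df=n - 1) * sd / np.sqrt(n)
print(f"{mean:.3f} +/- {sd:.3f} (95% CI: [{mean - half_width:.3f}, {mean + half_width:.3f}])")
```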
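The reported decoding settings map directly onto Hugging Face `generate` arguments; the sketch below shows that mapping, where the checkpoint path and the input tensor are placeholders, since no public weights are referenced here.

```python
import torch
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Hypothetical checkpoint path; the paper's Swin-BART weights are not linked here,
# so any VisionEncoderDecoderModel captioner illustrates the same decoding call.
model = VisionEncoderDecoderModel.from_pretrained("path/to/swin-bart-roco")
tokenizer = AutoTokenizer.from_pretrained("path/to/swin-bart-roco")

pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed radiograph

generated = model.generate(
    pixel_values,
    num_beams=4,             # beam search, beam size = 4
    length_penalty=1.1,      # length penalty = 1.1
    no_repeat_ngram_size=3,  # no repeated trigrams
    max_length=128,          # maximum caption length = 128 tokens
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```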
Similar Papers
Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
Image and Video Processing
Creates doctor reports from MRI scans.
Radiology Report Generation with Layer-Wise Anatomical Attention
CV and Pattern Recognition
Helps doctors write X-ray reports faster.
Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers
CV and Pattern Recognition
Finds sickness in medical pictures faster.