Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
By: Yogesh Thakku Suresh, Vishwajeet Shivaji Hogale, Luca-Alexandru Zamfira, and more
Potential Business Impact:
Automatically drafts doctor-style reports from MRI scans.
We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DeiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using a hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on a filtered brain-only MRI subset versus general MRI images, and against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
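To make the alignment objective concrete, below is a minimal PyTorch sketch of a hybrid cosine-MSE loss and of similarity-based caption retrieval of the kind the abstract describes. The 384-dimensional shared embedding space, the 0.5 weighting, and the function names are illustrative assumptions, not the authors' exact configuration; the real system feeds DeiT-Small image features and MediCareBERT caption embeddings into this kind of objective.

```python
# Hedged sketch of the alignment loss and contrastive inference described
# in the abstract. Dimensions, weights, and names are assumptions.
import torch
import torch.nn.functional as F


def hybrid_cosine_mse_loss(img_emb: torch.Tensor,
                           txt_emb: torch.Tensor,
                           alpha: float = 0.5) -> torch.Tensor:
    """Blend a cosine-distance term (angular alignment) with an MSE term
    (coordinate-wise alignment) between paired image and caption embeddings."""
    cosine_term = 1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1).mean()
    mse_term = F.mse_loss(img_emb, txt_emb)
    return alpha * cosine_term + (1.0 - alpha) * mse_term


@torch.no_grad()
def retrieve_caption(img_emb: torch.Tensor,
                     caption_embs: torch.Tensor,
                     captions: list[str]) -> str:
    """Contrastive-style inference: score one image embedding against a bank
    of candidate caption embeddings and return the closest caption."""
    sims = F.cosine_similarity(img_emb.unsqueeze(0), caption_embs, dim=-1)
    return captions[int(sims.argmax())]


# Toy usage with random tensors standing in for encoder outputs, both
# projected to an assumed shared 384-dim space.
img_batch = torch.randn(8, 384)
txt_batch = torch.randn(8, 384)
print(hybrid_cosine_mse_loss(img_batch, txt_batch))
print(retrieve_caption(img_batch[0], txt_batch, [f"caption {i}" for i in range(8)]))
```

Combining the two terms lets the angular component enforce semantic direction while the MSE component keeps embedding magnitudes comparable, which is one plausible reading of why a hybrid objective helps alignment.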
Similar Papers
Transformers for Multimodal Brain State Decoding: Integrating Functional Magnetic Resonance Imaging Data and Medical Metadata
Machine Learning (CS)
Reads brain activity and patient info together.
Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation
CV and Pattern Recognition
Makes computers describe pictures in many languages.
Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning
CV and Pattern Recognition
Helps doctors describe medical pictures faster.