Exploring Protein Language Model Architecture-Induced Biases for Antibody Comprehension
By: Mengren Liu, Yixiang Zhang, and more
Recent advances in protein language models (PLMs) have demonstrated remarkable capabilities in understanding protein sequences. However, the extent to which different model architectures capture antibody-specific biological properties remains unexplored. In this work, we systematically investigate how architectural choices in PLMs influence their ability to comprehend antibody sequence characteristics and functions. We evaluate three state-of-the-art PLMs (AntiBERTa, BioBERT, and ESM2) against a general-purpose language model (GPT-2) baseline on antibody target specificity prediction tasks. Our results demonstrate that while all PLMs achieve high classification accuracy, they exhibit distinct biases in capturing biological features such as V gene usage, somatic hypermutation patterns, and isotype information. Through attention attribution analysis, we show that antibody-specific models such as AntiBERTa naturally learn to focus on complementarity-determining regions (CDRs), whereas general protein models benefit significantly from explicit CDR-focused training strategies. These findings provide insight into the relationship between model architecture and biological feature extraction, offering valuable guidance for future PLM development in computational antibody design.
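The attention attribution analysis described above can be illustrated with a minimal sketch: load a pretrained protein language model, run an antibody heavy-chain sequence through it, and measure how much attention mass falls on CDR positions. This is not the authors' released code; the model checkpoint, the example sequence, and the CDR coordinates are illustrative assumptions.

```python
# Hedged sketch of CDR-focused attention attribution for a protein language model.
# Assumptions: the HuggingFace ESM2 checkpoint below, a toy heavy-chain fragment,
# and placeholder CDR residue indices.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # small ESM2 checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

# Toy heavy-chain fragment and a hypothetical CDR span (0-based residue indices).
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVS"
cdr_spans = [(25, 35)]  # placeholder coordinates, not a real CDR annotation

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of (batch, heads, seq_len, seq_len) per layer.
# Average over layers and heads, then over query positions, to obtain a single
# per-token "attention received" profile.
attn = torch.stack(outputs.attentions).mean(dim=(0, 2))  # (batch, seq, seq)
received = attn.mean(dim=1).squeeze(0)                   # (seq,)

# Map residue indices to token indices (offset by 1 for the leading CLS token).
cdr_tokens = [i + 1 for start, end in cdr_spans for i in range(start, end)]
cdr_fraction = received[cdr_tokens].sum() / received[1:len(sequence) + 1].sum()
print(f"Fraction of attention mass on CDR residues: {cdr_fraction.item():.3f}")
```

Comparing this fraction across models (e.g., AntiBERTa versus ESM2 or GPT-2) is one simple way to quantify whether a given architecture preferentially attends to CDRs.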