Protein as a Second Language for LLMs
By: Xinhui Chen, Zuchao Li, Mengqi Gao, and more
Potential Business Impact:
Lets general-purpose AI models explain what proteins do, without extra training.
Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the "Protein-as-Second-Language" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence-question-answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement (average +7%) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.
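To make the in-context setup concrete, below is a minimal Python sketch of how sequence-question-answer triples might be packed into a single prompt so a generic LLM can answer a new protein question zero-shot. The exemplar data, the query, and the build_prompt helper are illustrative assumptions, not the authors' released code; in particular, the paper's adaptive construction of exemplars is simplified here to a fixed list.

# Minimal sketch of the "protein-as-second-language" prompting idea:
# amino-acid sequences are treated as sentences, and sequence-question-answer
# exemplars are placed in the context so the model completes the final answer.
# All exemplar content below is illustrative, not from the paper's corpus.

EXEMPLARS = [
    {
        "sequence": "MGSSHHHHHHSSGLVPRGSHM",
        "question": "What does the N-terminal region suggest?",
        "answer": "A His-tag, indicating a recombinant expression construct.",
    },
    {
        "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF",
        "question": "What family might this protein belong to?",
        "answer": "The sequence resembles the globin family (hemoglobin alpha).",
    },
]

def build_prompt(exemplars, query_sequence, query_question):
    """Format sequence-question-answer triples as in-context exemplars,
    then append the unanswered query so the model completes the answer."""
    blocks = []
    for ex in exemplars:
        blocks.append(
            "Protein: {seq}\nQ: {q}\nA: {a}".format(
                seq=ex["sequence"], q=ex["question"], a=ex["answer"]
            )
        )
    blocks.append(f"Protein: {query_sequence}\nQ: {query_question}\nA:")
    return "\n\n".join(blocks)

if __name__ == "__main__":
    prompt = build_prompt(
        EXEMPLARS,
        query_sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQA",
        query_question="What biological role might this protein play?",
    )
    # Pass `prompt` to any instruction-tuned LLM; no fine-tuning is needed.
    print(prompt)

Because the exemplars are assembled at inference time, the same prompt template can be reused unchanged across different open-source LLMs or GPT-4, which is what allows the zero-shot gains the abstract reports.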
Similar Papers
Swarms of Large Language Model Agents for Protein Sequence Design with Experimental Validation
Artificial Intelligence
Creates new proteins for medicine and materials.
Aligning Proteins and Language: A Foundation Model for Protein Retrieval
Biomolecules
Finds what proteins do from their shapes.
ProtTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models
Biomolecules
Helps computers understand and design proteins by reading their shapes.