Language Models for Controllable DNA Sequence Design
By: Xingyu Su, Xiner Li, Yuchao Lin, and more
Potential Business Impact:
Designs DNA sequences that perform specific biological jobs, such as promoters and enhancers.
We consider controllable DNA sequence design, where sequences are generated by conditioning on specific biological properties. While language models (LMs) such as GPT and BERT have achieved remarkable success in natural language generation, their application to DNA sequence generation remains largely underexplored. In this work, we introduce ATGC-Gen, an Automated Transformer Generator for Controllable Generation, which leverages cross-modal encoding to integrate diverse biological signals. ATGC-Gen is instantiated with both decoder-only and encoder-only transformer architectures, allowing flexible training and generation under either autoregressive or masked recovery objectives. We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. Compared to prior methods, our model achieves notable improvements in controllability and functional relevance, highlighting the potential of language models in advancing programmable genomic design. The source code is released at https://github.com/divelab/AIRS/blob/main/OpenBio/ATGC_Gen.
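To make the conditioning idea concrete: in a decoder-only setup, property labels can be encoded as prefix tokens, after which DNA bases are sampled autoregressively. The toy sketch below illustrates only that interface; it is not ATGC-Gen's actual architecture, and the property names, tag format, and the stand-in probability function are all hypothetical assumptions for illustration.

```python
import random

DNA_VOCAB = ["A", "T", "G", "C"]

def encode_condition(props):
    # Hypothetical tag scheme: each property becomes a prefix token,
    # e.g. {"class": "promoter"} -> ["<class=promoter>"].
    return [f"<{k}={v}>" for k, v in sorted(props.items())]

def toy_next_token_probs(context):
    # Stand-in for a trained decoder-only transformer's next-token head.
    # As a purely illustrative bias, a "promoter" tag in the prefix
    # shifts probability mass toward G/C.
    if any("promoter" in tok for tok in context):
        return {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35}
    return {tok: 0.25 for tok in DNA_VOCAB}

def generate(props, length, seed=0):
    # Autoregressive sampling: each base is drawn conditioned on the
    # property prefix plus all previously generated bases.
    rng = random.Random(seed)
    context = encode_condition(props)
    seq = []
    for _ in range(length):
        probs = toy_next_token_probs(context + seq)
        tokens, weights = zip(*probs.items())
        seq.append(rng.choices(tokens, weights=weights)[0])
    return "".join(seq)
```

In the real model, `toy_next_token_probs` would be a transformer trained on property-annotated sequences; the masked-recovery (encoder-only) variant would instead fill in masked positions given the same property encoding.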
Similar Papers
GENERator: A Long-Context Generative Genomic Foundation Model
Computation and Language
Helps scientists understand and change DNA code.
ControllableGPT: A Ground-Up Designed Controllable GPT for Molecule Optimization
Machine Learning (CS)
Helps find new medicines by optimizing molecules written as text.
Can Large Language Models Predict Antimicrobial Resistance Gene?
Computation and Language
Helps scientists understand DNA better using large language models.