
Evaluating DNA function understanding in genomic language models using evolutionarily implausible sequences

Published: June 12, 2025 | arXiv ID: 2506.10271v3

By: Shiyu Jiang, Xuyin Liu, Zitong Jerry Wang

Potential Business Impact:

Helps computers design new DNA that works.

Business Areas:
Bioinformatics, Biotechnology, Data and Analytics, Science and Engineering

Genomic language models (gLMs) hold promise for generating novel, functional DNA sequences for synthetic biology. However, realizing this potential requires models to go beyond evolutionary plausibility and understand how DNA sequence encodes gene expression and regulation. We introduce a benchmark called Nullsettes, which assesses how well models can predict in silico loss-of-function (LOF) mutations in synthetic expression cassettes with little evolutionary precedent. Testing 12 state-of-the-art gLMs, we find that most fail to consistently detect these strong LOF mutations. All models show a sharp drop in predictive accuracy as the likelihood assigned to the original (nonmutant) sequence decreases, suggesting that gLMs rely heavily on pattern-matching to their evolutionary prior rather than on any mechanistic understanding of gene expression. Our findings highlight fundamental limitations in how gLMs generalize to engineered, non-natural sequences, and underscore the need for benchmarks and modeling strategies that prioritize functional understanding.
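
The abstract does not spell out how a prediction is scored. One common zero-shot setup, assumed here for illustration only, compares the log-likelihood a model assigns to the original sequence with the log-likelihood it assigns to the mutant, and counts an LOF mutation as detected when the mutant scores lower. The sketch below uses a toy smoothed k-mer model as a stand-in for a gLM; the function names (`train_kmer_model`, `log_likelihood_ratio`, `detection_rate`) and the example sequences are hypothetical and are not taken from the paper or from Nullsettes.

```python
import math
from collections import Counter


def train_kmer_model(sequences, k=3, alphabet="ACGT"):
    """Fit an add-one-smoothed Markov model over k-mers.

    This is a toy stand-in for a genomic language model (an assumption,
    not the paper's method). Returns a function mapping a sequence to
    its log-likelihood under the fitted model.
    """
    kmer_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            kmer_counts[kmer] += 1
            context_counts[kmer[:-1]] += 1

    def log_likelihood(seq):
        total = 0.0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Add-one smoothing keeps unseen k-mers at nonzero probability.
            numerator = kmer_counts[kmer] + 1
            denominator = context_counts[kmer[:-1]] + len(alphabet)
            total += math.log(numerator / denominator)
        return total

    return log_likelihood


def log_likelihood_ratio(score, original, mutant):
    """Mutant log-likelihood minus original log-likelihood."""
    return score(mutant) - score(original)


def detection_rate(score, pairs):
    """Fraction of (original, mutant) pairs where the mutant scores lower."""
    hits = sum(1 for orig, mut in pairs if log_likelihood_ratio(score, orig, mut) < 0)
    return hits / len(pairs)


# Hypothetical data: a "natural-like" training sequence and one mutant pair.
score = train_kmer_model(["ATGATGATGATG"])
original = "ATGATGATG"
mutant = "ATGCCCATG"  # middle codon replaced, purely illustrative
rate = detection_rate(score, [(original, mutant)])
```

Under this setup, the abstract's central observation corresponds to the detection rate falling as `score(original)` decreases, that is, as the unmutated cassette looks less like the sequences the model was trained on.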

Country of Origin
🇺🇸 United States

Page Count
19 pages

Category
Quantitative Biology: Quantitative Methods