Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning
By: Xuan Lin, Qingrui Liu, Hongxin Xiang and more
Potential Business Impact:
Helps scientists invent new medicines faster.
Chemical reaction prediction and retrosynthesis prediction are fundamental tasks in drug discovery. Recently, large language models (LLMs) have shown potential in many domains. However, directly applying LLMs to these tasks faces two major challenges: (i) the lack of a large-scale instruction dataset for chemical synthesis; and (ii) the failure of existing fine-tuning strategies to account for the close correlation between reaction and retrosynthesis prediction. To address these challenges, we propose ChemDual, a novel LLM framework for accurate chemical synthesis. Specifically, given the high cost of acquiring reaction and retrosynthesis data, ChemDual treats the reaction and retrosynthesis of molecules as a related recombination-and-fragmentation process and constructs a large-scale dataset of 4.4 million instructions. Furthermore, ChemDual introduces an enhanced LLaMA, equipped with a multi-scale tokenizer and a dual-task learning strategy, to jointly optimize recombination and fragmentation as well as reaction and retrosynthesis prediction. Extensive experiments on the Mol-Instruction and USPTO-50K datasets demonstrate that ChemDual achieves state-of-the-art performance in both reaction and retrosynthesis prediction, outperforming existing conventional single-task approaches and general open-source LLMs. In molecular docking analysis, ChemDual generates diverse compounds with strong protein binding affinity, further highlighting its potential in drug design.
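To make the link between the two tasks concrete, here is a minimal sketch of how a single reaction record can yield one reaction-prediction example and one retrosynthesis example that are exact inverses of each other. It is an illustration only, not ChemDual's actual data pipeline: the function names, the prompt wording, and the dictionary fields are all hypothetical, and it assumes the simple reaction SMILES form `reactants>>product` with no reagents.

```python
from typing import Dict, List, Tuple


def split_reaction(reaction_smiles: str) -> Tuple[List[str], str]:
    """Split a 'reactants>>product' string into (list of reactants, product).

    Assumption: the simple reaction SMILES form with no reagent field.
    """
    left, sep, right = reaction_smiles.partition(">>")
    if not sep or not left or not right:
        raise ValueError(f"expected 'reactants>>product', got {reaction_smiles!r}")
    # Individual molecules on one side are separated by '.'
    return left.split("."), right


def build_dual_task_pair(reaction_smiles: str) -> List[Dict[str, str]]:
    """Build one forward and one backward instruction example (hypothetical schema)."""
    reactants, product = split_reaction(reaction_smiles)
    joined = ".".join(reactants)
    forward = {
        "task": "reaction_prediction",
        "instruction": "Predict the product of the given reactants.",
        "input": joined,
        "output": product,
    }
    backward = {
        "task": "retrosynthesis",
        "instruction": "Propose reactants that could form the given product.",
        "input": product,
        "output": joined,
    }
    return [forward, backward]


if __name__ == "__main__":
    # Acetic acid + ethanol -> ethyl acetate (by-product water omitted)
    for example in build_dual_task_pair("CC(=O)O.OCC>>CC(=O)OCC"):
        print(example["task"], "|", example["input"], "->", example["output"])
```

Because each example's input is the other's output, a model trained on both sees every reaction from both directions, which is the kind of correlation the dual-task strategy is described as exploiting.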
Similar Papers
LLM-Augmented Chemical Synthesis and Design Decision Programs
Artificial Intelligence
Computers plan how to build new medicines faster.
Leveraging Large Language Models for enzymatic reaction prediction and characterization
Artificial Intelligence
Helps computers guess how tiny body machines work.
Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration
Machine Learning (CS)
Teaches computers to invent new medicines.