SmilesT5: Domain-specific pretraining for molecular language models
By: Philip Spence, Brooks Paige, Anne Osbourn
Potential Business Impact:
Teaches computers to predict drug properties faster.
Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks learn molecular properties using graph-based, language-based, or feature-based methods. Recent advances in natural language processing have highlighted the capabilities of neural networks to learn complex human language using masked language modelling. These approaches to training large transformer-based deep learning models have also been used to learn the language of molecules, as represented by simplified molecular-input line-entry system (SMILES) strings. Here, we present novel domain-specific text-to-text pretraining tasks that yield improved performance on six classification-based molecular property prediction benchmarks, relative to both traditional likelihood-based training and previously proposed fine-tuning tasks. Through ablation studies, we show that data and computational efficiency can be improved by using these domain-specific pretraining tasks. Finally, the pretrained embeddings from the model can be used as fixed inputs into a downstream machine learning classifier, yielding performance comparable to fine-tuning but with much lower computational overhead.
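The final point, using the pretrained encoder's embeddings as fixed features for a lightweight downstream classifier, can be illustrated with a minimal sketch. This is not the authors' released code: the checkpoint name, mean-pooling choice, and logistic-regression classifier below are all assumptions made for illustration, using a generic T5 encoder from the Hugging Face transformers library.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel
from sklearn.linear_model import LogisticRegression

# Hypothetical checkpoint name; the paper's actual pretrained weights may differ.
CHECKPOINT = "google/t5-v1_1-small"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
encoder = T5EncoderModel.from_pretrained(CHECKPOINT)
encoder.eval()  # embeddings are used as fixed inputs, so the encoder is never fine-tuned

def embed_smiles(smiles_list):
    """Mean-pool the encoder's last hidden states into one fixed vector per SMILES string."""
    batch = tokenizer(smiles_list, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (batch, tokens, d_model)
    mask = batch["attention_mask"].unsqueeze(-1)          # (batch, tokens, 1)
    summed = (hidden * mask).sum(dim=1)                   # exclude padding tokens from the pool
    return (summed / mask.sum(dim=1)).numpy()             # (batch, d_model)

# Toy example: frozen embeddings feed a small classifier for a binary property label.
train_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
train_labels = [0, 1, 0, 1]

X_train = embed_smiles(train_smiles)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
print(clf.predict(embed_smiles(["CCCl"])))
```

Because the transformer weights stay frozen, only the small classifier is trained, which is where the reported reduction in computational overhead relative to full fine-tuning comes from.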
Similar Papers
Transformers for molecular property prediction: Domain adaptation efficiently improves performance
Machine Learning (CS)
Finds better medicines faster by learning from drug data.
Enhancing Molecular Property Prediction with Knowledge from Large Language Models
Computation and Language
Finds new medicines faster using smart computer knowledge.
Dual-Modality Representation Learning for Molecular Property Prediction
Machine Learning (CS)
Helps find new medicines faster by combining two ways.