One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra
By: Neng Kai Nigel Neo, Lim Jing, Ngoui Yong Zhau Preston and more
Potential Business Impact:
Finds new molecules from chemical fingerprints.
A common approach to the \emph{de novo} molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt \textsc{MIST}~\citep{MISTgoldmanAnnotatingMetaboliteMass2023} as the encoder and \textsc{MolForge}~\citep{ucakReconstructionLosslessMolecular2023} as the decoder, leveraging pretraining to enhance performance. Notably, pretraining \textsc{MolForge} proves especially effective, enabling it to serve as a robust fingerprint-to-structure decoder. Additionally, instead of passing the probability of each bit in the fingerprint, thresholding the probabilities with a step function helps focus the decoder on the presence of substructures, improving recovery of accurate molecular structures even when the fingerprints predicted by \textsc{MIST} only moderately resemble the ground truth in terms of Tanimoto similarity. This combination of encoder and decoder results in a tenfold improvement over previous state-of-the-art methods, correctly generating 28\% of molecular structures from mass spectra at top-1 and 36\% at top-10. We position this pipeline as a strong baseline for future research in \emph{de novo} molecule elucidation from mass spectra.
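The thresholding step and the Tanimoto comparison mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the 0.5 cutoff, the use of `>=`, and the toy six-bit vectors are all assumptions chosen for illustration, since the abstract does not state the threshold value or the fingerprint length.

```python
from typing import Sequence, List


def binarize(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Apply a step function to per-bit probabilities.

    The 0.5 cutoff is an assumed value, not one taken from the paper.
    """
    return [1 if p >= threshold else 0 for p in probs]


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto similarity of two equal-length binary fingerprints:
    (bits set in both) / (bits set in either)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention assumed here: two all-zero fingerprints count as identical.
    return both / either if either else 1.0


# Hypothetical encoder output: one probability per fingerprint bit.
predicted_probs = [0.91, 0.08, 0.62, 0.47, 0.03, 0.77]
# Hypothetical ground-truth fingerprint for the same molecule.
true_bits = [1, 0, 1, 1, 0, 0]

predicted_bits = binarize(predicted_probs)
similarity = tanimoto(predicted_bits, true_bits)

print(predicted_bits)  # [1, 0, 1, 0, 0, 1]
print(similarity)      # 0.5
```

In this toy case the binarized prediction shares two of four set bits with the ground truth, so the similarity is only 0.5. The decoder would nonetheless receive a clean presence/absence vector rather than graded probabilities, which is the situation the abstract describes.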
Similar Papers
Test-Time Tuned Language Models Enable End-to-end De Novo Molecular Structure Generation from MS/MS Spectra
Artificial Intelligence
Finds new chemicals from their broken pieces.
MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation
Machine Learning (CS)
Helps scientists identify unknown chemicals faster.