Score: 2

One Small Step with Fingerprints, One Giant Leap for emph{De Novo} Molecule Generation from Mass Spectra

Published: August 6, 2025 | arXiv ID: 2508.04180v1

By: Neng Kai Nigel Neo , Lim Jing , Ngoui Yong Zhau Preston and more

Potential Business Impact:

Finds new molecules from chemical fingerprints.

A common approach to the \emph{de novo} molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt \textsc{MIST}~\citep{MISTgoldmanAnnotatingMetaboliteMass2023} as the encoder and \textsc{MolForge}~\citep{ucakReconstructionLosslessMolecular2023} as the decoder, leveraging pretraining to enhance performance. Notably, pretraining \textsc{MolForge} proves especially effective, enabling it to serve as a robust fingerprint-to-structure decoder. Additionally, instead of passing the probability of each bit in the fingerprint, thresholding the probabilities as a step function helps focus the decoder on the presence of substructures, improving recovery of accurate molecular structures even when the fingerprints predicted by \textsc{MIST} only moderately resembles the ground truth in terms of Tanimoto similarity. This combination of encoder and decoder results in a tenfold improvement over previous state-of-the-art methods, generating top-1 28\% / top-10 36\% of molecular structures correctly from mass spectra. We position this pipeline as a strong baseline for future research in \emph{de novo} molecule elucidation from mass spectra.

Repos / Data Links

Page Count
8 pages

Category
Computer Science:
Machine Learning (CS)