PolySet: Restoring the Statistical Ensemble Nature of Polymers for Machine Learning
By: Khalid Ferji
Machine-learning (ML) models in polymer science typically treat a polymer as a single, perfectly defined molecular graph, even though real materials are stochastic ensembles of chains with a distribution of lengths. This mismatch between physical reality and digital representation limits the ability of current models to capture polymer behaviour. Here we introduce PolySet, a framework that represents a polymer as a finite, weighted ensemble of chains sampled from an assumed molar-mass distribution. The ensemble-based encoding is independent of chemical detail and compatible with any molecular representation; it is illustrated here for the homopolymer case using a minimal language model. We show that PolySet retains higher-order distributional moments (such as Mz and Mz+1), enabling ML models to learn tail-sensitive properties with greatly improved stability and accuracy. By explicitly acknowledging the statistical nature of polymer matter, PolySet establishes a physically grounded foundation for future polymer machine learning that extends naturally to copolymers, block architectures, and other complex topologies.
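To make the ensemble idea concrete, the following minimal sketch builds a finite, weighted set of chain lengths and computes the molar-mass averages Mn, Mw, Mz and Mz+1 from it. It is an illustration under stated assumptions, not the PolySet implementation: the Flory most-probable distribution, the extent of reaction `P`, the repeat-unit mass `M0`, the truncation length, and all function names are hypothetical choices made here, and end groups are neglected.

```python
def flory_ensemble(p, n_max):
    """Finite weighted ensemble of chain lengths 1..n_max.

    Weights are number fractions from the Flory most-probable
    distribution P(n) = (1 - p) * p**(n - 1), renormalised after
    truncation at n_max. The distribution itself is an assumption.
    """
    lengths = list(range(1, n_max + 1))
    raw = [(1.0 - p) * p ** (n - 1) for n in lengths]
    total = sum(raw)
    weights = [r / total for r in raw]
    return lengths, weights


def molar_mass_averages(lengths, weights, m0):
    """Mn, Mw, Mz, Mz+1 as ratios of successive number-weighted moments."""
    masses = [n * m0 for n in lengths]  # end groups neglected

    def moment(k):
        return sum(w * m ** k for w, m in zip(weights, masses))

    mu = [moment(k) for k in range(5)]
    return {
        "Mn": mu[1] / mu[0],
        "Mw": mu[2] / mu[1],
        "Mz": mu[3] / mu[2],
        "Mz+1": mu[4] / mu[3],
    }


# Hypothetical parameters: a styrene-like repeat unit and p = 0.99.
M0 = 104.15   # g/mol, assumed repeat-unit molar mass
P = 0.99      # assumed extent of reaction

lengths, weights = flory_ensemble(P, n_max=5000)
ensemble = molar_mass_averages(lengths, weights, M0)

# Single-graph view: one chain at the number-average length (1 / (1 - p)).
single = molar_mass_averages([100], [1.0], M0)

for name in ("Mn", "Mw", "Mz", "Mz+1"):
    print(f"{name:5s} ensemble = {ensemble[name]:10.1f}   single chain = {single[name]:10.1f}")
```

In this sketch the single-chain representation returns the same value for every average, whereas the weighted ensemble keeps Mn, Mw, Mz and Mz+1 distinct. This is the tail information that the abstract says a single molecular graph discards.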