KinForm: Kinetics Informed Feature Optimised Representation Models for Enzyme $k_{cat}$ and $K_{M}$ Prediction
By: Saleh Alwer, Ronan Fleming
Potential Business Impact:
Helps predict how fast tiny body helpers work.
Kinetic parameters such as the turnover number ($k_{cat}$) and Michaelis constant ($K_{\mathrm{M}}$) are essential for modelling enzymatic activity but experimental data remains limited in scale and diversity. Previous methods for predicting enzyme kinetics typically use mean-pooled residue embeddings from a single protein language model to represent the protein. We present KinForm, a machine learning framework designed to improve predictive accuracy and generalisation for kinetic parameters by optimising protein feature representations. KinForm combines several residue-level embeddings (Evolutionary Scale Modeling Cambrian, Evolutionary Scale Modeling 2, and ProtT5-XL-UniRef50), taken from empirically selected intermediate transformer layers and applies weighted pooling based on per-residue binding-site probability. To counter the resulting high dimensionality, we apply dimensionality reduction using principal--component analysis (PCA) on concatenated protein features, and rebalance the training data via a similarity-based oversampling strategy. KinForm outperforms baseline methods on two benchmark datasets. Improvements are most pronounced in low sequence similarity bins. We observe improvements from binding-site probability pooling, intermediate-layer selection, PCA, and oversampling of low-identity proteins. We also find that removing sequence overlap between folds provides a more realistic evaluation of generalisation and should be the standard over random splitting when benchmarking kinetic prediction models.
Similar Papers
Pseudodata-guided Invariant Representation Learning Boosts the Out-of-Distribution Generalization in Enzymatic Kinetic Parameter Prediction
Machine Learning (CS)
Makes computer enzyme predictions more reliable.
Multimodal Regression for Enzyme Turnover Rates Prediction
Machine Learning (CS)
Predicts how fast enzymes work using AI.
Protein Structure-Function Relationship: A Kernel-PCA Approach for Reaction Coordinate Identification
Computation and Language
Finds how proteins work by looking at their moves.