Kernel-Based Evaluation of Conditional Biological Sequence Models
By: Pierre Glaser, Steffanie Paul, Alissa M. Hummer and more
Potential Business Impact:
Helps scientists check if computer models understand biology.
We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model's estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We are able to reject the hypothesis that ProteinMPNN fits its data for various protein families, and tune the model's temperature hyperparameter to achieve a better fit.
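To give a feel for the kind of quantity involved, here is a minimal sketch of the standard unbiased (U-statistic) estimator of the squared Maximum Mean Discrepancy between two samples of sequences. This is the plain, unconditional MMD, not the paper's ACMMD, whose exact form is not given in the abstract. The exponentiated Hamming kernel, the `gamma` parameter, and the function names `hamming_kernel` and `mmd2_unbiased` are illustrative assumptions rather than the authors' choices.

```python
import math


def hamming_kernel(a, b, gamma=0.5):
    """Exponentiated Hamming kernel on equal-length strings.

    Assumed here purely for illustration; the paper may use other kernels.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    mismatches = sum(ca != cb for ca, cb in zip(a, b))
    return math.exp(-gamma * mismatches)


def mmd2_unbiased(xs, ys, kernel=hamming_kernel):
    """Unbiased U-statistic estimate of MMD^2 between samples xs and ys.

    xs could be observed sequences and ys samples drawn from a model.
    The estimate can be slightly negative when the two samples are similar.
    """
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample needs at least two points")
    # Within-sample averages exclude the diagonal terms (i == j).
    k_xx = sum(kernel(xs[i], xs[j])
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_yy = sum(kernel(ys[i], ys[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    # The cross-sample average uses all pairs.
    k_xy = sum(kernel(x, y) for x in xs for y in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    data = ["AAAA", "AAAC"]      # toy "observed" sequences
    samples = ["GGGG", "GGGT"]   # toy "model" samples
    print(mmd2_unbiased(data, samples))
```

A value close to zero suggests the two samples are hard to distinguish under the chosen kernel, while a clearly positive value suggests a mismatch. Turning the estimate into a hypothesis test would additionally require a null distribution (for example via permutation), which this sketch omits.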
Similar Papers
Fast and Scalable Score-Based Kernel Calibration Tests
Machine Learning (Stat)
Checks if computer predictions are trustworthy.
Demystify Protein Generation with Hierarchical Conditional Diffusion Models
Machine Learning (CS)
Designs new proteins that work as intended.
SCMD: A Kernel-Based Distance for Structural Causal Models to Quantify Transferability Across Environments
Statistics Theory
Measures how well AI works in new places.