MergeGuard: Efficient Thwarting of Trojan Attacks in Machine Learning Models
By: Soheil Zibakhsh Shabgahi, Yaman Jandali, Farinaz Koushanfar
Potential Business Impact:
Protects AI from hidden sabotage.
This paper proposes MergeGuard, a novel methodology for mitigating AI Trojan attacks. Trojan attacks on AI models cause inputs embedded with triggers to be misclassified into an adversary's target class, posing a significant threat to the usability of models trained by an untrusted third party. The core of MergeGuard is a new post-training methodology that linearizes and merges fully connected layers, which we show simultaneously improves model generalizability and performance. Our proof-of-concept evaluation on Transformer models demonstrates that MergeGuard maintains model accuracy while decreasing the Trojan attack success rate, outperforming commonly used post-training fine-tuning-based Trojan mitigation methodologies.
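The abstract does not detail how the layer merging works, but merging two consecutive fully connected layers is only possible once the nonlinearity between them is removed (linearized). A minimal sketch of that collapse, with illustrative shapes and random weights (all names and dimensions here are assumptions, not the paper's implementation):

```python
import numpy as np

# Hypothetical sketch: with the activation between two fully connected
# layers linearized away, y = W2 @ (W1 @ x + b1) + b2 collapses into a
# single equivalent layer y = W @ x + b. Shapes and the seed are
# illustrative assumptions.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 16)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((4, 8)), rng.standard_normal(4)

# Parameters of the single merged replacement layer.
W = W2 @ W1        # shape (4, 16)
b = W2 @ b1 + b2   # shape (4,)

x = rng.standard_normal(16)
two_layer_out = W2 @ (W1 @ x + b1) + b2
merged_out = W @ x + b
assert np.allclose(two_layer_out, merged_out)
```

The merged layer is mathematically exact under linearization; any accuracy trade-off would come from removing the nonlinearity itself, not from the merge.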
Similar Papers
Defending Unauthorized Model Merging via Dual-Stage Weight Protection
CV and Pattern Recognition
Stops others from stealing and breaking your AI models.
Do Not Merge My Model! Safeguarding Open-Source LLMs Against Unauthorized Model Merging
Cryptography and Security
Stops others from copying smart computer programs.