MergeGuard: Efficient Thwarting of Trojan Attacks in Machine Learning Models

Published: May 6, 2025 | arXiv ID: 2505.04015v1

By: Soheil Zibakhsh Shabgahi, Yaman Jandali, Farinaz Koushanfar

Potential Business Impact:

Protects AI from hidden sabotage.

Business Areas:
Intrusion Detection, Information Technology, Privacy and Security

This paper proposes MergeGuard, a novel methodology for mitigating AI Trojan attacks. Trojan attacks on AI models cause inputs embedded with triggers to be misclassified as an adversary's target class, posing a significant threat to the usability of models trained by an untrusted third party. The core of MergeGuard is a new post-training methodology for linearizing and merging fully connected layers, which we show simultaneously improves model generalizability and performance. Our proof-of-concept evaluation on Transformer models demonstrates that MergeGuard maintains model accuracy while decreasing the Trojan attack success rate, outperforming commonly used post-training Trojan mitigation methodologies based on fine-tuning.
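
The abstract does not spell out how layers are linearized or merged, so the following is only a minimal sketch of the underlying algebra, not the authors' implementation. It assumes the nonlinearity between two fully connected layers has already been replaced by the identity, in which case the pair collapses into one layer: W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). All names (`merge_linear`, `rand_matrix`, the layer sizes) are hypothetical and chosen for illustration.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def merge_linear(W1, b1, W2, b2):
    """Fold y = W2 (W1 x + b1) + b2 into a single layer y = W x + b.

    Assumption: no activation sits between the two layers (it has been
    linearized to the identity), so the composition is itself affine.
    """
    W = matmul(W2, W1)
    b = add(matvec(W2, b1), b2)
    return W, b


# Hypothetical toy dimensions and random weights for the check below.
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def rand_vector(n):
    return [rng.uniform(-1, 1) for _ in range(n)]


d_in, d_hidden, d_out = 4, 8, 3
W1, b1 = rand_matrix(d_hidden, d_in), rand_vector(d_hidden)
W2, b2 = rand_matrix(d_out, d_hidden), rand_vector(d_out)
x = rand_vector(d_in)

# Output of the original two-layer stack (identity activation in between).
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Output of the single merged layer.
W, b = merge_linear(W1, b1, W2, b2)
merged = add(matvec(W, x), b)

max_diff = max(abs(p - q) for p, q in zip(two_layer, merged))
print("outputs agree:", max_diff < 1e-9)
```

The merged layer has `d_out * d_in` weights instead of `d_hidden * (d_in + d_out)`, so when the hidden width is large the merged model is also smaller. Whether and how this affects Trojan behavior is the paper's claim, which this sketch does not test.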

Country of Origin
🇺🇸 United States

Page Count
7 pages

Category
Computer Science:
Cryptography and Security