MalRAG: A Retrieval-Augmented LLM Framework for Open-set Malicious Traffic Identification
By: Xiang Luo, Chang Liu, Gang Xiong and more
Potential Business Impact:
Automatically identifies known malicious network traffic and flags previously unseen kinds.
Fine-grained identification of IDS-flagged suspicious traffic is crucial in cybersecurity. In practice, cyber threats evolve continuously, so discovering novel malicious traffic is as critical as identifying known classes. Recent studies have advanced this goal with deep models, but they often rely on task-specific architectures that limit transferability and require per-dataset tuning. In this paper, we introduce MalRAG, the first LLM-driven retrieval-augmented framework for open-set malicious traffic identification. MalRAG freezes the LLM and operates via comprehensive traffic knowledge construction, adaptive retrieval, and prompt engineering. Concretely, we construct a multi-view traffic database by mining prior malicious traffic from content, structural, and temporal perspectives. We then introduce a Coverage-Enhanced Retrieval Algorithm that queries across these views to assemble the most probable candidates, thereby improving the inclusion of correct evidence. Next, we employ Traffic-Aware Adaptive Pruning to select a variable subset of these candidates based on traffic-aware similarity scores, suppressing incorrect matches and yielding reliable retrieved evidence. Finally, we develop a suite of guidance prompts in which task instruction, evidence referencing, and decision guidance are integrated with the retrieved evidence to improve LLM performance. Across diverse real-world datasets and settings, MalRAG delivers state-of-the-art results in both fine-grained identification of known classes and novel malicious traffic discovery. Ablation and deep-dive analyses further show that MalRAG effectively leverages LLM capabilities while achieving open-set malicious traffic identification without relying on a specific LLM.
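The retrieval pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the feature encodings, the similarity measure, or the pruning rule, so the sketch assumes each view is a plain numeric vector, uses cosine similarity, takes the union of per-view top-k hits as a stand-in for coverage-enhanced retrieval, and uses a fixed similarity threshold as a stand-in for adaptive pruning. All names (`retrieve_candidates`, `prune_candidates`, `build_prompt`, `k`, `threshold`) are hypothetical.

```python
from math import sqrt

VIEWS = ("content", "structural", "temporal")  # the three views named in the abstract


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_candidates(query, database, k=2):
    """Assumed stand-in for coverage-enhanced retrieval.

    Takes the top-k database entries per view and returns their union as
    {entry_index: best similarity seen in any view}.
    """
    best = {}
    for view in VIEWS:
        scored = sorted(
            ((cosine(query[view], entry["views"][view]), idx)
             for idx, entry in enumerate(database)),
            reverse=True,
        )
        for score, idx in scored[:k]:
            best[idx] = max(best.get(idx, -1.0), score)
    return best


def prune_candidates(candidates, threshold=0.8):
    """Assumed stand-in for adaptive pruning: keep a variable-size subset above a threshold."""
    return {idx: s for idx, s in candidates.items() if s >= threshold}


def build_prompt(kept, database):
    """Combine task instruction, retrieved evidence, and decision guidance into one prompt."""
    lines = ["Task: identify the class of the suspicious traffic sample."]
    if kept:
        lines.append("Retrieved evidence (label, similarity):")
        for idx, score in sorted(kept.items(), key=lambda t: t[1], reverse=True):
            lines.append(f"- {database[idx]['label']}: {score:.2f}")
    else:
        lines.append("Retrieved evidence: none above the similarity threshold.")
    lines.append("Decision: answer with a known label, or 'novel' if the evidence does not fit.")
    return "\n".join(lines)
```

In this sketch an empty evidence set is passed to the frozen LLM as a hint that the sample may belong to a novel class; the final decision would still be left to the model rather than hard-coded.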
Similar Papers
Who Stole Your Data? A Method for Detecting Unauthorized RAG Theft
Information Retrieval
Protects AI writing from being stolen.
Private-RAG: Answering Multiple Queries with LLMs while Keeping Your Data Private
Machine Learning (CS)
Keeps private information safe when computers answer questions.
Adapting Large Language Models to Emerging Cybersecurity using Retrieval Augmented Generation
Cryptography and Security
Helps computers spot new cyber threats faster.