CTI-HAL: A Human-Annotated Dataset for Cyber Threat Intelligence Analysis
By: Sofia Della Penna, Roberto Natella, Vittorio Orbinato and more
Potential Business Impact:
Helps computers understand online threats faster.
Organizations are increasingly targeted by Advanced Persistent Threats (APTs), which involve complex, multi-stage tactics and diverse techniques. Cyber Threat Intelligence (CTI) sources, such as incident reports and security blogs, provide valuable insights, but they are often unstructured and written in natural language, which makes automatic information extraction difficult. Recent studies have explored the use of AI to perform automatic extraction from CTI data, leveraging existing CTI datasets for performance evaluation and fine-tuning. However, these datasets present challenges and limitations that affect their effectiveness. To overcome these issues, we introduce a novel dataset manually constructed from CTI reports and structured according to the MITRE ATT&CK framework. To assess its quality, we conducted an inter-annotator agreement study using Krippendorff's alpha, confirming its reliability. Furthermore, the dataset was used to evaluate a Large Language Model (LLM) in a real-world business context, showing promising generalizability.
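For readers unfamiliar with the agreement measure named in the abstract, here is a minimal sketch of Krippendorff's alpha for nominal labels. It is not the authors' code. The abstract does not say which variant of alpha was used or what counts as an annotation unit, so the nominal metric, the function name krippendorff_alpha_nominal, and the sample labels (MITRE ATT&CK technique IDs T1566 and T1059, chosen only for illustration) are all assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    `units` is a list of lists: one inner list per annotated item, holding
    the label each annotator assigned (None marks a missing annotation).
    """
    # Coincidence matrix: each ordered pair of labels given by different
    # annotators to the same unit adds 1 / (m - 1), with m labels in the unit.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement share a 1/n factor, dropped here.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined: need at least two distinct "
                         "labels among pairable units")
    return 1.0 - observed / expected


# Hypothetical annotations: two annotators labeling four report sentences.
sample = [
    ["T1566", "T1566"],
    ["T1566", "T1059"],
    ["T1059", "T1059"],
    ["T1059", "T1059"],
]
print(round(krippendorff_alpha_nominal(sample), 3))  # 0.533
```

An alpha of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.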
Similar Papers
CTI Dataset Construction from Telegram
Cryptography and Security
Finds online dangers from chat messages.
Enabling Transparent Cyber Threat Intelligence Combining Large Language Models and Domain Ontologies
Cryptography and Security
Helps computers find bad guys in computer logs.
Towards Effective Identification of Attack Techniques in Cyber Threat Intelligence Reports using Large Language Models
Cryptography and Security
Helps computers find online dangers faster.