CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance

Published: December 10, 2025 | arXiv ID: 2512.09506v1

By: Jinru Ding, Chao Ding, Wenrao Pang and more

Large language models are increasingly deployed across the financial sector for tasks such as research, compliance, risk analysis, and customer service, which makes rigorous safety evaluation essential. However, existing financial benchmarks focus primarily on textbook-style question answering and numerical problem solving and fail to evaluate models' real-world safety behaviors. They only weakly assess regulatory compliance and investor-protection norms, rarely stress-test multi-turn adversarial tactics such as jailbreaks or prompt injection, inconsistently ground answers in long filings, ignore tool- or RAG-induced over-reach risks, and rely on opaque or non-auditable evaluation protocols. To close these gaps, we introduce CNFinBench, a benchmark that employs finance-tailored red-team dialogues and is structured around a Capability-Compliance-Safety triad, including evidence-grounded reasoning over long reports and jurisdiction-aware rule/tax compliance tasks. For systematic safety quantification, we introduce the Harmful Instruction Compliance Score (HICS), which measures how consistently models resist harmful prompts across multi-turn adversarial dialogues. To ensure auditability, CNFinBench enforces strict output formats with dynamic option perturbation for objective tasks and employs a hybrid LLM-ensemble plus human-calibrated judge for open-ended evaluations. Experiments on 21 models across 15 subtasks confirm a persistent capability-compliance gap: models achieve an average score of 61.0 on capability tasks but fall to 34.18 on compliance and risk-control evaluations. Under multi-turn adversarial dialogue tests, most systems reach only partial resistance (HICS 60-79), demonstrating that refusal alone is not a reliable proxy for safety without cited and verifiable reasoning.
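
The abstract does not spell out how dynamic option perturbation is implemented. One common reading is that the answer options of a multiple-choice item are reshuffled between runs, so that a model which favors a fixed position (rather than the correct content) is exposed. The sketch below illustrates that reading only; the function names, the seeding scheme, and the consistency check are assumptions for illustration and are not taken from CNFinBench.

```python
import random


def perturb_options(options, answer_index, seed):
    """Shuffle answer options (hypothetical helper, not CNFinBench code).

    Returns the shuffled options and the new index of the correct answer.
    """
    rng = random.Random(seed)          # seeded so each perturbation is reproducible
    order = list(range(len(options)))
    rng.shuffle(order)                 # order[k] = original index now at position k
    shuffled = [options[i] for i in order]
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


def consistent_under_perturbation(options, answer_index, choose, seeds):
    """Check that `choose` picks the correct option under every shuffle.

    `choose` is an assumed stand-in for a model call: it takes the list of
    option strings and returns the index it selects.
    """
    for seed in seeds:
        shuffled, new_answer_index = perturb_options(options, answer_index, seed)
        if choose(shuffled) != new_answer_index:
            return False
    return True


# Illustrative item; the wording is invented for this example.
options = ["Refuse and cite the rule", "Comply", "Ask for more funds", "Ignore"]
answer_index = 0

# A chooser that keys on content stays correct no matter how options move.
content_based = lambda opts: opts.index("Refuse and cite the rule")
```

A chooser that always returns the same position would pass only if the correct option happened to land there in every shuffle, which is the kind of position bias this sort of check is meant to surface.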

SafeLawBench: Towards Safe Alignment of Large Language Models

Computation and Language

Tests AI for safe and legal answers.

7 Jun 2025 2

89%

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

Machine Learning (CS)

Tests AI for harmful text, images, audio, and video.

13 Nov 2025 1

88%

SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models

Artificial Intelligence

Tests AI reasoning for hidden dangers.

19 Nov 2025 1
