
Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

Published: December 27, 2025 | arXiv ID: 2512.22671v1

By: Pere Martra

Potential Business Impact:

Makes AI better at following orders, not just facts.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for the Llama-3.2-1B and Llama-3.2-3B models), and multi-step reasoning (MUSR) remains robust. This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3.2-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model's ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to a 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.
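The abstract does not spell out how the MAW criterion scores neurons, so the following is only a minimal sketch under stated assumptions: each intermediate neuron of a GLU-MLP is scored by the largest absolute weight in its gate and up projection rows, and the top-scoring neurons are kept until the target expansion ratio (intermediate size divided by hidden size) is reached. The function names `maw_scores` and `prune_glu_mlp`, the toy dimensions, and the exact scoring formula are illustrative and are not taken from the paper.

```python
import random


def maw_scores(gate, up):
    """Score each intermediate neuron by its largest absolute weight.

    Assumption: one plausible reading of "Maximum Absolute Weight";
    the paper's exact formula may differ.
    gate, up: lists of rows, shape (intermediate, hidden).
    """
    return [
        max(max(abs(w) for w in g_row), max(abs(w) for w in u_row))
        for g_row, u_row in zip(gate, up)
    ]


def prune_glu_mlp(gate, up, down, expansion_ratio):
    """Keep the highest-scoring neurons so that intermediate/hidden ~= expansion_ratio.

    down has shape (hidden, intermediate): removing neuron i drops
    row i of gate and up, and column i of down.
    """
    hidden = len(gate[0])
    keep_n = max(1, min(len(gate), round(expansion_ratio * hidden)))
    scores = maw_scores(gate, up)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:keep_n])  # preserve the original neuron order
    new_gate = [gate[i] for i in keep]
    new_up = [up[i] for i in keep]
    new_down = [[row[i] for i in keep] for row in down]
    return new_gate, new_up, new_down, keep


# Toy layer (hypothetical sizes): hidden=4, intermediate=16, expansion ratio 4.0.
rng = random.Random(0)
hidden, intermediate = 4, 16
gate = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(intermediate)]
up = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(intermediate)]
down = [[rng.uniform(-1, 1) for _ in range(intermediate)] for _ in range(hidden)]

# Prune to an expansion ratio of 2.0, i.e. 8 intermediate neurons.
pruned_gate, pruned_up, pruned_down, kept = prune_glu_mlp(gate, up, down, 2.0)
print(len(pruned_gate), len(pruned_down[0]), kept)
```

Because the pruning is structured, the result is a smaller dense layer rather than a sparse one, which is why the expansion ratio can be treated as an architectural parameter in its own right.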

Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study

Computation and Language

Makes smart AI programs smaller and faster.

28 Jul 2025 2

88%

Instruction-Following Pruning for Large Language Models

Computation and Language

Makes big AI models smaller, faster, and smarter.

3 Jan 2025 1

88%

Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

Computation and Language

Makes smart AI smaller without losing its thinking.

1 Dec 2025 1


Repos / Data Links

github.com

Page Count

23 pages


