$D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness
By: Lang Xiong, Ning Liu, Ao Ren, and more
Large language models (LLMs) face significant deployment challenges due to their massive computational demands. While pruning offers a promising compression solution, existing methods suffer from two critical limitations: (1) they neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimation; (2) they overlook the long-tail distribution of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, $D^2Prune$. First, we propose a dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, enabling more accurate pruning mask selection and weight updating and minimizing the error introduced during pruning. Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence between attention distributions and the reconstruction error. Extensive experiments show that $D^2Prune$ consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, and Qwen3). Moreover, the dynamic attention update mechanism also generalizes well to ViT-based vision models such as DeiT, achieving superior accuracy on ImageNet-1K.
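To make the second contribution concrete, the following is a minimal sketch (not the authors' released code) of the joint objective the abstract describes: a reconstruction term on the pruned layer output plus a KL-divergence term that keeps the pruned model's attention distribution close to the dense one. The function name `attention_aware_loss`, the trade-off coefficient `kl_weight`, and the tensor shapes are illustrative assumptions.

```python
# Hedged sketch of the attention-aware objective described in the abstract:
# reconstruction error + KL divergence between dense and pruned attention.
import torch
import torch.nn.functional as F


def attention_aware_loss(
    y_dense: torch.Tensor,      # dense layer output, shape (batch, seq, hidden)
    y_pruned: torch.Tensor,     # layer output after applying the pruning mask
    attn_dense: torch.Tensor,   # dense attention probs, (batch, heads, seq, seq)
    attn_pruned: torch.Tensor,  # attention probs recomputed with pruned weights
    kl_weight: float = 1.0,     # assumed trade-off coefficient (hypothetical)
) -> torch.Tensor:
    # Reconstruction term: squared error between dense and pruned outputs.
    recon = F.mse_loss(y_pruned, y_dense)
    # KL term: preserve the (long-tail) attention distribution of the dense model.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    kl = F.kl_div(
        attn_pruned.clamp_min(1e-9).log(), attn_dense, reduction="batchmean"
    )
    return recon + kl_weight * kl
```

In a layer-wise pruning loop, this loss would be evaluated on calibration batches to drive the dynamic weight update after mask selection; the exact weighting and optimization schedule used in the paper are not specified in the abstract.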