Making Large Language Models Efficient Dense Retrievers
By: Yibin Lei, Shwai He, Ang Li, and more
Potential Business Impact:
Makes AI search faster and the underlying models smaller.
Recent work has shown that directly fine-tuning large language models (LLMs) for dense retrieval yields strong performance, but their substantial parameter counts make them computationally inefficient. While prior studies have revealed significant layer redundancy in LLMs for generative tasks, it remains unclear whether similar redundancy exists when these models are adapted for retrieval tasks, which require encoding entire sequences into fixed representations rather than generating tokens iteratively. To this end, we conduct a comprehensive analysis of layer redundancy in LLM-based dense retrievers. We find that, in contrast to generative settings, MLP layers are substantially more prunable, while attention layers remain critical for semantic aggregation. Building on this insight, we propose EffiR, a framework for developing efficient retrievers that performs large-scale MLP compression through a coarse-to-fine strategy (coarse-grained depth reduction followed by fine-grained width reduction), combined with retrieval-specific fine-tuning. Across diverse BEIR datasets and LLM backbones, EffiR achieves substantial reductions in model size and inference cost while preserving the performance of full-size models.
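To make the coarse-to-fine idea concrete, here is a minimal PyTorch sketch of the two compression steps the abstract describes: coarse-grained depth reduction (removing entire MLP sub-layers while keeping attention intact) followed by fine-grained width reduction (shrinking the hidden width of the remaining MLPs). The class and function names (ToyBlock, prune_mlp_depth, shrink_mlp_width) and the magnitude-based neuron importance score are illustrative assumptions, not the paper's actual EffiR implementation.

```python
# Toy illustration of coarse-to-fine MLP compression for a transformer encoder.
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """One transformer block: self-attention + an optionally prunable MLP."""

    def __init__(self, d_model: int = 64, d_ff: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # MLP (feed-forward) sub-layer; may be removed or narrowed later.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        if self.mlp is not None:  # coarse pruning may have removed the MLP
            x = self.norm2(x + self.mlp(x))
        return x


def prune_mlp_depth(blocks: nn.ModuleList, drop_layers) -> None:
    """Coarse-grained step: delete the entire MLP sub-layer in selected blocks,
    keeping attention intact (the abstract notes attention stays critical)."""
    for i in drop_layers:
        blocks[i].mlp = None


def shrink_mlp_width(block: ToyBlock, keep_ratio: float = 0.5) -> None:
    """Fine-grained step: keep only the highest-magnitude hidden neurons of the
    surviving MLPs (a simple importance proxy assumed here for illustration)."""
    if block.mlp is None:
        return
    fc1, act, fc2 = block.mlp[0], block.mlp[1], block.mlp[2]
    k = max(1, int(fc1.out_features * keep_ratio))
    # Score each hidden neuron by the L2 norm of its input weights.
    scores = fc1.weight.norm(dim=1)
    keep = scores.topk(k).indices.sort().values
    new_fc1 = nn.Linear(fc1.in_features, k)
    new_fc2 = nn.Linear(k, fc2.out_features)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[keep])
        new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])
        new_fc2.bias.copy_(fc2.bias)
    block.mlp = nn.Sequential(new_fc1, act, new_fc2)


# Usage: prune MLPs coarsely, then narrow the survivors; retrieval-specific
# fine-tuning would follow to recover any lost accuracy.
blocks = nn.ModuleList(ToyBlock() for _ in range(4))
prune_mlp_depth(blocks, drop_layers=[1, 3])   # drop MLPs in layers 1 and 3
for b in blocks:
    shrink_mlp_width(b, keep_ratio=0.5)       # halve remaining MLP widths
x = torch.randn(2, 10, 64)
for b in blocks:
    x = b(x)
print(x.shape)  # torch.Size([2, 10, 64])
```

In this sketch, both steps shrink only the MLP parameters, mirroring the paper's finding that MLP layers are far more prunable than attention layers for retrieval.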
Similar Papers
Benchmarking Information Retrieval Models on Complex Retrieval Tasks
Information Retrieval
Helps computers find information even when questions are tricky.
Think Before You Retrieve: Learning Test-Time Adaptive Search with Small Language Models
Artificial Intelligence
Teaches small AI models to find information better.
When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection
Computation and Language
Checks if news is true, faster and better.