Experts are all you need: A Composable Framework for Large Language Model Inference
By: Shrihari Sridharan, Sourjya Roy, Anand Raghunathan, and more
Potential Business Impact:
Makes AI smarter and faster by having specialized experts work as a team.
Large Language Models (LLMs) have achieved state-of-the-art accuracy on a variety of natural language processing (NLP) tasks. However, this success comes at the cost of increased model size, which adds computational burden. Mixture-of-Experts (MoE) models overcome this bottleneck by decoupling model capacity from computation: only a subset of parameters, or "experts", is activated for each input. However, MoE models require joint pretraining of the experts along with the router, and they do not model multi-step reasoning. In contrast, multi-agent frameworks improve reasoning by decomposing complex problems into modular subtasks, but they rely on sequential "plan-act-observe" loops, which introduce significant latency. Our work, Comp-LLM, addresses these challenges by introducing a composable inference framework that enables cross-expert collaboration via an explicit sub-query dependency graph. Comp-LLM consists of three components: (1) a Sub-query Generator that decomposes an input query, assigns each sub-query to an appropriate expert using embedding similarity, and constructs a dependency graph; (2) a Query Executor that processes nodes in the graph and identifies opportunities for parallelism based on dependencies and resource constraints; and (3) a Response Aggregator that synthesizes intermediate expert responses into a coherent final answer. Across several benchmarks, Comp-LLM achieves up to 11.01% higher accuracy than monolithic LLMs of similar size, while offering a 1.67x to 3.56x reduction in model size with no significant degradation relative to the largest model in its family. Additionally, Comp-LLM delivers a 1.1x to 1.7x latency improvement over sequential sub-query processing.
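The abstract does not include code, but the three-component pipeline is concrete enough to sketch. Below is a minimal Python illustration, under stated assumptions: the names (`SubQuery`, `assign_expert`, `execute`), the toy character-histogram embedding, and the stub experts are all placeholders invented here, not the authors' implementation. The sketch shows the core idea of routing sub-queries to experts by embedding similarity and executing a dependency graph with parallelism.

```python
# Minimal sketch of a Comp-LLM-style pipeline as described in the abstract.
# All names and components here are illustrative assumptions, not the paper's code.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
import math

def embed(text: str) -> list[float]:
    """Placeholder embedding: normalized character histogram.
    A real system would use a sentence-embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Hypothetical expert pool: name -> (profile text, LLM stand-in callable).
EXPERTS = {
    "math":   ("arithmetic algebra calculation", lambda q, ctx: f"[math answer to: {q}]"),
    "coding": ("python code programming bug",    lambda q, ctx: f"[code answer to: {q}]"),
}

@dataclass
class SubQuery:
    id: str
    text: str
    deps: list[str] = field(default_factory=list)  # ids this node depends on
    expert: str = ""
    result: str = ""

def assign_expert(sq: SubQuery) -> None:
    """Sub-query Generator step: route to the expert whose profile
    embedding is most similar to the sub-query embedding."""
    q_vec = embed(sq.text)
    sq.expert = max(EXPERTS, key=lambda e: cosine(q_vec, embed(EXPERTS[e][0])))

def execute(graph: dict[str, SubQuery], max_workers: int = 4) -> str:
    """Query Executor step: run nodes whose dependencies are all resolved
    in parallel, level by level, within a worker budget."""
    done: set[str] = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < len(graph):
            ready = [sq for sq in graph.values()
                     if sq.id not in done and all(d in done for d in sq.deps)]
            if not ready:
                raise ValueError("dependency cycle detected")
            futures = {}
            for sq in ready:
                ctx = {d: graph[d].result for d in sq.deps}  # upstream answers
                futures[pool.submit(EXPERTS[sq.expert][1], sq.text, ctx)] = sq
            for fut, sq in futures.items():
                sq.result = fut.result()
                done.add(sq.id)
    # Response Aggregator step: here a plain join; the paper synthesizes
    # intermediate responses into a coherent final answer with an LLM.
    return " | ".join(graph[i].result for i in graph)

# Usage: q1 and q2 are independent, so they run concurrently; q3 waits on both.
graph = {
    "q1": SubQuery("q1", "compute 12 * 7"),
    "q2": SubQuery("q2", "write a python loop"),
    "q3": SubQuery("q3", "combine the results", deps=["q1", "q2"]),
}
for sq in graph.values():
    assign_expert(sq)
print(execute(graph))
```

In this toy run, the two independent sub-queries execute in parallel while the dependent one waits, which mirrors the source of the latency gain the abstract claims over strictly sequential sub-query processing.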
Similar Papers
Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
Computation and Language
Makes AI smarter, faster, and lighter on memory.
Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
Distributed, Parallel, and Cluster Computing
Predicts how long computers take to learn, without needing supercomputers.
Frontier: Simulating the Next Generation of LLM Inference Systems
Machine Learning (CS)
Helps AI understand and create text faster.