Bench360: Benchmarking Local LLM Inference from 360 Degrees
By: Linus Stuhlmann, Mauricio Fadel Argerich, Jonathan Fürst
Potential Business Impact:
Helps find the fastest, most efficient, and most capable setup for running AI models on local hardware.
Running LLMs locally has become increasingly common, but users face a complex design space across models, quantization levels, inference engines, and serving scenarios. Existing inference benchmarks are fragmented and focus on isolated goals, offering little guidance for practical deployments. We present Bench360, a framework for evaluating local LLM inference across tasks, usage patterns, and system metrics in one place. Bench360 supports custom tasks, integrates multiple inference engines and quantization formats, and reports both task quality and system behavior (latency, throughput, energy, startup time). We demonstrate it on four NLP tasks across three GPUs and four engines, showing how design choices shape efficiency and output quality. Results confirm that tradeoffs are substantial and configuration choices depend on specific workloads and constraints. There is no universal best option, underscoring the need for comprehensive, deployment-oriented benchmarks.
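To give a flavor of the system metrics mentioned above (startup time, latency, throughput, energy), here is a minimal sketch of how such measurements can be taken for a local model. This is not the Bench360 API: the model name, prompt set, and NVML power-polling approach are illustrative assumptions, shown with Hugging Face transformers on a single GPU.

```python
"""Minimal sketch of measuring startup time, latency, throughput, and
energy for local LLM inference. Illustrative only, not Bench360 code."""

import time
import threading

import torch
import pynvml
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small model, for illustration
PROMPTS = ["Summarize: Local LLM inference involves many design choices."]

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def sample_power(stop, samples, interval=0.1):
    # Poll instantaneous GPU power draw (milliwatts) until stop is set.
    while not stop.is_set():
        samples.append(pynvml.nvmlDeviceGetPowerUsage(gpu))
        time.sleep(interval)

# Startup time: load tokenizer and model onto the GPU.
t0 = time.perf_counter()
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")
startup_s = time.perf_counter() - t0

# Latency, throughput, and (approximate) energy over the prompt set.
stop, samples = threading.Event(), []
threading.Thread(target=sample_power, args=(stop, samples), daemon=True).start()

total_new_tokens, t0 = 0, time.perf_counter()
for prompt in PROMPTS:
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=128)
    total_new_tokens += out.shape[-1] - inputs["input_ids"].shape[-1]
elapsed = time.perf_counter() - t0
stop.set()

avg_watts = (sum(samples) / len(samples) / 1000) if samples else float("nan")
print(f"startup: {startup_s:.1f} s | "
      f"latency: {elapsed / len(PROMPTS):.2f} s/request | "
      f"throughput: {total_new_tokens / elapsed:.1f} tok/s | "
      f"energy: {avg_watts * elapsed:.1f} J (approx.)")
```

A full benchmark like Bench360 would repeat this across models, quantization levels, inference engines, and serving scenarios, which is where the configuration tradeoffs described in the abstract emerge.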
Similar Papers
Bench360: Benchmarking Local LLM Inference from 360°
Computation and Language
Helps pick best computer settings for AI.
LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
Computation and Language
Tests computers' knowledge of small towns.