Bench360: Benchmarking Local LLM Inference from 360 Degrees
By: Linus Stuhlmann, Mauricio Fadel Argerich, Jonathan Fürst
Potential Business Impact:
Helps find the fastest, most efficient, and most capable setup for running AI models on local hardware.
Running LLMs locally has become increasingly common, but users face a complex design space across models, quantization levels, inference engines, and serving scenarios. Existing inference benchmarks are fragmented and focus on isolated goals, offering little guidance for practical deployments. We present Bench360, a framework for evaluating local LLM inference across tasks, usage patterns, and system metrics in one place. Bench360 supports custom tasks, integrates multiple inference engines and quantization formats, and reports both task quality and system behavior (latency, throughput, energy, startup time). We demonstrate it on four NLP tasks across three GPUs and four engines, showing how design choices shape efficiency and output quality. Results confirm that tradeoffs are substantial and configuration choices depend on specific workloads and constraints. There is no universal best option, underscoring the need for comprehensive, deployment-oriented benchmarks.
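To give a flavor of the system metrics mentioned above (startup time, latency, throughput, energy), here is a minimal sketch of how such measurements can be taken for a local model. This is not the Bench360 API: the model name, prompt set, and NVML power-polling approach are illustrative assumptions, shown with Hugging Face transformers on a single GPU.

```python
"""Minimal sketch of measuring startup time, latency, throughput, and
energy for local LLM inference. Illustrative only, not Bench360 code."""

import time
import threading

import torch
import pynvml
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small model, for illustration
PROMPTS = ["Summarize: Local LLM inference involves many design choices."]

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def sample_power(stop, samples, interval=0.1):
    # Poll instantaneous GPU power draw (milliwatts) until stop is set.
    while not stop.is_set():
        samples.append(pynvml.nvmlDeviceGetPowerUsage(gpu))
        time.sleep(interval)

# Startup time: load tokenizer and model onto the GPU.
t0 = time.perf_counter()
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")
startup_s = time.perf_counter() - t0

# Latency, throughput, and (approximate) energy over the prompt set.
stop, samples = threading.Event(), []
threading.Thread(target=sample_power, args=(stop, samples), daemon=True).start()

total_new_tokens, t0 = 0, time.perf_counter()
for prompt in PROMPTS:
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=128)
    total_new_tokens += out.shape[-1] - inputs["input_ids"].shape[-1]
elapsed = time.perf_counter() - t0
stop.set()

avg_watts = (sum(samples) / len(samples) / 1000) if samples else float("nan")
print(f"startup: {startup_s:.1f} s | "
      f"latency: {elapsed / len(PROMPTS):.2f} s/request | "
      f"throughput: {total_new_tokens / elapsed:.1f} tok/s | "
      f"energy: {avg_watts * elapsed:.1f} J (approx.)")
```

A full benchmark like Bench360 would repeat this across models, quantization levels, inference engines, and serving scenarios, which is where the configuration tradeoffs described in the abstract emerge.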
Similar Papers
Bench360: Benchmarking Local LLM Inference from 360°
Computation and Language
Helps pick best computer settings for AI.
LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
Computation and Language
Tests computers' knowledge of small towns.