TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses
By: Muhammad Taha Cheema, Abeer Aamir, Khawaja Gul Muhammad, and more
Potential Business Impact:
Makes chatbots answer faster and cheaper.
Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries with this approach is difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search. To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. Through comprehensive evaluation, including user studies with side-by-side comparisons and satisfaction voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. Our results across real-world datasets highlight TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience.
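To make the routing idea concrete, below is a minimal, illustrative sketch of a cache-and-tweak router in the spirit the abstract describes; it is not the authors' implementation. The functions embed, call_lightweight_llm, and call_frontier_llm are hypothetical stand-ins for an embedding model, a small local LLM, and a frontier LLM API, and the similarity threshold is an assumed value rather than one reported in the paper.

```python
# Illustrative sketch of a TweakLLM-style cache-and-tweak router (not the paper's code).
from dataclasses import dataclass, field
import numpy as np

SIMILARITY_THRESHOLD = 0.85  # assumed cutoff; a real system would tune this empirically


def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; replace with a real sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)


def call_lightweight_llm(prompt: str, cached_response: str) -> str:
    """Hypothetical lightweight LLM that tailors a cached response to the new prompt."""
    return f"[tweaked for '{prompt}'] {cached_response}"


def call_frontier_llm(prompt: str) -> str:
    """Hypothetical frontier LLM call, used only on cache misses."""
    return f"[frontier answer to '{prompt}']"


@dataclass
class TweakLLMRouter:
    # Each cache entry stores (prompt embedding, original prompt, cached response).
    cache: list = field(default_factory=list)

    def answer(self, prompt: str) -> str:
        q = embed(prompt)
        # Semantic similarity search over cached prompts (cosine similarity on unit vectors).
        best_sim, best_entry = -1.0, None
        for emb, cached_prompt, cached_resp in self.cache:
            sim = float(np.dot(q, emb))
            if sim > best_sim:
                best_sim, best_entry = sim, (cached_prompt, cached_resp)
        if best_entry is not None and best_sim >= SIMILARITY_THRESHOLD:
            # Cache hit: the lightweight LLM adapts the cached response to this prompt.
            return call_lightweight_llm(prompt, best_entry[1])
        # Cache miss: fall back to the frontier model and store the result for future hits.
        response = call_frontier_llm(prompt)
        self.cache.append((q, prompt, response))
        return response
```

In this sketch, the expensive frontier model is only invoked on cache misses, while near-duplicate prompts are served by a cheap tweak of an existing answer, which is the cost and latency trade-off the abstract targets.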
Similar Papers
MixLLM: Dynamic Routing in Mixed Large Language Models
Computation and Language
Smartly picks best AI for faster, cheaper answers.
Leveraging Uncertainty Estimation for Efficient LLM Routing
Networking and Internet Architecture
Makes AI give better answers for less money.