Benchmarking LLM Agents for Wealth-Management Workflows
By: Rory Milsom
Potential Business Impact:
Lets AI assistants manage money tasks reliably.
Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAgentCompany with a finance-focused environment and investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically. It introduces synthetic domain data, enriches colleague simulations, and prototypes an automatic task-generation pipeline, with the aim of building an evaluation set that meaningfully measures an agent's fitness for assistant-level wealth-management work. The resulting benchmark comprises 12 task-pairs spanning retrieval, analysis, and synthesis/communication, each with explicit acceptance criteria and a deterministic grader; every task is seeded with new finance-specific data and appears in both a high- and a low-autonomy variant. The dissertation concludes that agents are limited less by mathematical reasoning than by end-to-end workflow reliability, that autonomy level meaningfully affects performance, and that flawed model evaluation has hindered benchmarking.
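To make the task format concrete, the sketch below shows how one such task-pair and its deterministic grader might be represented in a Python harness. This is a minimal illustration under stated assumptions: the WealthTask structure, its field names, and the example figures are hypothetical and do not come from the dissertation.

from dataclasses import dataclass
from typing import Callable

# Hypothetical representation of one benchmark task-pair. All names and
# values here are illustrative, not taken from the dissertation.
@dataclass
class WealthTask:
    task_id: str
    category: str                   # "retrieval", "analysis", or "synthesis"
    autonomy: str                   # "high" or "low" variant of the same task
    prompt: str
    acceptance_criteria: str
    grader: Callable[[str], bool]   # deterministic pass/fail check on agent output

def grade_portfolio_total(agent_output: str) -> bool:
    # Deterministic grader: pass only if the exact expected figure appears.
    return "$1,245,300.00" in agent_output

task = WealthTask(
    task_id="wm-retrieval-03",
    category="retrieval",
    autonomy="low",
    prompt="Report the total market value of the Hendricks family portfolio.",
    acceptance_criteria="Output states the portfolio total to the cent.",
    grader=grade_portfolio_total,
)

print(task.grader("The portfolio total is $1,245,300.00."))  # True

Because the grader is a pure pass/fail function of the agent's output, the same check can score both the high- and low-autonomy variants of a task, keeping the comparison deterministic.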
Similar Papers
Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks
Computational Engineering, Finance, and Science
Tests AI on real money problems, finds big gaps.
Are Generative AI Agents Effective Personalized Financial Advisors?
Artificial Intelligence
Helps AI give better money advice, but users trust it too much.
AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets
Computational Finance
Tests if AI can make smart money trades live.