What you reward is what you learn: Comparing rewards for online speech policy optimization in public HRI

Published: January 5, 2026 | arXiv ID: 2601.01969v1

By: Sichao Song, Yuki Okafuji, Kaito Ariu, and more

Potential Business Impact:

Robot learns to talk better with people.

Business Areas:
Robotics Hardware, Science and Engineering, Software

Designing policies that are both efficient and acceptable for conversational service robots in open and diverse environments is non-trivial. Unlike fixed, hand-tuned parameters, online learning can adapt to non-stationary conditions. In this paper, we study how to adapt a social robot's speech policy in the wild. During a 12-day in-situ deployment with over 1,400 public encounters, we cast online policy optimization as a multi-armed bandit problem and use Thompson sampling to select among six actions defined by speech rate (slow/normal/fast) and verbosity (concise/detailed). We compare three complementary binary rewards, Ru (user rating), Rc (conversation closure), and Rt (at least two turns), and show that each induces distinct arm distributions and interaction behaviors. We complement the online results with offline evaluations that analyze contextual factors (e.g., crowd level, group size) using video-annotated data. Taken together, we distill ready-to-use design lessons for deploying online optimization of speech policies in real public HRI settings.
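To make the bandit formulation concrete, below is a minimal sketch of Beta-Bernoulli Thompson sampling over the six speech-policy arms described in the abstract (three speech rates crossed with two verbosity levels). This is not the authors' code: the arm names, the uniform Beta(1, 1) priors, and the placeholder reward function are assumptions for illustration; in the deployed system the binary reward would come from one of the paper's signals (Ru, Rc, or Rt).

```python
import random
from itertools import product

# Hypothetical arm space mirroring the paper's description:
# 3 speech rates x 2 verbosity levels = 6 actions.
SPEECH_RATES = ["slow", "normal", "fast"]
VERBOSITY = ["concise", "detailed"]
ARMS = list(product(SPEECH_RATES, VERBOSITY))

# Assumed Beta(1, 1) priors per arm; alpha counts successes, beta counts failures.
alpha = {arm: 1.0 for arm in ARMS}
beta = {arm: 1.0 for arm in ARMS}

def select_arm():
    """Thompson sampling: draw from each arm's Beta posterior and pick the max."""
    samples = {arm: random.betavariate(alpha[arm], beta[arm]) for arm in ARMS}
    return max(samples, key=samples.get)

def update(arm, reward):
    """Update the chosen arm's posterior with a binary reward (0 or 1)."""
    if reward:
        alpha[arm] += 1.0
    else:
        beta[arm] += 1.0

def observe_binary_reward(arm):
    # Placeholder environment; in practice this would be the robot's observed
    # signal, e.g. a user rating, conversation closure, or reaching >= 2 turns.
    return random.random() < 0.5

# Example loop over roughly the number of encounters reported in the paper.
for encounter in range(1400):
    arm = select_arm()
    update(arm, observe_binary_reward(arm))
```

Because the rewards are binary, each arm's posterior stays a Beta distribution, so the update is just a success/failure count; swapping which reward signal feeds `update` is what produces the different arm distributions the paper compares.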

Country of Origin
🇺🇸 United States

Page Count
12 pages

Category
Computer Science:
Robotics