Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback
By: Tom S. Juzek, Zina B. Ward
Potential Business Impact:
Explains why AI writing gets wordy and repetitive, a first step toward fixing it.
Large Language Models (LLMs) are known to overuse certain terms like "delve" and "intricate." The exact reasons for these lexical choices, however, have been unclear. Using Meta's Llama model, this study investigates the contribution of Learning from Human Feedback (LHF), under which we subsume Reinforcement Learning from Human Feedback and Direct Preference Optimization. We present a straightforward procedure for detecting the lexical preferences of LLMs that are potentially LHF-induced. Next, we more conclusively link LHF to lexical overuse by experimentally emulating the LHF procedure and demonstrating that participants systematically prefer text variants that include certain words. This lexical overuse can be seen as a sort of misalignment, though our study highlights the potential divergence between the lexical expectations of different populations -- namely LHF workers versus LLM users. Our work contributes to the growing body of research on explainable artificial intelligence and emphasizes the importance of both data and procedural transparency in alignment research.
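The abstract does not spell out the detection procedure, but one straightforward way to surface potentially LHF-induced lexical preferences is to compare word frequencies in model output against a human-written baseline corpus. The sketch below is an illustrative assumption in that spirit, not the authors' actual method; the function names, toy corpora, and the log-ratio scoring are our own.

from collections import Counter
import math
import re

def word_frequencies(text):
    """Lowercase, tokenize on word characters, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens), len(tokens)

def overuse_scores(llm_text, human_text, min_count=5):
    """Score each word by the log-ratio of its relative frequency in
    LLM-generated text versus a human-written baseline.
    Higher scores indicate words the model overuses (e.g. 'delve')."""
    llm_counts, llm_total = word_frequencies(llm_text)
    human_counts, human_total = word_frequencies(human_text)
    scores = {}
    for word, count in llm_counts.items():
        if count < min_count:
            continue
        llm_rate = count / llm_total
        # Add-one smoothing so words absent from the baseline still get a score.
        human_rate = (human_counts.get(word, 0) + 1) / (human_total + 1)
        scores[word] = math.log(llm_rate / human_rate)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Toy strings stand in for large samples of model output and human text.
    llm_sample = "Let us delve into the intricate details. We delve deeper."
    human_sample = "Here are the details. We look more closely at the data."
    for word, score in overuse_scores(llm_sample, human_sample, min_count=1)[:10]:
        print(f"{word}\t{score:.2f}")

In practice the same comparison would be run over large corpora of model output and matched human text, with the top-scoring words serving as candidates for LHF-induced overuse.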
Similar Papers
Aligning to What? Limits to RLHF Based Alignment
Computation and Language
Shows RLHF can't fully fix AI bias yet.
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Machine Learning (Stat)
Helps AI better understand what people want.
RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders
Machine Learning (CS)
Makes online suggestions better by learning from your actions.