Prompt perturbation and fraction facilitation sometimes strengthen Large Language Model scores
By: Mike Thelwall
Potential Business Impact:
Helps computers judge research quality better.
Large Language Models (LLMs) can be tasked with scoring texts according to pre-defined criteria and on a defined scale, but there is no recognised optimal prompting strategy for this. This article focuses on the task of LLMs scoring journal articles for research quality on a four-point scale, testing how user prompt design can enhance this ability. Based primarily on 1.7 million Gemma3 27b queries for 2780 health and life science articles with 58 similar prompts, the results show that improvements can be obtained by (a) testing semantically equivalent prompt variations, (b) averaging scores from semantically equivalent prompts, (c) specifying that fractional scores are allowed, and possibly also (d) not drawing attention to the input being partial. Whilst (a) and (d) suggest that models can be sensitive to how a task is phrased, (b) and (c) suggest that strategies to leverage more of the model's knowledge are helpful, such as by perturbing prompts and facilitating fractions. Perhaps counterintuitively, encouraging incorrect answers (fractions for this task) releases useful information about the model's certainty about its answers. Mixing semantically equivalent prompts also reduces the chance of getting no score for an input. Additional testing showed, however, that the best prompts vary between LLMs: they were almost the opposite for ChatGPT 4o-mini, only weakly aligned for Llama4 Scout and Magistral, and made little difference for Qwen3 32b and DeepSeek R1 32b. Overall, whilst there is no single best prompt, a good strategy for all models was to average the scores from a range of different semantically equivalent or similar prompts.
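To make the averaging strategy concrete, the sketch below shows one possible way to query a model with several semantically equivalent prompt variants, allow fractional scores, and average whatever scores can be parsed. This is a minimal illustration rather than the article's actual code: the prompt wordings, the query_llm callable, and the score-parsing regex are all assumptions introduced here for demonstration.

```python
# Minimal sketch (not the paper's code) of the prompt-averaging strategy described above:
# send several semantically equivalent prompts, parse a numeric score from each reply,
# and average the parsed scores. Prompt wordings and query_llm() are illustrative assumptions.
import re
import statistics
from typing import Callable, Optional

# Hypothetical paraphrases of a research-quality scoring prompt
# (1-4 scale, fractional scores explicitly allowed).
PROMPT_VARIANTS = [
    "Score the research quality of the following article on a scale of 1-4. "
    "Fractional scores such as 2.5 are allowed. Reply with the score only.\n\n{text}",
    "Rate this article's research quality from 1 to 4 (fractions permitted). "
    "Answer with a single number.\n\n{text}",
    "On a 1-4 quality scale, where fractional values are acceptable, "
    "what score does this article deserve? Give only the number.\n\n{text}",
]

SCORE_PATTERN = re.compile(r"\b([1-4](?:\.\d+)?)\b")


def parse_score(reply: str) -> Optional[float]:
    """Extract the first number between 1 and 4 from the model's reply, if any."""
    match = SCORE_PATTERN.search(reply)
    return float(match.group(1)) if match else None


def averaged_score(article_text: str, query_llm: Callable[[str], str]) -> Optional[float]:
    """Query the model once per prompt variant and average the parsed scores.

    Averaging over variants also reduces the chance of getting no score at all,
    since one unparseable reply does not discard the whole article.
    """
    scores = []
    for template in PROMPT_VARIANTS:
        reply = query_llm(template.format(text=article_text))
        score = parse_score(reply)
        if score is not None:
            scores.append(score)
    return statistics.mean(scores) if scores else None
```

In practice, query_llm would wrap whichever chat-completion client is in use, and the variant list could be expanded towards the scale of the 58 similar prompts tested in the article.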
Similar Papers
Does the Prompt-based Large Language Model Recognize Students' Demographics and Introduce Bias in Essay Scoring?
Computation and Language
AI writing grader unfairly scores non-native speakers.
Beyond Correctness: Evaluating and Improving LLM Feedback in Statistical Education
Other Statistics
Helps teachers give better feedback to students.
The Future of MLLM Prompting is Adaptive: A Comprehensive Experimental Evaluation of Prompt Engineering Methods for Robust Multimodal Performance
Artificial Intelligence
Teaches AI to understand pictures and words better.