Evaluation Guidelines for Empirical Studies in Software Engineering involving LLMs
By: Sebastian Baltes, Florian Angermeir, Chetan Arora, and others
Potential Business Impact:
Makes AI-assisted software engineering research easier to verify and reproduce.
Large language models (LLMs) are increasingly being integrated into software engineering (SE) research and practice, yet their non-determinism, opaque training data, and evolving architectures complicate the reproduction and replication of empirical studies. We present a community effort to scope this space, introducing a taxonomy of LLM-based study types together with eight guidelines for designing and reporting empirical studies involving LLMs. The guidelines present essential (must) criteria as well as desired (should) criteria and target transparency throughout the research process. Our recommendations, contextualized by our study types, are: (1) to declare LLM usage and role; (2) to report model versions, configurations, and fine-tuning; (3) to document tool architectures; (4) to disclose prompts and interaction logs; (5) to use human validation; (6) to employ an open LLM as a baseline; (7) to report suitable baselines, benchmarks, and metrics; and (8) to openly articulate limitations and mitigations. Our goal is to enable reproducibility and replicability despite LLM-specific barriers to open science. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org).
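To make guidelines (2) and (4) concrete, below is a minimal sketch of how a study might pin an exact model version and configuration and log every prompt-response pair for later disclosure. It assumes the official OpenAI Python client (>= 1.0); the model name, temperature, file name, and prompt are illustrative assumptions, not part of the guidelines themselves, and other providers or local models would be logged analogously.

```python
import json
import datetime
from openai import OpenAI  # assumes the official OpenAI Python client (>= 1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Fixed, fully reported configuration (guideline 2): pin an exact model
# version rather than a moving alias, and fix sampling parameters.
MODEL = "gpt-4o-2024-08-06"   # illustrative version string
TEMPERATURE = 0.0             # reduces (but does not eliminate) non-determinism


def query_and_log(prompt: str, log_path: str = "interaction_log.jsonl") -> str:
    """Send one prompt and append the full interaction to a JSONL log (guideline 4)."""
    response = client.chat.completions.create(
        model=MODEL,
        temperature=TEMPERATURE,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content

    # Record everything needed to disclose prompts and interaction logs.
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": MODEL,
        "temperature": TEMPERATURE,
        "prompt": prompt,
        "response": answer,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return answer


if __name__ == "__main__":
    print(query_and_log("Classify this commit message as 'bug fix' or 'feature': ..."))
```

Archiving the resulting JSONL file alongside the study artifacts is one way to support the reproducibility and replicability goals the guidelines describe, even when the underlying model itself cannot be shared.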
Similar Papers
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
Software Engineering
Makes empirical software engineering studies easier to verify and replicate.
Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies
Software Engineering
Makes commercial LLM results in software engineering studies easier to verify.