The PLLuM Instruction Corpus

Published: November 21, 2025 | arXiv ID: 2511.17161v1

By: Piotr Pęzik, Filip Żarnecki, Konrad Kaczyński and more

Potential Business Impact:

Teaches computers to understand and write Polish.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.

Country of Origin
🇵🇱 Poland


Page Count
35 pages

Category
Computer Science: Computation and Language