The PLLuM Instruction Corpus
By: Piotr Pęzik, Filip Żarnecki, Konrad Kaczyński and others
Potential Business Impact:
Teaches computers to understand and write Polish.
This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share observations on the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM Instruction Corpus (PLLuMIC), which we believe will be useful in guiding and planning the development of similar datasets for other LLMs.
Similar Papers
PLLuM: A Family of Polish Large Language Models
Computation and Language
Makes computers understand and speak Polish better.
Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque
Computation and Language
Teaches computers new languages with less data.
InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
Machine Learning (CS)
Helps computers talk in rare languages.