Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning
By: Shaun Baek, Shaun Esua-Mensah, Cyrus Tsui, and more
Potential Business Impact:
Teaches computers to think logically and solve problems.
Large Language Models (LLMs) are primarily trained on high-resource natural languages, limiting their effectiveness in low-resource settings and in tasks requiring deep logical reasoning. This research introduces Rosetta-PL, a benchmark designed to evaluate LLMs' logical reasoning and generalization capabilities in a controlled environment. We construct Rosetta-PL by translating a dataset of logical propositions from Lean into a custom logical language, which we then use to fine-tune an LLM (e.g., GPT-4o). Our experiments analyze how dataset size and translation methodology affect model performance. Our results indicate that preserving logical relationships during translation significantly boosts precision, with accuracy plateauing beyond roughly 20,000 training samples. These insights provide valuable guidelines for optimizing LLM training on formal reasoning tasks and for improving performance in low-resource language applications.
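To make the translation step concrete, here is a minimal sketch of a structure-preserving mapping from standard propositional notation into a made-up target vocabulary. The symbol map, variable names, and the `translate` helper are all hypothetical illustrations of the general idea (replace each connective and atom with a new token while leaving the formula's shape untouched); they are not the actual Rosetta-PL vocabulary or the authors' pipeline, which the abstract does not specify.

```python
import re

# Hypothetical symbol maps; the real Rosetta-PL vocabulary is not given in the abstract.
SYMBOL_MAP = {
    "∧": "⊗",   # conjunction
    "∨": "⊕",   # disjunction
    "¬": "~",   # negation
    "→": "=>",  # implication
}
VARIABLE_MAP = {"p": "alpha", "q": "beta", "r": "gamma"}


def translate(formula: str) -> str:
    """Map a propositional formula to the custom language token by token.

    Because each token is substituted in place, the operator/operand
    structure -- and hence the logical relationships -- is preserved.
    """
    out = []
    for token in re.findall(r"[()]|[^\s()]+", formula):
        if token in SYMBOL_MAP:
            out.append(SYMBOL_MAP[token])
        elif token in VARIABLE_MAP:
            out.append(VARIABLE_MAP[token])
        else:
            out.append(token)  # parentheses and unmapped tokens pass through
    return " ".join(out)


if __name__ == "__main__":
    # "( p ∧ q ) → r" keeps its shape: "( alpha ⊗ beta ) => gamma"
    print(translate("( p ∧ q ) → r"))
```

A translation of this kind keeps the proposition's parse tree intact, which is the property the abstract credits with the reported precision gains; a translation that scrambled operator positions or merged tokens would not.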
Similar Papers
Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework
Machine Learning (CS)
Teaches computers to prove math problems correctly.
LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs
Artificial Intelligence
Tests how well computers can plan and think logically.
Reasoning Capabilities and Invariability of Large Language Models
Computation and Language
Tests if computers can think logically.