MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
By: Mykhailo Poliakov, Nadiya Shvai
Potential Business Impact:
Teaches computers to spot fake health news.
Health-related misinformation is widespread and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM. Our results show substantial accuracy gains for fine-tuned models over their vanilla baselines. For instance, the fine-tuned LLaMA 3.1 8B model achieved an absolute F1-score improvement of over 35% on the MISSCI test split compared to its vanilla baseline. We demonstrate that augmenting limited annotated resources with synthetic fallacy data can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available at https://github.com/mxpoliakov/MisSynth.
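The abstract's pipeline (retrieve scientific context, pair it with a claim and fallacy label to form a synthetic training sample) can be sketched roughly as follows. This is a minimal illustrative toy, not the authors' implementation: the retriever is a simple word-overlap ranker standing in for a real RAG retriever, and the claim, passages, and fallacy label are invented examples (a production pipeline would prompt an LLM to generate the synthetic samples and then fine-tune on them).

```python
# Hypothetical sketch of a MisSynth-style step: retrieval-augmented
# construction of a synthetic fallacy sample for fine-tuning.
# All names, passages, and labels below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class FallacySample:
    claim: str          # health-related claim to classify
    context: str        # retrieved publication passage
    fallacy_class: str  # fallacy label, e.g. "Hasty Generalization"


def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query.
    A real pipeline would use dense embeddings instead."""
    query_words = set(query.lower().split())

    def overlap(passage: str) -> int:
        return len(query_words & set(passage.lower().split()))

    return sorted(corpus, key=overlap, reverse=True)[:k]


def synthesize(claim: str, fallacy_class: str, corpus: list[str]) -> FallacySample:
    """Pair a claim with its best-matching context to form one training sample.
    In the actual pipeline an LLM would generate new fallacious claims here."""
    context = retrieve(claim, corpus, k=1)[0]
    return FallacySample(claim=claim, context=context, fallacy_class=fallacy_class)


corpus = [
    "A small in-vitro study observed reduced viral replication at high doses.",
    "The survey measured dietary habits of adults over five years.",
]
sample = synthesize(
    claim="The drug cures the virus because a study showed reduced viral replication.",
    fallacy_class="Hasty Generalization",
    corpus=corpus,
)
print(sample.context)
print(sample.fallacy_class)
```

The resulting `FallacySample` records would then be formatted into instruction-tuning pairs for lightweight fine-tuning (e.g. LoRA-style adapters) of the target model.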
Similar Papers
Enhancing Health Fact-Checking with LLM-Generated Synthetic Data
Artificial Intelligence
Makes online health advice more trustworthy.
Bias-Corrected Data Synthesis for Imbalanced Learning
Machine Learning (Stat)
Helps computers guess well when most examples belong to one class.
MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation
Computation and Language
Makes AI smarter for specific jobs.