Zero-shot data citation function classification using transformer-based large language models (LLMs)
By: Neil Byers, Ali Zaidi, Valerie Skye, and more
Potential Business Impact:
Helps researchers understand how science papers use data.
Efforts have increased in recent years to identify associations between specific datasets and the scientific literature that incorporates them. Once we know that a given publication cites a given dataset, the next logical step is to explore how or why that data was used. Recent advances in pretrained, transformer-based large language models (LLMs) offer a potential means of scaling the description of data use cases in the published literature, avoiding both expensive manual labeling and the development of training datasets for classical machine-learning (ML) systems. In this work we apply an open-source LLM, Llama 3.1-405B, to generate structured data use case labels for publications known to incorporate specific genomic datasets. We also introduce a novel evaluation framework for determining the efficacy of our methods. Our results demonstrate that the stock model can achieve an F1 score of 0.674 on a zero-shot data citation classification task with no previously defined categories. While promising, our results are qualified by barriers related to data availability, prompt overfitting, computational infrastructure, and the expense required to conduct responsible performance evaluation.
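As a rough illustration of the kind of zero-shot labeling the abstract describes, the sketch below sends a publication excerpt to an OpenAI-compatible endpoint serving Llama 3.1-405B and asks for a structured JSON label. The endpoint URL, prompt wording, label schema, and example excerpt are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of zero-shot data citation classification with an LLM.
# Assumptions (not from the paper): an OpenAI-compatible endpoint serving
# Llama 3.1-405B, an illustrative prompt, and a made-up JSON label schema.
import json
from openai import OpenAI

# Hypothetical local inference endpoint; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT_TEMPLATE = """You are labeling how a publication uses a cited genomic dataset.
Read the excerpt below and return only a JSON object with two fields:
  "use_case": a short free-text label describing how the dataset was used
  "evidence": the sentence(s) supporting that label

Excerpt:
{excerpt}
"""

def classify_data_use(excerpt: str) -> dict:
    """Ask the model for a structured, zero-shot data use case label."""
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-405B-Instruct",  # model name as served; may differ per deployment
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(excerpt=excerpt)}],
        temperature=0.0,  # deterministic output for repeatable evaluation
    )
    # The model is asked for JSON only; real pipelines would validate this output.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    label = classify_data_use(
        "We reanalyzed the publicly available assemblies from the cited genome "
        "dataset to benchmark our new binning workflow."
    )
    print(label)
```

Because the task has no previously defined categories, the free-text labels produced this way would still need to be matched against reference annotations (as in the paper's evaluation framework) before an F1 score like the reported 0.674 could be computed.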
Similar Papers
Document Attribution: Examining Citation Relationships using Large Language Models
Information Retrieval
Checks if AI answers come from the right documents.
Zero-Shot Document-Level Biomedical Relation Extraction via Scenario-based Prompt Design in Two-Stage with LLM
Neural and Evolutionary Computing
Helps computers find health facts without human work.
Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky
Computation and Language
Finds hidden car crash causes in police reports.