BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs
By: Nourah M. Salem, Elizabeth White, Michael Bada, and more
Potential Business Impact:
Helps computers understand medical writing better.
Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess LLM performance with four prompting experiments that vary in their use of local and contextual enrichment and of domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mention ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.
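To make the entity-augmented prompting idea concrete, below is a minimal Python sketch of how such a prompt might be assembled: abbreviation expansions and entity-dictionary hits are prefixed to the passage before the model is asked to cluster coreferring mentions. The template, function name, and example data are illustrative assumptions, not the authors' actual prompts.

# Hypothetical sketch of entity-augmented prompting for coreference.
# The prompt template and example data are assumptions for illustration.

def build_entity_augmented_prompt(passage: str,
                                  abbreviations: dict[str, str],
                                  entity_dictionary: dict[str, str]) -> str:
    """Assemble a coreference prompt enriched with domain-grounding cues."""
    # Render abbreviation expansions, one per line.
    abbrev_lines = "\n".join(
        f"- {short} = {long}" for short, long in abbreviations.items()
    )
    # Render dictionary-matched entities with their types, one per line.
    entity_lines = "\n".join(
        f"- {mention}: {etype}" for mention, etype in entity_dictionary.items()
    )
    return (
        "Task: Resolve coreference in the biomedical passage below.\n"
        "Group all mentions that refer to the same entity into clusters,\n"
        "and output one cluster per line as a list of mention spans.\n\n"
        f"Known abbreviations:\n{abbrev_lines}\n\n"
        f"Known entities (from a domain dictionary):\n{entity_lines}\n\n"
        f"Passage:\n{passage}\n\n"
        "Clusters:"
    )

if __name__ == "__main__":
    passage = ("BDNF promotes neuronal survival. The neurotrophin also "
               "modulates synaptic plasticity, and its receptor TrkB "
               "mediates these effects.")
    prompt = build_entity_augmented_prompt(
        passage,
        abbreviations={"BDNF": "brain-derived neurotrophic factor",
                       "TrkB": "tropomyosin receptor kinase B"},
        entity_dictionary={"BDNF": "protein", "TrkB": "protein"},
    )
    print(prompt)

A resolver would then send this prompt to the LLM and parse the returned cluster lines; the point of the sketch is only that the domain cues are cheap to inject as plain text, which is what makes this kind of prompt engineering lightweight.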
Similar Papers
Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs
Computation and Language
Computers can't always tell who "he" or "she" is.
Findings of the Fourth Shared Task on Multilingual Coreference Resolution: Can LLMs Dethrone Traditional Approaches?
Computation and Language
Helps computers understand who or what is being talked about.
CorefInst: Leveraging LLMs for Multilingual Coreference Resolution
Computation and Language
Helps computers understand who "he" or "she" is.