HLTCOE Evaluation Team at TREC 2025: VQA Track
By: Dengjia Zhang, Charles Weng, Katherine Guerrerio and more
Potential Business Impact:
Makes computers answer questions about videos better.
The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
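The abstract names the three ingredients of the reranking objective but does not give its formula, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the reranker emits one row of pointer logits per ranking step, that "vocabulary restriction" means the softmax runs only over candidate slots that are still valid (not padded and not already selected), and that the rank weights follow a DCG-style discount of 1 / log2(t + 2). The function name, argument names, and the weighting scheme are all hypothetical.

```python
import numpy as np


def masked_pointer_ce(logits, target_order, valid, rank_weights=None):
    """Rank-weighted pointer cross-entropy over a candidate list (illustrative).

    logits:       (steps, num_candidates) pointer scores, one row per rank step.
    target_order: candidate indices in the desired ranking, best first.
    valid:        boolean mask, False for padded candidate slots.
    rank_weights: per-step weights; defaults to 1 / log2(t + 2) (an assumption).
    """
    logits = np.asarray(logits, dtype=float)
    available = np.asarray(valid, dtype=bool).copy()
    steps = len(target_order)
    if rank_weights is None:
        rank_weights = 1.0 / np.log2(np.arange(steps) + 2.0)
    rank_weights = np.asarray(rank_weights, dtype=float)

    total = 0.0
    for t, target in enumerate(target_order):
        if not available[target]:
            raise ValueError(f"target {target} is masked or already selected")
        # Restrict the softmax to candidates that are still available.
        masked = np.where(available, logits[t], -np.inf)
        peak = masked.max()
        log_z = peak + np.log(np.exp(masked - peak).sum())
        # Cross-entropy of pointing at the correct candidate for this rank.
        total += rank_weights[t] * (log_z - masked[target])
        # A selected candidate cannot be pointed at again.
        available[target] = False
    return total / rank_weights[:steps].sum()


# Usage: three candidates, desired order 2 > 0 > 1, no padding.
example_logits = np.array([[0.1, -1.0, 2.0],
                           [1.5, 0.2, 0.0],
                           [0.0, 0.3, 0.0]])
example_loss = masked_pointer_ce(example_logits, [2, 0, 1], [True, True, True])
print(f"loss = {example_loss:.4f}")
```

Under this reading, masking already-selected candidates keeps each step a proper distribution over the remaining answers, and the decaying weights make mistakes at the top of the list cost more than mistakes near the bottom.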
Similar Papers
Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training
Computation and Language
Makes AI understand questions better, not just guess.
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
CV and Pattern Recognition
Helps AI answer hard questions using extra facts.
Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
CV and Pattern Recognition
Helps computers see plants better, not guess.