Score: 2

HLTCOE Evaluation Team at TREC 2025: VQA Track

Published: December 8, 2025 | arXiv ID: 2512.07738v1

By: Dengjia Zhang, Charles Weng, Katherine Guerrerio and more

BigTech Affiliations: Johns Hopkins University

Potential Business Impact:

Helps computers answer questions about videos more accurately.

Business Areas:
Image Recognition, Data and Analytics, Software

The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
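The abstract names three ingredients of the reranking objective: pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy. The sketch below is one plausible reading of how they could combine, not the authors' formulation. Everything in it is an assumption made for illustration: the function names (`masked_log_softmax`, `rank_weights`, `masked_pointer_ce`), the logarithmic weight decay, the rule that already-selected candidates are masked out at later steps, and the normalization by the sum of weights.

```python
import math


def masked_log_softmax(logits, mask):
    """Log-softmax over the positions where mask is True.

    Masked-out positions receive -inf, so they carry zero probability.
    """
    allowed = [x for x, keep in zip(logits, mask) if keep]
    if not allowed:
        raise ValueError("mask excludes every candidate")
    peak = max(allowed)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in allowed))
    return [x - log_z if keep else float("-inf") for x, keep in zip(logits, mask)]


def rank_weights(n):
    """Hypothetical rank-dependent weights: top ranks count more (assumed decay)."""
    return [1.0 / math.log2(t + 2) for t in range(n)]


def masked_pointer_ce(step_logits, gold_order, weights=None):
    """Rank-weighted cross-entropy for a pointer that selects one candidate per step.

    step_logits[t][i] is the pointer score of candidate i at rank step t.
    gold_order[t] is the index of the candidate that belongs at rank t.
    Candidates chosen at earlier steps are masked out (an assumption here).
    """
    if len(step_logits) != len(gold_order):
        raise ValueError("need one row of logits per gold rank")
    if weights is None:
        weights = rank_weights(len(gold_order))
    available = [True] * len(step_logits[0])
    total = 0.0
    for t, target in enumerate(gold_order):
        if not available[target]:
            raise ValueError("gold_order repeats a candidate")
        log_probs = masked_log_softmax(step_logits[t], available)
        total += weights[t] * -log_probs[target]
        available[target] = False  # the pointer may not select it again
    return total / sum(weights)


if __name__ == "__main__":
    # Three candidate answers; the gold ranking is candidate 0, then 1, then 2.
    logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
    print(masked_pointer_ce(logits, [0, 1, 2]))
```

Under these assumptions, the mask shrinks the set of selectable candidates by one at every step, so the final step contributes zero loss when all candidates are ranked, and the weights make mistakes near the top of the list cost more than mistakes near the bottom.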

Country of Origin
🇺🇸 United States

Page Count
7 pages

Category
Computer Science:
Computer Vision and Pattern Recognition