Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments
By: Saeideh Yousefzadeh, Hamidreza Pourreza
Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather, and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes, and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings with a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison and, critically, produces human-readable intermediate representations that support diagnostic analysis and make the decision process more transparent. We validate the system on the Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results illustrate that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.
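To make the dual-similarity idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a scene graph is given as a dictionary of integer node ids to object labels plus a list of undirected edges, uses a plain label-aware shortest-path kernel, and stands in fixed placeholder vectors for the GAT embeddings. The weighted-sum fusion and its weight `alpha` are assumptions, since the abstract does not state how the two scores are combined; all function names are hypothetical.

```python
from collections import Counter, deque
from math import sqrt


def sp_features(labels, edges):
    """Count (label_a, label_b, shortest-path length) over connected node pairs."""
    adj = {n: set() for n in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    feats = Counter()
    for src in labels:
        # Breadth-first search gives shortest hop counts in an unweighted graph.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        for dst, d in dist.items():
            if src < dst:  # count each unordered pair once
                a, b = sorted((labels[src], labels[dst]))
                feats[(a, b, d)] += 1
    return feats


def sp_kernel(f1, f2):
    """Unnormalized shortest-path kernel: dot product of feature counts."""
    return sum(count * f2[key] for key, count in f1.items())


def sp_similarity(g1, g2):
    """Shortest-path kernel normalized to [0, 1]."""
    f1, f2 = sp_features(*g1), sp_features(*g2)
    denom = sqrt(sp_kernel(f1, f1) * sp_kernel(f2, f2))
    return sp_kernel(f1, f2) / denom if denom else 0.0


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dual_similarity(g1, e1, g2, e2, alpha=0.5):
    """Assumed fusion: weighted sum of embedding and structural similarity."""
    return alpha * cosine(e1, e2) + (1 - alpha) * sp_similarity(g1, g2)


# Toy place graphs: (node labels, undirected edges).
query = ({0: "building", 1: "tree", 2: "road"}, [(0, 1), (1, 2)])
other = ({0: "building", 1: "tree", 2: "car"}, [(0, 1), (1, 2)])
# Placeholder vectors standing in for learned GAT embeddings.
emb_q, emb_o = [1.0, 0.0], [1.0, 0.0]

same_score = dual_similarity(query, emb_q, query, emb_q)
diff_score = dual_similarity(query, emb_q, other, emb_o)
print(same_score, diff_score)
```

In this toy case the two embeddings are identical, so any difference between the scores comes from the structural term alone. That illustrates why a topology-aware component can separate places that a learned embedding confuses.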