
Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval

Published: October 23, 2025 | arXiv ID: 2510.20393v1

By: Qing Wang, Chong-Wah Ngo, Yu Cao and more

Potential Business Impact:

Finds recipes even when pictures hide details.

Business Areas:

Recipes, Food and Beverage

Existing approaches to image-to-recipe retrieval implicitly assume that a food image fully captures the details documented in its recipe. However, a food image reflects only the visual outcome of a cooked dish, not the underlying cooking process. Consequently, cross-modal representations learned to bridge the modality gap between images and recipes tend to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased toward the dominant visual elements, which makes it difficult to rank similar recipes that differ only subtly in their ingredients and cooking methods. This bias is expected to be more severe when the training data mixes images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images and explicitly injects these elements into cross-modal representation learning to mitigate the bias. Experiments are conducted on the standard monolingual Recipe1M dataset and on a newly curated multilingual, multicultural cuisine dataset. The results indicate that the proposed causal representation learning can uncover subtle ingredients and cooking actions, and that it achieves impressive retrieval performance on both the monolingual and the multilingual multicultural datasets.
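The abstract does not spell out the causal model, so the following is only a toy sketch of the bias it describes, not the authors' method. It assumes hand-made 4-dimensional embeddings (the first two dimensions stand for dominant visual elements, the last two for subtle ingredients), a plain cosine-similarity ranker, and a hypothetical `inject` helper that adds a predicted "overlooked element" vector to the image embedding. All names and numbers here are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_recipes(image_vec, recipe_vecs):
    """Return recipe ids sorted by descending similarity to the image."""
    return sorted(
        recipe_vecs,
        key=lambda rid: cosine(image_vec, recipe_vecs[rid]),
        reverse=True,
    )


def inject(image_vec, element_vec, weight=1.0):
    """Hypothetical injection step: add a predicted element vector
    to the image embedding (an assumption, not the paper's formulation)."""
    return [x + weight * e for x, e in zip(image_vec, element_vec)]


# Toy recipe embeddings: dims 0-1 = dominant visual elements,
# dim 2 = soy sauce, dim 3 = fish sauce (both invisible in the photo).
recipes = {
    "fried_rice_soy": [1.0, 1.0, 1.0, 0.0],
    "fried_rice_fish_sauce": [1.0, 1.0, 0.0, 1.0],
}

# Image of the fish-sauce dish: the embedding captures mostly the
# dominant visual elements, so the subtle dimensions are nearly empty.
image = [1.0, 1.0, 0.1, 0.0]

# Assumed output of an element predictor: "fish sauce was likely used".
predicted_element = [0.0, 0.0, 0.0, 1.0]

biased_ranking = rank_recipes(image, recipes)
debiased_ranking = rank_recipes(inject(image, predicted_element), recipes)

print(biased_ranking[0])    # fried_rice_soy
print(debiased_ranking[0])  # fried_rice_fish_sauce
```

In this toy setup the two recipes are indistinguishable on the visual dimensions, so the image alone ranks the wrong recipe first; adding the predicted element is enough to flip the ranking. A real system would learn both the embeddings and the element predictor rather than hard-coding them.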

Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval

CV and Pattern Recognition

Find recipes from food pictures better.

19 Nov 2025 1

88%

LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets

CV and Pattern Recognition

Lets phones know what food you're eating.

20 Nov 2025 0

88%

Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes

Artificial Intelligence

Helps computers understand pictures and words equally.

26 Oct 2025 1


Country of Origin

🇸🇬 Singapore

Repos / Data Links

github.com

Page Count

18 pages
