Score: 3

FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning

Published: December 14, 2025 | arXiv ID: 2512.12756v1

By: Yue Jiang, Dingkang Yang, Minghao Han and more

Potential Business Impact:

Tests AI on seeing, hearing, and talking.

Business Areas:

Simulation Software

Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration: they suffer from incomplete modality coverage, interaction restricted to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning. FysicsWorld encompasses 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources and covering a rich set of open-domain categories with diverse question types. We also propose the Cross-Modal Complementarity Screening (CMCS) strategy, integrated into a systematic data construction framework that produces omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. Through a comprehensive evaluation of over 30 state-of-the-art baselines, spanning MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models, FysicsWorld exposes the performance disparities and limitations across models in understanding, generation, and reasoning. Our benchmark establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.

Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems

Computation and Language

Tests AI on Chinese physics problems.

19 Sep 2025

89%

Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

CV and Pattern Recognition

Makes AI understand pictures and text better.

19 Nov 2025

88%

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

CV and Pattern Recognition

Tests if AI understands pictures, sound, and words equally.

16 Oct 2025

Country of Origin

🇨🇳 China

Repos / Data Links

github.com huggingface.co

Page Count

21 pages
