Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
By: Wanghan Xu, Yuhao Zhou, Yifan Zhou, and more
Potential Business Impact:
AI learns to do science like a human.
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational definition of SGI grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal clear gaps: low exact-match rates (10–20%) in deep research despite step-level alignment; generated ideas lacking feasibility and detail; high code executability but low execution-result accuracy in dry experiments; low sequence fidelity in wet-lab protocols; and persistent challenges in multimodal comparative reasoning. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, improving hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical findings lay a foundation for AI systems that genuinely participate in scientific discovery.
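The abstract leaves the mechanics of the retrieval-augmented novelty reward implicit. The sketch below shows one plausible reading: candidate hypotheses are scored by their embedding distance from retrieved prior work, and the highest-novelty candidate is kept. This is a minimal assumption-laden illustration, not the paper's implementation; the toy `embed` function and all names are hypothetical, and a full TTRL loop would use these rewards to update the policy at inference rather than merely select among samples.

```python
# Hypothetical sketch of a retrieval-augmented novelty reward at test time.
# Not the paper's implementation: embed() is a toy deterministic encoder
# standing in for a real embedding model, and the corpus stands in for
# retrieved prior work.
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding (stand-in for a real encoder)."""
    seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def novelty_reward(hypothesis: str, corpus: list[str]) -> float:
    """Reward = 1 - max cosine similarity to retrieved prior work,
    so hypotheses far from the corpus score higher."""
    h = embed(hypothesis)
    sims = [float(h @ embed(doc)) for doc in corpus]
    return 1.0 - max(sims)

def test_time_select(candidates: list[str], corpus: list[str]) -> str:
    """Best-of-N selection under the novelty reward; an actual TTRL
    loop would instead update the policy with these rewards."""
    rewards = [novelty_reward(c, corpus) for c in candidates]
    return candidates[int(np.argmax(rewards))]

if __name__ == "__main__":
    prior = ["LLMs for protein folding", "Benchmarking code generation"]
    cands = ["LLMs for protein folding",
             "Coral reef acoustics as a climate proxy"]
    print(test_time_select(cands, prior))  # prefers the unfamiliar hypothesis
```

Note that no reference answer appears anywhere in the scoring: the reward depends only on the candidate and the retrieved corpus, which is what lets the method operate purely at inference time.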
Similar Papers
Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
CV and Pattern Recognition
Tests if AI can think like people.
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Artificial Intelligence
Tests AI's ability to do science like a human.
Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches
Artificial Intelligence
Teaches AI to think and learn like humans.