Researchers from the Indian Institute of Technology (IIT) Delhi and Friedrich Schiller University Jena (FSU Jena), Germany, have found that while leading artificial intelligence (AI) models excel at basic tasks, they lack scientific reasoning. The study, published in Nature Computational Science, revealed that current AI models show promise in simple scientific tasks but have fundamental limitations that could make them risky to use in research settings without human oversight.
The research team, led by NM Anoop Krishnan, an associate professor at IIT Delhi, and Kevin Maik Jablonka, a professor at FSU Jena, developed "MaCBench", the first comprehensive benchmark specifically designed to evaluate how vision-language models handle real-world chemistry and materials science tasks.
The results revealed a notable contradiction: while AI models achieved nearly perfect scores on basic tasks such as equipment identification, they encountered difficulties with more complex tasks essential for true scientific discovery. These complex tasks include spatial reasoning, multi-step logical inference, and the synthesis of information across different modalities.
"Our findings represent a crucial reality check for the scientific community. While these AI systems show remarkable capabilities in routine data processing tasks, they are not yet ready for autonomous scientific reasoning," Krishnan told PTI. "The strong correlation we observed between model performance and internet data availability suggests these systems may be relying more on pattern matching than genuine scientific understanding".
Safety and reasoning gaps
One of the most concerning findings was revealed during laboratory safety assessments, Krishnan explained.
"While models excelled at identifying laboratory equipment with 77per cent accuracy, they performed poorly when evaluating safety hazards in similar laboratory setups, achieving only 46per cent accuracy," said Jablonka. "This disparity between equipment recognition and safety reasoning is particularly alarming".
Jablonka noted that this suggests current AI models cannot fill the gaps in the tacit knowledge that is critical for safe laboratory operations. "Scientists must understand these limitations before integrating AI into safety-critical research environments," he added.
The research team's innovative approach included extensive ablation studies that isolated specific failure modes. They found that the models performed markedly better when the same information was presented as text rather than as images, pointing to incomplete multimodal integration, a basic requirement for scientific work.
Implications for the future of AI in science
The study's findings extend beyond chemistry and materials science, suggesting broader challenges for deploying AI across all scientific disciplines. They also indicate that developing reliable AI assistants will require fundamental advances in training methods that prioritise genuine understanding over pattern matching.
"Our work provides a roadmap for both the capabilities and limitations of current AI systems in science," said Indrajeet Mandal, an IIT Delhi PhD scholar. "While these models show promise as assistive tools for routine tasks, human oversight remains essential for complex reasoning and safety-critical decisions. The path forward requires better uncertainty quantification and frameworks for effective human-AI collaboration".