ICAIS 2025
Full name: The 1st International Conference on AI Scientist
-
2510.0001
RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation
Large language models (LLMs) struggle to effectively utilize a growing number of external tools, such as those defined by the Model Context Protocol (MCP) [1], due to prompt bloat and selection complexity. We introduce RAG-MCP, a Retrieval-Augmented Generation framework that overcomes this challenge by offloading tool discovery. RAG-MCP uses semantic retrieval to identify the most relevant MCP(s) for a given query from an external index before engaging the LLM. Only the selected tool descriptions are passed to the model, drastically reducing prompt size and simplifying decision-making. Experiments, including an MCP stress test, demonstrate that RAG-MCP significantly cuts prompt tokens (e.g., by over 50%) and more than triples tool selection accuracy (43.13% vs. 13.62% for the baseline) on benchmark tasks. RAG-MCP enables scalable and accurate tool integration for LLMs.
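The retrieve-then-prompt pattern this abstract describes can be sketched in a few lines. The bag-of-words similarity and the toy tool registry below are illustrative assumptions, not RAG-MCP's actual encoder or external index, which would use a learned embedding model:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a learned encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tools(query: str, registry: dict, k: int = 1) -> list:
    # Rank tool descriptions by similarity to the query; only the top-k
    # descriptions would then be placed in the LLM prompt.
    q = embed(query)
    ranked = sorted(registry, key=lambda name: cosine(q, embed(registry[name])),
                    reverse=True)
    return ranked[:k]

registry = {
    "weather_mcp": "get current weather forecast temperature for a city",
    "calendar_mcp": "create list calendar events meetings schedule",
    "search_mcp": "web search for documents pages and articles",
}
print(retrieve_tools("what is the temperature in Paris", registry, k=1))
```

Passing only the one retrieved description to the model, instead of all registered tools, is what produces the prompt-token reduction the abstract reports.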
-
2510.0007
HEAL: Learning-Free Source-Free Unsupervised Domain Adaptation for Cross-Modality Medical Image Segmentation
Growing demands for clinical data privacy and storage constraints have spurred advances in Source-Free Unsupervised Domain Adaptation (SFUDA). SFUDA addresses domain shift by adapting models from the source domain to an unseen target domain without accessing source data, even when target-domain labels are unavailable. However, SFUDA faces significant challenges: the absence of source-domain data and of label supervision in the target domain, due to the source-free and unsupervised settings. To address these issues, we propose HEAL, a novel SFUDA framework that integrates Hierarchical denoising, Edge-guided selection, size-Aware fusion, and a Learning-free characteristic. Large-scale cross-modality experiments demonstrate that our method outperforms existing SFUDA approaches, achieving state-of-the-art (SOTA) performance. The source code is publicly available at: https://anonymous.4open.science/r/HEAL-10C5.
-
2510.0009
BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments
Large language models (LLMs) and vision-language models (VLMs) have the potential to transform biological research by enabling autonomous experimentation. Yet, their application remains constrained by rigid protocol design, limited adaptability to dynamic lab conditions, inadequate error handling, and high operational complexity. Here we introduce BioMARS (Biological Multi-Agent Robotic System), an intelligent platform that integrates LLMs, VLMs, and modular robotics to autonomously design, plan, and execute biological experiments. BioMARS uses a hierarchical architecture: the Biologist Agent synthesizes protocols via retrieval-augmented generation; the Technician Agent translates them into executable robotic pseudo-code; and the Inspector Agent ensures procedural integrity through multimodal perception and anomaly detection. The system autonomously conducts cell passaging and culture tasks, matching or exceeding manual performance in viability, consistency, and morphological integrity. It also supports context-aware optimization, outperforming conventional strategies in differentiating retinal pigment epithelial cells. A web interface enables real-time human-AI collaboration, while a modular backend allows scalable integration with laboratory hardware. These results highlight the feasibility of generalizable, AI-driven laboratory automation and the transformative role of language-based reasoning in biological research.
-
2510.0011
Automated Algorithmic Discovery for Gravitational-Wave Detection Guided by LLM-Informed Evolutionary Monte Carlo Tree Search
Gravitational-wave signal detection with unknown source parameters buried in dynamic detector noise remains a formidable computational challenge. Existing approaches face core limitations from restrictive assumptions: traditional methods rely on predefined theoretical priors, while neural networks introduce hidden biases and lack interpretability. We propose Evolutionary Monte Carlo Tree Search (Evo-MCTS), the first integration of large language model (LLM) guidance with domain-aware physical constraints for automated gravitational wave detection. This framework systematically explores algorithmic solution spaces through tree-structured search enhanced by evolutionary optimization, combining MCTS for strategic exploration with evolutionary algorithms for solution refinement. The LLM component provides domain-aware heuristics while maintaining interpretability through explicit algorithmic pathway generation. Experimental validation demonstrates substantial performance improvements, achieving a 20.2% improvement over state-of-the-art gravitational wave detection algorithms on the MLGWSC-1 benchmark dataset and a remarkable 59.1% improvement over other LLM-based algorithm optimization frameworks. Beyond performance improvements, our framework establishes a transferable methodology for automated algorithmic discovery across computational science domains.
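The strategic-exploration half of the framework rests on a standard MCTS selection rule. As a hedged illustration, a UCB1 tree policy is sketched below; UCB1 is a common default for MCTS, assumed here rather than taken from the paper:

```python
import math

def ucb1(child_value: float, child_visits: int, parent_visits: int,
         c: float = 1.414) -> float:
    # Unvisited children score infinity so each is tried at least once.
    if child_visits == 0:
        return float("inf")
    exploit = child_value / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(children: list, parent_visits: int) -> int:
    # children: list of (total_value, visit_count) pairs.
    # Returns the index of the child maximizing the UCB1 score.
    scores = [ucb1(v, n, parent_visits) for v, n in children]
    return scores.index(max(scores))

# A heavily-visited strong child vs. a rarely-tried one:
# the exploration bonus can still favor the under-sampled branch.
print(select_child([(9.0, 10), (1.0, 1)], parent_visits=11))
```

In an Evo-MCTS-style loop, each tree node would hold a candidate detection algorithm, with the evolutionary operators and LLM heuristics proposing children to expand.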
-
2510.0013
A Review of Intelligent Rock Mechanics: From Methods to Applications
Artificial Intelligence (AI) has great potential to transform rock mechanics by tackling its inherent complexities, such as anisotropy, nonlinearity, discontinuity, and multiphase nature. This review explores the evolution of AI, from basic neural networks like the BP model to advanced architectures such as Transformers, and their applications in areas like microstructure reconstruction, prediction of mechanical parameters, and engineering challenges such as rockburst prediction and tunnel deformation. Machine learning techniques, particularly Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs), have been crucial in automating tasks like fracture detection and efficiently generating 3D digital rock models. However, the effectiveness of AI in rock mechanics is limited by data scarcity and the need for high-quality datasets. Hybrid approaches, such as combining physics-informed neural networks (PINNs) with traditional numerical methods, offer promising solutions for solving governing equations. Additionally, Large Language Models (LLMs) are emerging as valuable tools for code generation and decision-making support. Despite these advancements, challenges remain, including issues with reproducibility, model interpretability, and adapting AI models to specific domains. Future progress will hinge on the availability of improved datasets, greater interdisciplinary collaboration, and the integration of spatial intelligence frameworks to bridge the gap between AI's theoretical potential and its practical application in rock engineering.
-
2510.0014
LLM-empowered knowledge graph construction: A survey
Knowledge Graphs (KGs) have long served as a fundamental infrastructure for structured knowledge representation and reasoning. With the advent of Large Language Models (LLMs), the construction of KGs has entered a new paradigm, shifting from rule-based and statistical pipelines to language-driven and generative frameworks. This survey provides a comprehensive overview of recent progress in LLM-empowered knowledge graph construction, systematically analyzing how LLMs reshape the classical three-layered pipeline of ontology engineering, knowledge extraction, and knowledge fusion. We first revisit traditional KG methodologies to establish conceptual foundations, and then review emerging LLM-driven approaches from two complementary perspectives: schema-based paradigms, which emphasize structure, normalization, and consistency; and schema-free paradigms, which highlight flexibility, adaptability, and open discovery. Across each stage, we synthesize representative frameworks, analyze their technical mechanisms, and identify their limitations. Finally, the survey outlines key trends and future research directions, including KG-based reasoning for LLMs, dynamic knowledge memory for agentic systems, and multimodal KG construction. Through this systematic review, we aim to clarify the evolving interplay between LLMs and knowledge graphs, bridging symbolic knowledge engineering and neural semantic understanding toward the development of adaptive, explainable, and intelligent knowledge systems.
-
2510.0018
Adaptive Evidential Meta-Learning with Hyper-Conditioned Priors for Calibrated ECG Personalisation
This research addresses a fundamental gap in uncertainty calibration during electrocardiogram (ECG) model personalisation. We propose Adaptive Evidential Meta-Learning, a framework that attaches a lightweight evidential head with hyper-network-conditioned priors to a frozen ECG foundation model. The hyper-network dynamically sets the evidential prior using robust, class-conditional statistics computed from a few patient-specific ECG samples. Trained via a two-stage meta-curriculum, our approach enables rapid adaptation with well-calibrated uncertainty estimates, making it highly applicable for real-world clinical deployment where both prediction accuracy and uncertainty awareness are crucial.
-
2510.0019
Hierarchical Adaptive Normalization: A Placement-Conditioned Cascade for Robust Wearable Activity Recognition
Wearable Human Activity Recognition (HAR) systems face significant performance degradation when sensors are placed at different body locations or orientations. We introduce a hierarchical adaptive normalization method that addresses these challenges through a two-stage cascade. The first stage combines gravity-based orientation correction with placement context inference using signal variance analysis, while a novel stability gate prevents harmful adaptation during unstable periods. The second stage employs placement-conditioned adaptive Batch Normalization to refine feature representations in real-time. Comprehensive evaluations on public and custom datasets show that our method achieves 0.847±0.023 macro F1-score, outperforming static baselines by 36% and state-of-the-art unsupervised domain adaptation methods by 13.7%. The approach maintains real-time performance with only 2.3 ms inference time and 45.2 MB memory usage, demonstrating practical viability for on-device deployment in dynamic real-world scenarios.
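The second-stage idea, maintaining separate normalization statistics conditioned on the inferred placement, can be sketched as follows. The scalar running-statistics form, the momentum value, and the post-update deviation estimate are simplifying assumptions, not the paper's exact adaptive Batch Normalization:

```python
import math

class PlacementConditionedNorm:
    # Keeps separate running mean/variance per placement context and
    # normalizes each sample with the statistics of its inferred placement.
    def __init__(self, momentum: float = 0.1, eps: float = 1e-5):
        self.momentum, self.eps = momentum, eps
        self.stats = {}  # placement -> (running_mean, running_var)

    def __call__(self, x: float, placement: str, update: bool = True) -> float:
        mean, var = self.stats.get(placement, (0.0, 1.0))
        if update:
            # Exponential moving averages; deviation is measured against
            # the freshly updated mean (a simplification).
            mean = (1 - self.momentum) * mean + self.momentum * x
            var = (1 - self.momentum) * var + self.momentum * (x - mean) ** 2
            self.stats[placement] = (mean, var)
        return (x - mean) / math.sqrt(var + self.eps)

norm = PlacementConditionedNorm()
for _ in range(200):          # wrist signals centred near 5.0
    norm(5.0, "wrist")
for _ in range(200):          # ankle signals centred near -3.0
    norm(-3.0, "ankle")
# After adaptation, each placement maps its typical value near zero.
print(norm(5.0, "wrist", update=False), norm(-3.0, "ankle", update=False))
```

The stability gate described above would simply set `update=False` during periods flagged as unstable, freezing the per-placement statistics.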
-
2510.0020
Hierarchical Change Signature Analysis: A Framework for Online Discrimination of Incipient Faults and Benign Drifts in Industrial Time Series
Industrial fault detection systems often struggle to distinguish benign operational drifts (e.g., tool wear, recipe changes) from incipient faults, frequently adapting to faults as new "normal" states and risking catastrophic failures. This work proposes a hierarchical framework that decouples change detection from change characterization. When a drift is detected, the system generates a Multi-Scale Change Signature (MSCS) that quantifies geometric and statistical transformations in the primary detector's latent space. An unsupervised Drift Characterization Module (DCM), trained on an Online Normality Baseline (ONB), classifies each signature as benign or potentially faulty. Benign drifts are ignored, while potential faults are flagged for review; confirmed benign drifts are incorporated into the ONB for future adaptation. The framework is model-agnostic, computationally efficient, and scalable through a tiered human-in-the-loop mechanism. Experiments on the Tennessee Eastman Process dataset with injected drifts and faults demonstrate high fault detection rates, fewer false alarms, and efficient adaptation to benign changes.
-
2510.0021
ConFIT: A Robust Knowledge-Guided Contrastive Framework for Financial Extraction
Financial text extraction faces serious challenges in multi-entity sentiment attribution and numerical sensitivity, often leading to pitfalls in real-world deployment. In this work, we propose ConFIT (Contrastive Financial Information Tuning), a knowledge-guided contrastive learning framework that employs a Semantic-Preserving Perturbation (SPP) engine to generate high-quality, programmatically synthesized hard negatives. By integrating domain knowledge sources such as the Loughran-McDonald lexicon and Wikidata, and applying rigorous perplexity and Natural Language Inference (NLI) filtering, ConFIT trains language models to differentiate subtle perturbations in financial statements. Evaluations on FiQA and SENTiVENT using FinBERT and Llama-3 8B show both promising improvements and unexpected pitfalls, highlighting challenges that warrant further research.
-
2510.0022
Adaptive Log Anomaly Detection through Data-Centric Drift Characterization and Policy-Driven Lifelong Learning
Log-based anomaly detectors degrade over time due to concept drift arising from software updates or workload changes. Existing systems typically react by retraining entire models, leading to catastrophic forgetting and inefficiencies. We propose an adaptive framework that first classifies drift in log data into semantic (frequency shifts within known templates) and syntactic (emergence of new log templates) categories via statistical tests and novelty detection. Based on the identified drift type, a policy-driven lifelong learning manager applies targeted updates: experience replay to mitigate forgetting under semantic drift, and dynamic model expansion to accommodate syntactic drift. This approach is validated on semi-synthetic logs and real-world longitudinal datasets (HDFS, Apache, and BGL), maintaining high F1-scores, reducing computational overhead, and preserving historical knowledge compared to monolithic retraining.
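The semantic/syntactic split can be illustrated with a minimal classifier over log-template counts. The L1 distance between normalized frequency distributions and both thresholds below are illustrative stand-ins for the paper's statistical tests and novelty detector:

```python
def classify_drift(baseline_counts: dict, window_counts: dict,
                   novelty_frac: float = 0.1, freq_shift: float = 0.5) -> str:
    # Syntactic drift: a meaningful share of events in the new window
    # comes from templates never seen in the baseline.
    new_templates = set(window_counts) - set(baseline_counts)
    total = sum(window_counts.values())
    if total and sum(window_counts[t] for t in new_templates) / total >= novelty_frac:
        return "syntactic"
    # Semantic drift: frequencies of known templates shift substantially
    # (L1 distance between normalized distributions, standing in for a
    # formal statistical test).
    base_total = sum(baseline_counts.values()) or 1
    dist = sum(abs(window_counts.get(t, 0) / (total or 1) - c / base_total)
               for t, c in baseline_counts.items())
    return "semantic" if dist >= freq_shift else "none"

baseline = {"login ok": 80, "disk read": 20}
print(classify_drift(baseline, {"login ok": 20, "disk read": 80}))    # frequency flip
print(classify_drift(baseline, {"login ok": 70, "kernel panic": 30}))  # new template
```

The policy manager would then dispatch on the returned label: replay for "semantic", model expansion for "syntactic", and no action for "none".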
-
2510.0023
Robust Zero-Shot NER for Crises via Iterative Knowledge Distillation and Confidence-Gated Induction
This research presents a comprehensive diagnostic study of confidence-gated iterative induction for zero-shot Named Entity Recognition (NER) in crisis scenarios. While existing approaches struggle to adapt to novel disaster lexicons without manually curated resources, we investigate whether iterative knowledge distillation can overcome these limitations. Our framework leverages a pretrained language model to extract high-recall entity candidates, then iteratively distills domain knowledge through a self-correcting loop that uses high-confidence seeds to induce micro-gazetteers and syntactic rules. Comprehensive evaluations on synthetic crisis data reveal that the framework maintains a constant zero-shot F1-score of approximately 0.295 across all experimental configurations, demonstrating that the iterative mechanism provides no measurable improvement over baseline approaches. This negative result offers valuable diagnostic insights into the fundamental challenges of adaptive NER in dynamic crisis domains, including confidence threshold calibration difficulties, clustering algorithm limitations, and error propagation risks. The findings provide a cautionary tale for researchers working on adaptive NER systems and establish a foundation for future research on more robust zero-shot approaches in crisis scenarios.
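The confidence-gated loop, and the kind of fixed point that can explain a flat F1 like the one reported above, can be sketched as follows. The threshold value and the candidate format are illustrative assumptions:

```python
def iterative_induction(candidates, gazetteer=None, threshold=0.8, rounds=3):
    # candidates: list of (entity, confidence) pairs from a high-recall
    # extractor. High-confidence seeds are promoted into a micro-gazetteer;
    # gazetteer hits are accepted outright on later passes.
    gazetteer = set(gazetteer or [])
    for _ in range(rounds):
        before = len(gazetteer)
        for entity, conf in candidates:
            if conf >= threshold or entity in gazetteer:
                gazetteer.add(entity)
        if len(gazetteer) == before:
            # Fixed point: no new knowledge was induced, so further
            # iterations cannot change the output.
            break
    return gazetteer

cands = [("Hurricane Zeta", 0.95), ("Red Cross", 0.85), ("the storm", 0.40)]
print(sorted(iterative_induction(cands)))
```

Note that in this toy form the loop saturates after a single pass: unless intermediate rounds re-score the low-confidence candidates, iteration adds nothing, which is consistent with the diagnostic finding in the abstract.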
-
2510.0024
LECTOR: LLM-Enhanced Concept-based Test-Oriented Repetition
Spaced repetition systems are fundamental to efficient learning and memory retention, but existing algorithms often struggle with semantic interference and personalized adaptation. We present LECTOR (LLM-Enhanced Concept-based Test-Oriented Repetition), a novel adaptive scheduling algorithm specifically designed for test-oriented learning scenarios, particularly language examinations where success rate is paramount. LECTOR leverages large language models for semantic analysis while incorporating personalized learning profiles, addressing the critical challenge of semantic confusion in vocabulary learning by utilizing LLM-powered semantic similarity assessment and integrating it with established spaced repetition principles. Our comprehensive evaluation against six baseline algorithms (SSP-MMC, SM2, HLR, FSRS, ANKI, THRESHOLD) across 100 simulated learners over 100 days demonstrates significant improvements: LECTOR achieves a 90.2% success rate compared to 88.4% for the best baseline (SSP-MMC), representing a 2.0% relative improvement. The algorithm shows particular strength in handling semantically similar concepts, reducing confusion-induced errors while maintaining computational efficiency. Our results establish LECTOR as a promising direction for intelligent tutoring systems and adaptive learning platforms.
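One way to combine SM-2-style interval growth with a semantic-similarity penalty is sketched below. The multiplicative penalty form is an assumption made for illustration, not LECTOR's actual update rule; in the real system the similarity score would come from the LLM-powered assessment:

```python
def next_interval(prev_interval: float, quality: int, max_similarity: float,
                  ease: float = 2.5, penalty: float = 0.5) -> float:
    # quality: 0-5 recall grade as in SM-2; max_similarity: highest
    # semantic similarity (0-1) to any other scheduled item, i.e. the
    # confusion risk the scheduler compensates for.
    if quality < 3:
        return 1.0  # failed recall: restart at a one-day interval
    interval = prev_interval * ease
    # Shrink the interval for confusable items so they are reviewed
    # more often before interference can set in.
    return interval * (1.0 - penalty * max_similarity)

print(next_interval(4.0, quality=5, max_similarity=0.0))  # isolated word
print(next_interval(4.0, quality=5, max_similarity=0.8))  # confusable word
```

An isolated word keeps the full SM-2 growth (4 days to 10 days here), while a highly confusable one is pulled forward, which is the scheduling behavior the abstract attributes to LECTOR's semantic analysis.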
-
2510.0027
From Knowledge Tree to Knowledge Forest: Harnessing Chemical Understanding with Machine Learning and Artificial Intelligence
The 2024 Physics and Chemistry Nobel Prizes for machine learning (ML) and artificial intelligence (AI) breakthroughs marked "Year 1 of AI for Science," underscoring their transformative role in the physical sciences. Yet data are not the same as understanding, a distinction central to chemistry, which has long relied on concepts such as bond, aromaticity, and reactivity as scaffolds for understanding and explanation. Building on our recent perspectives (ACS Phys. Chem. Au 2024, 4, 135-142; J. Chem. Theory Comput. 2025, DOI: 10.1021/acs.jctc.5c01299), this article explores how ML/AI can become engines of chemical understanding. We introduce a quintet of chemical knowledge (ontology, epistemology, theory, concept, and understanding) and develop the metaphors of the Knowledge Tree and Knowledge Forest to show how diverse epistemologies interact and recursively enrich one another. Case studies on aromaticity, catalysis, orbital-free density functional theory, and protein folding illustrate how ML features, when interpreted as conceptual roots, yield fruits of understanding. Contrasting multiscale modeling with hierarchical modeling, we argue that ML enables emergent, concept-driven integration across levels. Cultivating this plural and hierarchical ecosystem may guide theoretical chemistry toward its next breakthroughs, resolving Dirac's dilemma not by brute force but by forests of concepts that transform data into enduring understanding.
-
2510.0032
Artificial Intelligence in Biomedical Research: From Data Integration to Precision Medicine
This comprehensive review examines the transformative role of artificial intelligence in biomedical research, from foundational data integration to clinical applications. The paper explores how AI techniques facilitate multimodal data fusion across diverse biological data types, employing both traditional statistical methods and advanced deep learning architectures including variational autoencoders, graph neural networks, and transformer models. It evaluates AI applications in medical imaging, where convolutional neural networks have achieved remarkable diagnostic accuracy (up to 94% in COVID-19 detection) while enhancing segmentation and classification tasks across multiple imaging modalities. The review further investigates generative AI's impact on molecular design and drug discovery, highlighting transformer-based architectures like TransAntivirus that navigate vast chemical spaces to optimize therapeutic candidates. Finally, it examines AI-enabled precision medicine applications, including Clinical Decision Support Systems and federated learning approaches that balance analytical power with privacy preservation. Despite significant progress, implementation challenges persist, including data heterogeneity, model explainability, and ethical concerns regarding bias and privacy. The paper underscores the importance of developing interpretable AI systems that integrate seamlessly into clinical workflows while addressing regulatory, ethical, and economic considerations to realize the full potential of AI in advancing biomedical research and healthcare delivery.
-
2510.0034
Cognitive-YOLO: LLM-Driven Architecture Synthesis from First Principles of Data for Object Detection
Designing high-performance object detection architectures is a complex task, where traditional manual design is time-consuming and labor-intensive, and Neural Architecture Search (NAS) is computationally prohibitive. While recent approaches using Large Language Models (LLMs) show promise, they often function as iterative optimizers within a search loop, rather than generating architectures directly from a holistic understanding of the data. To address this gap, we propose Cognitive-YOLO, a novel framework for LLM-driven architecture synthesis that generates network configurations directly from the intrinsic characteristics of the dataset. Our method consists of three stages: first, an analysis module extracts key meta-features (e.g., object scale distribution and scene density) from the target dataset; second, the LLM reasons upon these features, augmented with state-of-the-art components retrieved via Retrieval-Augmented Generation (RAG), to synthesize the architecture into a structured neural network description, which we term the Neural Architecture Description Language (NADL); finally, a compiler instantiates this description into a deployable model. Extensive experiments on five diverse object detection datasets demonstrate that our proposed Cognitive-YOLO consistently generates superior architectures, achieving state-of-the-art (SOTA) performance by outperforming strong baseline models across multiple benchmarks.
-
2510.0035
MotivGraph-SoIQ: Integrating Motivational Knowledge Graphs and Socratic Dialogue for Enhanced LLM Ideation
Large Language Models (LLMs) hold substantial potential for accelerating academic ideation but face critical challenges in grounding ideas and mitigating confirmation bias for further refinement. We propose integrating motivational knowledge graphs and Socratic dialogue to address these limitations in enhanced LLM ideation (MotivGraph-SoIQ). This novel framework provides essential grounding and practical idea improvement steps for LLM ideation by integrating a Motivational Knowledge Graph (MotivGraph) with a Q-Driven Socratic Ideator. The MotivGraph structurally stores three key node types (problem, challenge, and solution) to offer motivation grounding for the LLM ideation process. The Ideator is a dual-agent system utilizing Socratic questioning, which facilitates a rigorous refinement process that mitigates confirmation bias and improves idea quality across the dimensions of novelty, experimental rigor, and motivational rationality. On the ICLR25 paper topics dataset, MotivGraph-SoIQ exhibits clear advantages over existing state-of-the-art approaches across LLM-based scoring, ELO ranking, and human evaluation metrics.
-
2510.0036
A Self-Driving Laboratory for Materials Science: An Autonomous Research Agent for Deep Data Analysis and Interpretation
As artificial intelligence increasingly permeates scientific research, the "AI for Science" paradigm is evolving to enable more autonomous scientific workflows. Traditional research processes heavily rely on researchers' expertise and manual operations, particularly in data analysis and interpretation, the critical "last mile" from raw data to profound insights. This paper presents an autonomous research agent for materials science that achieves end-to-end automation from raw characterization data to deep analytical interpretation. The system integrates four core innovations: (1) AI-driven automatic data understanding with unified ingestion of heterogeneous instrument data, (2) automated data analysis through an extensible algorithm library, (3) a one-click automated reporting system, and (4) interactive AI-powered data interpretation via natural language dialogue. We demonstrate the agent's capabilities through real-world case studies across multiple characterization techniques (Raman, UPS, UV-Vis, TG), achieving remarkable performance: UV-Vis bandgap analysis is accelerated by 600× compared to manual processing, while maintaining exceptional accuracy with fitting precision R² ≥ 0.999. The system reduces analysis time from hours to seconds while ensuring objectivity and reproducibility. By automating the data analysis pipeline while preserving human oversight and interpretability, this work contributes a practical component toward building more autonomous scientific discovery systems in materials research.
-
2510.0038
The Hitchhiker's Guide to Autonomous Research: A Survey of Scientific Agents
The advancement of LLM-based agents is redefining AI for Science (AI4S) by enabling autonomous scientific research. Prominent LLMs exhibit expertise across multiple domains, catalysing the construction of domain-specialised scientific agents. Nevertheless, the profound epistemic and methodological gaps between AI and the natural sciences still impede the systematic design, training, and validation of these agents. This survey bridges the existing gap by presenting an exhaustive blueprint for scientific agents, spanning systematic construction methodologies, targeted capability enhancement, and rigorous evaluations. Anchored in the canonical scientific workflow, this paper (i) presents an overview of scientific agents, starting with the development from general-purpose agents to scientific agents driven by articulated goal-orientation, then advancing a comprehensive taxonomy that organises existing agents by construction strategy and capability scope, and (ii) introduces a two-tier progressive framework, from constructing scientific agents from scratch to targeted capability enhancement, for realizing autonomous scientific research. It is our aspiration that this survey will serve as guidance for researchers across various domains, facilitating the systematic design of domain-specific scientific agents and stimulating further innovation in AI-driven scientific research. To support long-term progress, we curate a live repository (Awesome_Scientific_Agent: https://github.com/gudehhh666/Awesome_Scientific_Agent.git) that continuously aggregates emerging methods, benchmarks, and best practices.
-
2510.0039
Uncertainty Quantification in Machine Learning for Responsible AI
Machine learning and artificial intelligence will be deeply embedded in the intelligent systems humans use to automate tasking, optimize planning, and support decision-making. We present a critical review of uncertainty quantification (UQ) in large language models (LLMs), synthesizing insights from over 80 papers across leading venues (ACL, ASE, NeurIPS, ICML, AAAI, IJCAI, Nature, and others). We introduce UQ-Net, a unified probabilistic framework that combines Bayesian modeling, calibration, conformal prediction, and selective decision rules to disentangle epistemic and aleatoric uncertainty and to support reliable decision thresholds. UQ-Net integrates uncertainty estimates with calibration procedures and anomaly detection to enable safer selective deployment of LLM agents. Through case studies in medical diagnosis and code generation, we demonstrate that UQ-Net improves calibration and reduces predictive error by 15-20% relative to standard baselines. We survey existing evaluation practices and identify critical gaps: misalignment of consistency and entropy with factuality, lack of benchmarks for multi-episode interactions, and inconsistent metrics for calibration and tightness. We advocate for context-aware datasets, standardized metrics, and human-in-the-loop evaluations to better align UQ methods with deployment needs. Our review and proposed framework offer a principled foundation for operationalizing UQ in LLMs, advancing the development of trustworthy, responsible agentic AI for safety-sensitive, real-world applications.
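Of the ingredients this framework combines, conformal prediction paired with a selective decision rule is the most mechanical to illustrate. Below is a minimal split-conformal sketch; the string-based abstention interface is an assumed simplification, not part of UQ-Net itself:

```python
import math

def conformal_threshold(calib_scores, alpha=0.1):
    # Split conformal prediction: take the (1 - alpha) empirical quantile
    # of calibration-set nonconformity scores, with the standard
    # finite-sample (n + 1) correction.
    n = len(calib_scores)
    rank = min(n - 1, int(math.ceil((n + 1) * (1 - alpha))) - 1)
    return sorted(calib_scores)[rank]

def selective_predict(prediction, score, threshold):
    # Selective decision rule: abstain (defer to a human) whenever the
    # nonconformity score exceeds the calibrated threshold.
    return prediction if score <= threshold else "ABSTAIN"

calib = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q = conformal_threshold(calib, alpha=0.1)
print(q)
print(selective_predict("pneumonia", 0.25, q))
print(selective_predict("pneumonia", 0.95, q))
```

Under the usual exchangeability assumption, accepted predictions carry roughly 1 - alpha coverage, which is the kind of reliable decision threshold the abstract argues selective LLM deployment requires.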