ICAIS 2025
Full name: The 1st International Conference on AI Scientist
-
[2510.0043] Decoupling Openness and Connectivity: Non-Monotonic Effects in LLM-Based Cultural Dynamics
Cultural dynamics in multi-agent systems exhibit a counterintuitive phenomenon: local similarity-based interactions can lead to global fragmentation rather than convergence. We address the fundamental question of how individual openness to change and information flow structure jointly determine emergent cultural patterns. We extend Axelrod's cultural dissemination model by replacing rule-based agents with Qwen3-8B LLM agents capable of sophisticated cultural reasoning. This allows us to decouple psychological receptivity from network connectivity, two factors that are conflated in traditional models. Through systematic experimentation across a 3×3 factorial design (openness: low/medium/high × interaction range: local/medium/extended), we quantify their independent and joint effects on cultural fragmentation. Our results demonstrate strong main effects: the Cultural Homogeneity Index increases from 0.279 to 0.437 with higher openness (1st-order interactions, +57%), while optimal information flow (3rd order) achieves the highest convergence at 0.489 for high-openness agents, representing a 75% improvement over the low-openness baseline (0.279). Critically, we uncover a non-monotonic relationship in which 3rd-order interactions consistently outperform both 1st- and 5th-order interactions across all openness levels, revealing an optimal balance between exploration and exploitation. Code can be found at https://anonymous.4open.science/r/YuLan-OneSim/.
-
[2510.0045] PST-AUTO-AGENT: A Multi-Agent Ensemble Framework for Paper Source Tracing
The escalating volume of scientific literature necessitates efficient methods for identifying foundational works that significantly inform new research. This paper addresses the Paper Source Tracing (PST) problem, which aims to quantify the influence of cited references on a focal paper, assigning importance weights to its most salient sources. To this end, we propose a novel multi-agent ensemble architecture for PST, integrating Deepseek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro. Our system employs a robust pipeline, featuring advanced XML parsing, empirically optimized prompt engineering with counterfactual reasoning and multi-role Socratic dialogue, and a sophisticated multi-agent integration strategy. This strategy uses weighted model predictions, intelligent default scoring, and a consistency penalty mechanism to derive precise source paper identifications. Our method provides a strong tuning-free baseline for the PST problem that requires no feature engineering, and it achieves top-ranked results when combined with feature engineering techniques. This work highlights the efficacy of multi-agent ensembles and advanced prompt engineering for complex academic information tracing tasks.
-
[2510.0051] COMD: Coherent Masked Diffusion
Masked language models (MLMs) have shown promise in natural language processing but struggle to generate coherent text. In this work, we present Coherent Masked Diffusion (CoMD), a novel framework that extends Masked Language Diffusion to learn coherent and incoherent language more efficiently and more effectively. CoMD is built on Masked Language Diffusion (MLD), a recently proposed framework that models text generation as an inverse denoising diffusion process. Unlike MLD, CoMD uses a fixed mask matrix that is independent of the masked-out token and optimizes the probability of coherent generations with a novel coherence loss term, without requiring additional samples per training step. Additionally, CoMD uses a variable time parameter to guide the coherence probability toward the ground-truth coherence probability. Both inference and training computation are constant with respect to the length of the text. Empirically, CoMD outperforms previous methods on multiple coherence benchmarks. Furthermore, CoMD achieves inference speedups of 7.3x and 10.5x over MLD and MDLM, respectively, and is significantly more compute- and parameter-efficient than autoregressive models.
-
[2510.0052] Conformal Prediction as Bayesian Quadrature for Risk Control
In this paper, we present a novel framework that leverages Bayesian quadrature for conformal prediction to achieve rigorous, data-conditional, and distribution-free risk guarantees, addressing the challenge of controlling predictive risk in high-stakes, black-box settings. Our approach constructs an upper bound on the expected loss by integrating over the quantile function of the loss distribution: given calibration losses ℓ_1, ..., ℓ_n, we define the aggregated loss as L+ = Σ_{i=1}^{n+1} U_i ℓ_(i), with Dirichlet random variables (U_1, ..., U_{n+1}) ~ Dir(1, ..., 1) and ℓ_(n+1) = B, thereby ensuring that the condition Pr(L+ ≤ α) ≥ β is met. Our contributions include a principled derivation that recovers well-known conformal methods such as Split Conformal Prediction (SCP) and Conformal Risk Control (CRC) as special cases, while introducing a novel high posterior density (HPD) rule that exploits the full posterior of L+. We rigorously validate our method on synthetic binomial loss and heteroskedastic regression tasks, where experimental results indicate that methods based solely on the posterior mean (CRC) or uniform concentration bounds (RCPS) often yield either overly optimistic or conservative decisions, whereas our HPD rule achieves risk control with a zero empirical failure rate and improved utility. For example, in the binomial experiment, while SCP selects an average λ of 0.596 with a 61.6% failure rate, HPD selects λ ≈ 0.970 with a 0% failure rate; a similar trend is observed in regression tasks, with test risks decreasing from 0.512 for SCP to 0.067 for HPD. These findings, summarized in Table 1, confirm that our Bayesian quadrature reformulation not only provides a more interpretable statistical characterization of conformal risk but also adapts effectively to calibration sample size and confidence level tuning, thus offering a robust solution for high-stakes decision-making.
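The Dirichlet-weighted aggregated loss above can be sketched numerically. The snippet below is a minimal stdlib-only illustration, not the authors' implementation: it draws Dir(1, ..., 1) weights over the sorted calibration losses with the bound B appended, and then checks the coverage condition Pr(L+ ≤ α) empirically. The calibration losses, B, and α are made-up toy values.

```python
import random

def aggregated_loss_samples(cal_losses, B, n_samples=2000, seed=0):
    """Monte Carlo draws of L+ = sum_{i=1}^{n+1} U_i * loss_(i), where
    (U_1, ..., U_{n+1}) ~ Dir(1, ..., 1) and loss_(n+1) = B."""
    rng = random.Random(seed)
    losses = sorted(cal_losses) + [B]  # order statistics, bound appended
    samples = []
    for _ in range(n_samples):
        # A Dirichlet(1, ..., 1) vector is a normalized vector of Exp(1) draws.
        e = [rng.expovariate(1.0) for _ in losses]
        total = sum(e)
        samples.append(sum((ei / total) * li for ei, li in zip(e, losses)))
    return samples

# Empirically check the coverage condition Pr(L+ <= alpha) for a toy setup.
cal = [0.1, 0.2, 0.15, 0.05, 0.3, 0.12, 0.22, 0.18]  # made-up calibration losses
samps = aggregated_loss_samples(cal, B=1.0)
coverage = sum(s <= 0.5 for s in samps) / len(samps)  # alpha = 0.5
```

Because each sample of L+ is a convex combination of the losses and B, it is bounded between the smallest calibration loss and B, which makes the coverage check well defined.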
-
[2510.0055] Quantifying the Trade-Offs in Policy Evaluation
This work presents a comprehensive framework for quantifying the trade-off between prediction accuracy and screening access in policy evaluation, where we address the challenge of identifying and targeting the worst-off individuals through the rigorous estimation of a policy value function defined as V(α, β, R²) = Φ₂(z_α, z_β; ρ)/β, with z_α = Φ⁻¹(α), z_β = Φ⁻¹(β), and ρ = R². Our approach introduces the Prediction-Access Ratio (PAR) as a metric to quantify the relative impact of finite improvements in screening thresholds versus enhancements in predictive accuracy, thereby overcoming challenges associated with non-linear sensitivities such as ∂V/∂α ≈ 1.77513 and ∂V/∂R² ≈ 0.61282. We verify our framework using extensive simulation experiments on synthetic datasets in which a complex model's test R² improves from 0.16866 to 0.32661 through residual scaling with δ = 0.1 and the associated empirical policy value V(α, β) increases from 0.70000 to 0.80000. These results are further supported by capacity gap analyses demonstrating that a minimal additional screening increment, Δα* ≈ 0.0300, can yield gains comparable to those from complex model enhancements. This integrated strategy thereby provides actionable insights for policy interventions aimed at equalizing access while maintaining efficiency, a pertinent issue given the inherent difficulties arising from the interplay between prediction improvement and screening capacity in heterogeneous populations.
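As a sketch of the policy value function above, the snippet below estimates Φ₂ (the bivariate standard normal CDF) by Monte Carlo and divides by β, taking ρ = R² literally as written in the abstract. This is an illustrative stand-in under that assumption, not the paper's estimator, and the α, β, R² inputs are made up.

```python
import math
import random
from statistics import NormalDist

def policy_value(alpha, beta, r2, n=100_000, seed=0):
    """Monte Carlo estimate of V(alpha, beta, R^2) = Phi2(z_alpha, z_beta; rho) / beta,
    with rho = R^2 as stated in the abstract."""
    rng = random.Random(seed)
    nd = NormalDist()
    z_a, z_b = nd.inv_cdf(alpha), nd.inv_cdf(beta)
    rho = r2
    hits = 0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Construct z2 so that corr(z1, z2) = rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        hits += (z1 <= z_a) and (z2 <= z_b)
    return (hits / n) / beta

v = policy_value(alpha=0.10, beta=0.20, r2=0.3)
```

A quick sanity check on the sketch: raising R² (and hence ρ) increases the overlap probability Φ₂, so the policy value should rise, which mirrors the accuracy-versus-access trade-off the abstract quantifies.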
-
[2510.0056] Ensemble-Based Bayesian Aggregation with Uncertainty-Guided Clarifications for Multi-Turn Human-LLM Collaboration
Our work addresses the challenge of optimizing long-term, multi-turn human-LLM collaboration by introducing an ensemble of Monte Carlo-based reward predictors, Bayesian meta-calibration, and an uncertainty-guided clarification module that dynamically triggers clarifying interactions. In particular, we estimate the conversation-level reward as R̂(t | g) = R_ext(t, g) + R_int(t), where R_ext(t, g) quantifies task-specific success (e.g., BLEU scores reaching up to 80% in document editing and unit test pass rates near 70% in code generation) and R_int(t) incorporates an efficiency penalty defined as -min[λ · TokenCount(t), 1] with λ = 0.01, augmented by an LLM-based interactivity score. Our approach further employs Bayesian linear regression to aggregate the ensemble signals into a unified reward while simultaneously providing an uncertainty metric which, if it exceeds a predefined threshold (e.g., 0.15), triggers an auxiliary clarification round that improves the aggregated outcome. This mechanism is mathematically formulated and empirically validated through improvements such as an increase in accuracy from 73.9% to 79.9% in mathematical problem solving and a rise in ambiguous-dialogue resolution from 80% to 100% in our experiments. Challenges arise from noisy reward estimation and the trade-off between immediate task performance and long-term conversational quality, which we address via extensive ablation studies on window sizes (w ∈ {1, 2, 3}) and Monte Carlo sample counts (S ∈ {3, 5}), as summarized in Table 1 (e.g., MediumDocEdit-Chat: BLEU 0.625 → 0.637; BigCodeBench-Chat: unit test pass rate 0.532 → 0.489; MATH-Chat: accuracy 0.739 → 0.799; Abg-CoQA: macro accuracy/F1 0.8 → 1.0). Overall, this work contributes a robust framework that integrates ensemble learning, uncertainty estimation, and dynamic clarification to effectively enhance the collaborative potential between human users and language models in complex, multi-turn settings.
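The reward decomposition and clarification trigger described above admit a compact sketch. This is a simplified illustration under stated assumptions: the ensemble uncertainty is approximated by a population standard deviation rather than the paper's Bayesian posterior metric, and all inputs are toy numbers.

```python
from statistics import pstdev

def intrinsic_reward(token_count, lam=0.01):
    """Efficiency penalty R_int: -min(lam * TokenCount(t), 1), with lambda = 0.01."""
    return -min(lam * token_count, 1.0)

def conversation_reward(r_ext, token_count, lam=0.01):
    """Conversation-level reward R(t | g) = R_ext(t, g) + R_int(t)."""
    return r_ext + intrinsic_reward(token_count, lam)

def needs_clarification(ensemble_rewards, threshold=0.15):
    """Trigger a clarifying round when ensemble disagreement exceeds the threshold
    (a std-dev stand-in for the Bayesian uncertainty metric)."""
    return pstdev(ensemble_rewards) > threshold

r = conversation_reward(r_ext=0.8, token_count=50)  # 0.8 minus capped penalty 0.5
ask = needs_clarification([0.3, 0.7, 0.2])          # high disagreement -> True
```

The `min[., 1]` cap means the efficiency penalty saturates at 1 for very long turns, so a verbose but successful response is never penalized without bound.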
-
[2510.0057] Adaptive Prompt-Enhanced Score Matching for Partially Observed Data
Adaptive prompt-enhanced score matching for partially observed data addresses the challenging problem of recovering score functions from datasets with significant missing entries, where traditional imputation methods or naïve score estimators often fail to achieve reliable parameter recovery and structural inference. In our work, we consider both marginal Importance-Weighted (Marg-IW) and marginal Variational (Marg-Var) approaches to estimate the score function, using a surrogate mean squared error loss. Here s_θ(x) is the estimated score, computed as -P(x - ÎŒ), and s_true(x) = -P_true(x - ÎŒ_true), with P_true representing the true precision matrix. This formulation inherently accounts for the missingness mechanism, typically modeled as MCAR with a missing rate of 30%, and is further stabilized via techniques such as log-sum-exp and gradient clipping. Our contributions include the integration of a meta-learning prompt generator, which dynamically selects key hyperparameters (e.g., sample size r ∈ {5, 10, 50}, number of inner-loop steps L, learning rates 1×10⁻², 5×10⁻³, 1×10⁻³, and truncation parameters) to optimize convergence behavior across a diverse set of synthetic datasets including multivariate Gaussians, ICA-inspired models, and sparse Gaussian graphical models (GGMs) with star graph structures. Experimental results demonstrate significant improvements: for instance, in the Gaussian experiment the loss decreased from 9.687 at iteration 50 to 0.094 at iteration 300 and the corresponding parameter error decreased from 3.033 to approximately 2.030, while in the GGM case the ROC AUC improved from 0.219 to 0.97, thereby confirming our method's efficacy in both parameter estimation and structure recovery under partial observations.
These empirical validations underscore the relevance of adaptive score matching in high-dimensional and complex data regimes, set against the inherent difficulties of handling missing data and ensuring numerical stability in the estimation process, and pave the way for future extensions to accommodate MNAR scenarios and diffusion-based denoising score matching frameworks.
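The Gaussian score identity used above, s(x) = -P(x - ÎŒ) with P a precision matrix, can be written out directly. The snippet below is a small stdlib-only sketch with a made-up 2×2 precision matrix, not the paper's Marg-IW/Marg-Var estimators; `score_mse` mirrors the surrogate mean squared error loss in spirit only.

```python
def gaussian_score(x, mu, precision):
    """Score of a multivariate Gaussian density: s(x) = -P (x - mu),
    where P is the precision matrix."""
    d = len(x)
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return [-sum(precision[i][j] * diff[j] for j in range(d)) for i in range(d)]

def score_mse(s_est, s_true):
    """Surrogate mean squared error between estimated and true score vectors."""
    return sum((a - b) ** 2 for a, b in zip(s_est, s_true)) / len(s_est)

P_true = [[2.0, 0.5], [0.5, 1.0]]  # made-up 2x2 precision matrix
s = gaussian_score([1.0, -1.0], mu=[0.0, 0.0], precision=P_true)  # [-1.5, 0.5]
```

For a Gaussian, the score is linear in x, which is why matching it suffices to recover both the mean and the precision structure, the two quantities the experiments above track.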
-
[2510.0058] Adaptive Inference Strategies for Token-Ordering
Adaptive token-ordering strategies for masked diffusion models (MDMs) and autoregressive models (ARMs) are critical for addressing the inherent imbalance in subproblem difficulties during sequence generation, which becomes increasingly relevant as models scale to complex reasoning tasks. In this work, we tackle the challenge of dynamically adjusting the token generation order via a reinforcement learning framework that optimizes the cumulative predictive V-information, formally defined as I_V(X → Y) = H_V(Y | ∅) - H_V(Y | X), to preferentially solve easier subproblems first. Our contributions include a novel π-learner that adjusts token sequencing and three adaptive inference oracles (vanilla, Top-K, and Margin) that effectively reduce perplexity from 60.0 to 52.0 while preserving token diversity (entropy shifting from 4.8 to 4.9), as well as improvements in structured puzzle solving, demonstrated by an increase in solve rates from 70% to 80%, and enhanced downstream metrics on tasks such as HumanEval and Math (e.g., pass@1 scores improving from 60% to 66%). Experimental validation spans scaling law analyses, where validation NLL drops from approximately +3.0 at 10⁹ FLOPs to -5.0 at 5 × 10⁹ FLOPs across multiple random seed runs, and error imbalance evaluations on L&O-NAE-SAT that reveal latent and observation position errors with means of 0.7976 and 0.9724, respectively. Collectively, these results confirm that adaptive token ordering not only mitigates computational intractability in hard token predictions but also enhances both likelihood-based metrics and generalization performance over fixed ordering strategies.
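The predictive V-information objective above reduces, in a toy setting, to an entropy difference between predictive distributions. The sketch below uses plain Shannon entropy as a stand-in for the function-family entropy H_V, and the next-token distributions are illustrative, not from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a predictive distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def v_information(p_unconditional, p_conditional):
    """I_V(X -> Y) = H_V(Y | empty) - H_V(Y | X): the reduction in predictive
    uncertainty once the predictor may condition on X."""
    return entropy(p_unconditional) - entropy(p_conditional)

# Toy next-token distributions: conditioning on context sharpens the prediction.
p_no_context = [0.25, 0.25, 0.25, 0.25]  # H = 2.0 bits
p_with_context = [0.7, 0.1, 0.1, 0.1]
iv = v_information(p_no_context, p_with_context)  # positive: X is informative
```

Under an easier-first policy, tokens whose I_V is largest are generated first, since the available context reduces their predictive uncertainty the most.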
-
[2510.0059] Adaptive AI Governance: Mitigating Income Inequality through Predictive Analytics and Dynamic Policy Frameworks
The paper addresses the critical issue of AI-induced income inequality, focusing on developing an adaptive AI governance model that integrates real-time data analytics and local economic contexts to mitigate labor market disruptions. As AI technologies rapidly transform global labor markets, they pose a significant risk of job displacement and income disparity, necessitating adaptable governance frameworks. The challenge lies in creating a globally applicable model that accurately reflects diverse economic environments, predicts AI's long-term impacts, and balances innovation with worker protection. Our proposed solution is a sophisticated predictive analytics platform employing machine learning, Monte Carlo simulations, and agent-based modeling to simulate AI adoption scenarios and their effects on labor markets. Experiments utilizing a shallow MLP architecture on the \texttt{ag\_news} dataset demonstrate consistent prediction accuracy, with Mean Absolute Error (MAE) values ranging from 0.2518 to 0.2849, although R-squared scores were negative, indicating limitations in data representation. The main contributions of this study include a novel governance model that anticipates and mitigates AI's socio-economic impacts, offering dynamic policy recommendations tailored to local conditions. This research provides a foundation for future work on enhancing model accuracy and applicability by incorporating more comprehensive datasets and complex architectures.
-
[2510.0065] Enhancing Creative Diversity in Large Language Models Through Structured Seed-Conditioning
This paper addresses the challenge of enhancing creative diversity and originality in large language model (LLM) outputs for open-ended tasks, a critical need in creative industries such as storytelling and content creation. Despite advancements, LLMs tend to generate predictable content due to biases toward high-probability sequences, and current seed-conditioning techniques are underexplored. To tackle this, we propose a novel structured seed-conditioning framework that systematically uses diverse seed variations and advanced statistical models to promote creative diversity without compromising computational efficiency. Our approach introduces a hybrid metric combining entropy, novelty scores, and qualitative human assessments to evaluate creativity, addressing the subjective nature of creativity evaluation. Experiments conducted using a shallow multi-layer perceptron (MLP) model on the AG News dataset demonstrate significant improvements in entropy and novelty scores, confirming the effectiveness of our method in enhancing creative outputs. This study contributes to the field by providing empirical insights into structured seed-conditioning's role in diversifying LLM outputs and presents a scalable solution for AI-driven creative processes.
-
[2510.0079] Causal-Informed Adaptive Learning for Contextual Personalization in Recommendation Systems
In recent years, personalized recommendation systems have become integral to enhancing user experiences on digital platforms, yet challenges remain in effectively integrating causal inference with adaptive learning mechanisms and semantic alignment. Traditional systems predominantly rely on correlation-based models, often overlooking the dynamic causal relationships within user interaction data that could enhance recommendation precision and contextual relevance. This paper addresses these gaps by presenting a novel framework that synergizes causal inference using structural equation models and causal diagrams, adaptive learning algorithms via a refined hybrid multi-armed bandit strategy, and semantic content mapping with advanced natural language processing techniques such as Latent Dirichlet Allocation and BERT-based embeddings. Through this integrated approach, our method dynamically adjusts recommendations to align with user preferences and adapt to context changes. Empirical evaluation demonstrates our method's superiority in achieving higher accuracy and relevance in personalized content delivery compared to existing models. The findings underscore the potential of our framework to significantly improve recommendation cohesion and user satisfaction, marking a substantial advancement in the field of contextual personalization.
-
[2510.0085] AI Mathematician as a Partner in Advancing Mathematical Discovery
Artificial intelligence (AI) has demonstrated impressive progress in mathematical reasoning, yet its integration into the practice of mathematical research remains limited. In this study, we investigate how the AI Mathematician (AIM) system can operate as a research partner rather than a mere problem solver. Focusing on a challenging problem in homogenization theory, we analyze the autonomous reasoning trajectories of AIM and incorporate targeted human interventions to structure the discovery process. Through iterative decomposition of the problem into tractable subgoals, selection of appropriate analytical methods, and validation of intermediate results, we reveal how human intuition and machine computation can complement one another. This collaborative paradigm enhances the reliability, transparency, and interpretability of the resulting proofs, while retaining human oversight for formal rigor and correctness. The approach leads to a complete and verifiable proof, and more broadly, demonstrates how systematic human-AI co-reasoning can advance the frontier of mathematical discovery.
-
[2510.0087] EndoNet: Content-Aware Linear Attention for Endoscopic Video Super-Resolution
Endoscopic video super-resolution (EVSR) seeks to reconstruct high-resolution frames from low-resolution endoscopic video, a task critical for enhancing clinical visualization of fine anatomical details. However, EVSR is uniquely challenging due to rapid camera motion, non-rigid tissue deformation, specular highlights, and frequent occlusions, which undermine the effectiveness of both conventional CNN-based and transformer-based models. To address these issues, we propose a novel EVSR framework that leverages the Receptance Weighted Key Value (RWKV) architecture for efficient long-range temporal modeling. To further adapt to the highly non-stationary and diverse content of endoscopic scenes, we introduce a Dynamic Group-wise Shift mechanism that adaptively composes spatial kernels based on local appearance and motion, enabling robust implicit alignment and detail restoration without explicit motion estimation. Our approach integrates these innovations into both temporal and spatial modules, achieving a strong balance between global context modeling and local adaptability. Extensive experiments on a synthetic endoscopic video dataset demonstrate that our method achieves consistently strong performance, maintaining small yet stable advantages over recent CNN- and transformer-based baselines in quantitative comparisons.
-
[2510.0088] MatEvolve: A Synergistic Symbolic-LLM Agent for Multi-Objective Materials Design
Materials define the eras of human civilization, yet the design of novel materials is fundamentally constrained by the immense chemical space, which renders traditional enumeration-screening methodology computationally prohibitive and inefficient. This paper introduces a paradigm shift towards insight-exploration-validation, enabling an intelligent and evolutionary exploration of material design pathways. To actualize this paradigm, we propose MatEvolve, a synergistic symbolic-LLM agent that reconceptualizes material design as a closed-loop, programmatic evolution task. Central to MatEvolve is a novel symbolic formalism, Material Edit Language, which empowers the agent to perform chemical operations programmatically. The exploration trajectory is directed by a multifaceted guidance strategy, comprising a dynamic knowledge injection mechanism and a two-stage exploration strategy that balances broad exploration and deep optimization. Furthermore, a multi-objective fitness landscape ensures directional and efficient navigational guidance. These integrated strategies contribute to a 32.2% improvement over direct material structure modification. Crucially, comparisons demonstrate that our insight-exploration-validation paradigm outperforms the traditional enumeration-screening approach by 33.6%, highlighting its superior efficacy in navigating vast design spaces.
-
[2510.0089] BasketVision: Benchmarking MLLMs' Grasp of Complex Dynamic Systems
While Multimodal Large Language Models (MLLMs) excel on general visual tasks, their capacity to comprehend complex dynamic systems remains a critical open question. Such systems, governed by physical laws, explicit rules, and multi-agent interactions, form the fabric of the real world. To facilitate a systematic diagnosis of current MLLM limitations, we introduce BasketVision, a new benchmark that leverages professional basketball as a microcosm for these dynamic environments. BasketVision probes model capabilities across seven dimensions, spanning perception, reasoning, and prediction, through 6,000 curated, bilingual questions from professional game data. An automated data generation pipeline underpins the benchmark, ensuring both scalability and fine-grained precision. Our evaluation of 23 leading models reveals a chasm between machine and human cognition: human experts attain 96.34% accuracy, while the premier model, GPT-4o, achieves only 63.15%. The analysis pinpoints spatial reasoning as a persistent bottleneck and uncovers specific patterns of task specialization. BasketVision thus serves as a crucial apparatus for charting the frontiers of MLLMs and steering future work toward more robust reasoning in dynamic visual worlds.
-
[2511.0001] PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
Evaluating the scientific discovery capabilities of large language model based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires specialized benchmarks currently lacking in the landscape. To address this gap, we introduce \textsc{PhysGym}, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. \textsc{PhysGym}'s primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent. This allows researchers to dissect agent performance along axes including the complexity of the problem and the prior knowledge levels. The benchmark comprises a suite of interactive simulations, where agents must actively probe environments, gather data sequentially under constraints and formulate hypotheses about underlying physical laws. \textsc{PhysGym} provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark's utility by presenting results from baseline LLMs, showcasing its ability to differentiate capabilities based on varying priors and task complexity.
-
[2511.0002] Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation
Parameterizing high-fidelity ``digital twins'' of batteries is a critical yet challenging inverse problem that hinders the pace of battery innovation. Prevailing methods formulate this as a black-box optimization (BBO) task, employing algorithms that are sample-inefficient and blind to the underlying physics. In this work, we introduce a new paradigm that reframes the inverse problem as a reasoning task, and present \textsc{Battery-Sim-Agent}, the first framework to deploy a Large Language Model (LLM) agent in a closed loop with a high-fidelity battery simulator. The agent mimics a human scientist's workflow: it interprets rich, multi-modal feedback from the simulator, forms physically-grounded hypotheses to explain discrepancies, and proposes structured parameter updates. On a systematically constructed benchmark suite spanning diverse battery chemistries, operating conditions, and difficulty levels, our agent significantly outperforms strong BBO baselines like Bayesian optimization in identifying accurate parameters. We further demonstrate the framework's capability in complex long-horizon degradation fitting tasks and validate its practical applicability on real-world battery datasets. Our results highlight the promise of LLM-agents as reasoning-based optimizers for scientific discovery and battery parameter estimation.
-
[2511.0006] Multi-Agent Adaptive Variance Reduction Technique for Decentralized Nonsmooth Nonconvex Stochastic Optimization
Decentralized stochastic optimization with nonsmooth objectives and only zeroth-order oracle access arises in federated learning and privacy-sensitive applications, yet existing methods suffer from high variance and dimension-dependent complexity. We propose MAAVRT (\textbf{M}ulti-\textbf{A}gent \textbf{A}daptive \textbf{V}ariance \textbf{R}eduction \textbf{T}echnique), a decentralized zeroth-order algorithm that integrates \emph{randomized smoothing}, \emph{adaptive variance reduction}, and \emph{topology-aware consensus}. MAAVRT employs moving-average buffers to reduce estimator variance online and leverages network spectral properties for efficient consensus. Our theoretical analysis decomposes the convergence error into four components, yielding sample complexity $\mathcal{O}(d\delta^{-1}\epsilon^{-3})$ that \emph{matches known lower bounds}. Empirically, on standard benchmarks (IJCNN, COVTYPE, A9A), MAAVRT achieves substantially lower gradient norms and higher test accuracy compared to baseline methods, demonstrating the effectiveness of adaptive variance reduction in the decentralized nonsmooth setting.
-
[2511.0009] A Pilot Study Evaluating Large Language Models as Reviewers at Academic Conferences
This paper presents a new system for academic peer review that is more objective, efficient, and community-guided. Our system incorporates author-assisted evaluation (Author-AAE) and community-guided review (CGR) into the peer review of AI conferences. This is in contrast to existing approaches that prioritize alternative systems addressing only some of these challenges. Our evaluation uses data from three major AI conferences that used our system and from a survey of reviewers. Their feedback indicates that our system's reviews are superior to single-LLM-based reviews due to their reduced subjectivity and enhanced quality. The reviewers' scores for our system's reviews were significantly higher than for single-LLM-based reviews across multiple metrics: "Reproducibility and Quality" (by 0.427 ± 0.007), "Review Quality" (by 0.265 ± 0.09), and "Alignment between opinion and paper score" (by 0.503 ± 0.090). In addition, we discovered that single-LLM-based reviews are more likely to be rejected by the program committee after author major revisions (on average by 0.182 ± 0.103) and are much more likely to be rejected overall (on average by 0.300 ± 0.124), compared to our system's reviews. These results suggest that our system performs better in reducing the arbitrary nature of the current peer review system and can serve as an inspiration for the scientific community to explore new review systems.
-
[2511.0010] From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery and AI Scientists
Artificial intelligence (AI) is reshaping scientific discovery, evolving from specialized computational tools into autonomous research partners. We position \textit{\textbf{Agentic Science}} as a pivotal stage within the broader \textit{\textbf{AI for Science}} paradigm, where AI systems progress from partial assistance to full scientific agency. Enabled by large language models (LLMs), multimodal systems, and integrated research platforms, agentic AI exhibits capabilities in hypothesis generation, experimental design, execution, analysis, and iterative refinement, behaviors once regarded as uniquely human. This survey offers a \textbf{domain-oriented review} of autonomous scientific discovery across life sciences, chemistry, materials, and physics, synthesizing research progress and advances within each discipline. We unify three previously fragmented perspectives (process-oriented, autonomy-oriented, and mechanism-oriented) through \textbf{a comprehensive framework} that connects foundational capabilities, core processes, and domain-specific realizations. Building on this framework, we (i) trace the evolution of AI for Science, (ii) identify five core capabilities underpinning scientific agency, (iii) model discovery as a dynamic four-stage workflow, (iv) review applications across life sciences, chemistry, materials science, and physics, and (v) synthesize key challenges and future opportunities. This work establishes a domain-oriented synthesis of autonomous scientific discovery and positions Agentic Science as a structured paradigm for advancing AI-driven research.