ICAIS 2025
Full name: The 1st International Conference on AI Scientist
-
2510.0044 – A Comprehensive Survey on Deep Learning
Machine learning and deep learning methodologies have revolutionized computational approaches to complex problem-solving across numerous domains, emerging as transformative technologies in artificial intelligence research [1,7]. This comprehensive review synthesizes current literature to examine the theoretical foundations, methodological advancements, and practical implementations of these techniques, highlighting their evolution from basic machine learning concepts to sophisticated deep neural architectures [2,9]. The analysis demonstrates remarkable success in applications ranging from computer vision and natural language processing to healthcare diagnostics and autonomous systems, with deep learning models achieving unprecedented performance in pattern recognition tasks [3,4,8]. However, significant challenges persist, including the need for massive labeled datasets, computational resource requirements, model interpretability issues, and inherent parameter redundancy in deep architectures [5,6]. The review identifies emerging opportunities in transfer learning, few-shot learning, and explainable AI as promising research directions [10]. By critically evaluating both current limitations and future potential, this analysis provides a structured framework for researchers to advance the field while addressing practical implementation barriers across diverse application domains.
-
2510.0045 – PST-AUTO-AGENT: A Multi-Agent Ensemble Framework for Paper Source Tracing
The escalating volume of scientific literature necessitates efficient methods for identifying foundational works that significantly inform new research. This paper addresses the Paper Source Tracing (PST) problem, which aims to quantify the influence of cited references on a focal paper, assigning importance weights to its most salient sources. To this end, we propose a novel multi-agent ensemble architecture for PST, integrating Deepseek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro. Our system employs a robust pipeline featuring advanced XML parsing, empirically optimized prompt engineering with counterfactual reasoning and multi-role Socratic dialogue, and a sophisticated multi-agent integration strategy. This strategy combines weighted model predictions, intelligent default scoring, and a consistency penalty mechanism to derive precise source-paper identifications. Our method serves as a strong tuning-free baseline for the PST problem that requires no feature engineering, and it achieves top-ranked results when combined with feature-engineering techniques. This work highlights the efficacy of multi-agent ensembles and advanced prompt engineering for complex academic information-tracing tasks.
-
2510.0046 – Economic Implications of Language Models and Copyright Law
How will language models (LMs) affect future economic progress? Inspired by The Lever of Riches (Mokyr, 1992), we argue that the institutions governing LM content generation and usage patterns are critical to answering this question. We contend that, because LM creators have a strong incentive to collect, train, and deploy intellectual property protection, the all-you-can-consume access to knowledge and creativity they enable has led to rapid acceptance and widespread use, which in turn results in reduced low-skill employment creation but increased output and greater overall welfare. We provide a theoretical and analytical framework explaining this phenomenon and point to its long-term consequences using empirical evidence.
-
2510.0047 – Predictive Need Assessment, Public Service Providers and Inequalities of Labor Market Outcomes
Aid and assistance are key to reducing social inequalities. In the public sector, aid providers face the challenge of distributing resources according to need. In recent years, algorithms for need assessment have become an integral part of public institutions. While need assessment is crucial, its implications for the distribution of aid and assistance in the public sector are still poorly understood. In this work, we investigate how the use of predictive models for need assessment impacts the distribution of aid and assistance in the German public employment service. To this end, we develop a synthetic dataset for the “first round” assignment in the German public employment service based on regional data from the State of Bavaria in 2019. Our dataset comprises labels for 85,299 out of 275,889 employed and unemployed, treated and untargeted individuals in 2019. The label indicates whether a person received prioritized status and thus privileged access to government services. We find that the use of predictive models leads to significant resource imbalances and deepens divides along the lines of migration background, education, and gender. These findings highlight important ethical implications for public service providers that rely on need assessment to allocate aid and assistance.
-
2510.0048 – Risk Control With Width-Sketching Algorithms
We introduce the notion of width-sketching algorithms, defined as algorithms with provably bounded width (that is, probability of containing the randomness) for the induced coverage set. For algorithms that sketch the width, we prove a novel uniform upper bound and provide an instance where the width in expectation is twice as large as the optimal width. We then introduce the width-optimality notion and an approximate version termed mean-width optimality, which allows us to derive algorithms with the desired coverage while minimizing the mean width. We provide a high-level perspective on the relationship with depth-sketching algorithms, i.e., algorithms that sketch the depth of the induced sets with probability 1 − α, and show that they provide complementary forms of coverage. Finally, we demonstrate the application of the framework to conformal prediction with Bayesian quadrature.
-
2510.0049 – Learning Unnormalized Models with Missing Data via Adversarial Score Matching
Learning unnormalized model parameters is a challenging task that frequently arises in various scientific fields. Score matching is a promising method to learn unnormalized models by estimating the score function. However, score matching has several practical challenges in real-world applications, including the need for an auxiliary network to estimate the score function, the requirement for the model to support sampling, and the difficulty of estimating the score function for high-dimensional data. To address these challenges, we propose adversarial score matching (ASM), an adversarial learning algorithm for learning unnormalized models, which does not require an auxiliary network and can be applied to high-dimensional data. We also propose a multilevel Monte Carlo estimator for the score discrepancy, which is computationally more efficient than the traditional importance sampling estimator. In addition, we demonstrate that ASM is a mode-seeking algorithm, which has been observed empirically in a variety of adversarial learning methods. We evaluate the performance of ASM on various unnormalized models and missing data mechanisms, and demonstrate that ASM outperforms existing score matching methods.
-
2510.0050 – U-CAN: User-Guided Clarification for Asking Clarification in Asking Across Needs Framework
It is still unclear if and how methods developed in the academic community for asking clarification in retrieval or problem-solving can effectively address user needs during human-computer interaction (HCI). In this work, we first propose an Asking Across Needs (AAN) framework to explore the complexities of HCI, including user needs, interaction styles, and interaction types, by building an interaction graph (Pearl, 2009) containing user and LLM actions. We then create a new benchmark, User-Guided Clarification for Asking Needs (U-CAN), containing task-oriented and retrieval-related asking-clarification tasks that align with real-world HCI scenarios. Specifically, we design new interaction graph structures and user-guided prompting techniques based on our AAN framework to address multiple user needs not met in existing HCI studies. We find that task-oriented needs are often left unmet, and existing methods show performance gaps between simulated and real-world (enrolled students) settings. We also demonstrate that HCI can be facilitated by interaction graphs on retrieval-related asking clarification using our proposed interaction graph model.
-
2510.0051 – COMD: Coherent Masked Diffusion
Masked language models (MLMs) have shown promise in natural language processing, but struggle to generate coherent text. In this work, we present Coherent Masked Diffusion (CoMD), a novel framework that extends Masked Language Diffusion to learn coherent language more efficiently and more effectively. CoMD is built on Masked Language Diffusion (MLD), a recently proposed framework that models text generation as an inverse denoising diffusion process. Unlike MLD, CoMD uses a fixed mask matrix that is independent of the masked-out token and optimizes the probability of coherent generations with a novel coherence loss term, without requiring additional samples per training step. Additionally, CoMD uses a variable time parameter to guide the predicted coherence probability towards the ground-truth coherence probability. Both inference and training computation are constant with respect to the length of the text. Empirically, CoMD outperforms previous methods on multiple coherence benchmarks. Furthermore, CoMD achieves an inference speedup of 7.3x and 10.5x over MLD and MDLM, respectively, and is significantly more compute- and parameter-efficient than autoregressive models.
-
2510.0052 – Conformal Prediction as Bayesian Quadrature for Risk Control
In this paper, we present a novel framework that leverages Bayesian quadrature for conformal prediction to achieve rigorous, data-conditional, and distribution-free risk guarantees, addressing the challenge of controlling predictive risk in high-stakes, black-box settings. Our approach constructs an upper bound on the expected loss by integrating over the quantile function of the loss distribution: given calibration losses ℓ_1, …, ℓ_n, we define the aggregated loss L⁺ = Σ_{i=1}^{n+1} U_i ℓ_(i), with Dirichlet random variables (U_1, …, U_{n+1}) ∼ Dir(1, …, 1) and ℓ_(n+1) = B, thereby ensuring that the condition Pr(L⁺ ≤ α) ≥ β is met. Our contributions include a principled derivation that recovers well-known conformal methods such as Split Conformal Prediction (SCP) and Conformal Risk Control (CRC) as special cases, while introducing a novel high posterior density (HPD) rule that exploits the full posterior of L⁺. We rigorously validate our method on synthetic binomial loss and heteroskedastic regression tasks, where experimental results indicate that methods based solely on the posterior mean (CRC) or uniform concentration bounds (RCPS) often yield either overly optimistic or conservative decisions, whereas our HPD rule achieves risk control with zero empirical failure rate and improved utility. For example, in the binomial experiment, while SCP selects an average λ of 0.596 with a 61.6% failure rate, HPD selects λ ≈ 0.970 with a 0% failure rate, and a similar trend is observed in regression tasks with test risks decreasing from 0.512 for SCP to 0.067 for HPD. These findings, summarized in Table 1, confirm that our Bayesian quadrature reformulation not only provides a more interpretable statistical characterization of conformal risk but also adapts effectively to calibration sample size and confidence level tuning, thus offering a robust solution for high-stakes decision-making.
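The aggregated-loss construction above lends itself to a direct Monte Carlo check. The sketch below is an illustration under our own naming (not the paper's code): it draws Dirichlet weights over the sorted calibration losses, caps them with ℓ_(n+1) = B, and estimates Pr(L⁺ ≤ α).

```python
import numpy as np

def aggregated_loss_prob(cal_losses, B, alpha, n_draws=20000, seed=0):
    """Monte Carlo estimate of Pr(L+ <= alpha), where
    L+ = sum_{i=1}^{n+1} U_i * l_(i), (U_1, ..., U_{n+1}) ~ Dir(1, ..., 1),
    and the sorted calibration losses are capped by l_(n+1) = B."""
    rng = np.random.default_rng(seed)
    # l_(1) <= ... <= l_(n) are the order statistics; append l_(n+1) = B.
    sorted_losses = np.append(np.sort(cal_losses), B)
    U = rng.dirichlet(np.ones(sorted_losses.size), size=n_draws)  # (n_draws, n+1)
    L_plus = U @ sorted_losses                                    # one draw of L+ per row
    return float(np.mean(L_plus <= alpha))
```

As a sanity check: with all calibration losses at zero and B = 1, L⁺ reduces to a single flat-Dirichlet weight, which is Beta(1, n)-distributed, so the estimate should sit near 1 − (1 − α)ⁿ.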
-
2510.0053 – ChatGPT Event Labor Impact Simulation via Two-Stage Dynamic Prompt Tuning
In this work, we propose a scalable framework to simulate the labor market impacts of the ChatGPT event using a two-stage dynamic prompt tuning mechanism combined with an LLM-based qualitative classifier; our objective is to operationalize labor displacement signals (P1) alongside shared prosperity (P3) and detectability (P6) by addressing the inherent challenges of dynamic prompt adaptation and qualitative taxonomy mapping. We tackle the complexity of evolving labor market signals through gradient-based meta-learning updates, modeled as Δs = α s_{t−1} + ε, and employ a Difference-in-Differences regression of the form Y_{it} = β₀ + β₁ Treatment_{it} + γ_i + δ_t + ε_{it} to quantify the impact on employment metrics, notably obtaining a significant negative treatment coefficient of approximately −5.71 (p < 0.001). Our qualitative classifier achieves a robust accuracy of 74.97% in mapping job narratives to six predefined propositions, and supplemental analyses, such as a principal component analysis (PCA) yielding an AI Capacity Index and its near-zero correlation (r ≈ 0.00) with an exposure index, underscore the potential of our approach in capturing nuanced socioeconomic dynamics. Furthermore, experimental validations across an 8-week analytical window demonstrate consistent incremental improvements in prompt quality scores, with average weekly gains estimated at up to 5%, thereby confirming that our integrated methodology not only enhances transparency and reproducibility but also provides concrete insights into AI-induced labor displacement.
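The two-way fixed-effects Difference-in-Differences specification above can be reproduced with ordinary least squares and dummy variables. The helper below is a hypothetical sketch (the function name and drop-first dummy coding are our assumptions, not the paper's code):

```python
import numpy as np

def did_effect(y, treat, unit, time):
    """Estimate beta_1 in Y_it = b0 + b1*Treat_it + gamma_i + delta_t + e_it
    by OLS with unit and time fixed-effect dummies (drop-first coding)."""
    units, times = np.unique(unit), np.unique(time)
    cols = [np.ones_like(y), treat.astype(float)]
    cols += [(unit == u).astype(float) for u in units[1:]]  # gamma_i dummies
    cols += [(time == t).astype(float) for t in times[1:]]  # delta_t dummies
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]  # the treatment coefficient beta_1
```

On a noise-free 2-unit, 2-period toy panel with a built-in treatment effect of −5.71, the regression is exactly identified and recovers β₁ = −5.71.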
-
2510.0054 – Explorations in Algorithmic Creativity via Next-Token and Multi-Token Approaches
Algorithmic creativity in text generation poses significant challenges in balancing coherence, diversity, and memorization. Our study addresses these challenges by systematically comparing traditional next-token prediction (NTP) with multi-token teacherless prediction (MTP) and discrete diffusion methods (SEDD) across minimal yet representative combinatorial tasks such as Sibling Discovery, Triangle Discovery, Circle Construction, and Line Construction. Our primary objective is to maximize creative output, defined as the fraction of generated samples that satisfy task-specific validity criteria and quantified as ĉ_r = #coherent / #total outputs, and to minimize memorization, observed to drop from 100% under deterministic conditions to near 0% when employing controlled stochasticity, while diversity is measured by D = |{unique outputs}| / #total outputs, with values reaching up to 1.00 in optimized settings. To achieve these ends, we introduce seed-conditioning and temperature scaling, modeled by the parameter T, where T = 0 corresponds to greedy decoding and T > 0 introduces controlled noise following the relation p_noise = min(0.9, α × T) with α varying by method, to guide the output generation process, and we formulate an alignment loss to ensure semantic consistency between the restrictive and adaptive prompts. Extensive experimentation and rigorous ablation studies, as summarized in Table 1 (detailing coherence rates between 50% and 80%, memorization rates dropping from 100% to nearly 0%, and diversity metrics peaking at 1.00), validate that both MTP and SEDD outperform NTP under non-deterministic settings and when augmented with seed-conditioning, thereby demonstrating that our hybrid framework not only pushes the boundaries of algorithmic creativity on minimal open-ended tasks but also offers a scalable approach for more complex problem domains.
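The three metrics and the noise schedule above are simple enough to state directly in code. The sketch below uses hypothetical function names and treats generated samples as hashable strings (both our assumptions):

```python
def creativity_metrics(samples, is_valid, train_set):
    """Coherence c_r = #valid/#total, diversity D = #unique/#total,
    and memorization = fraction of samples copied from the training set."""
    n = len(samples)
    coherence = sum(map(is_valid, samples)) / n
    diversity = len(set(samples)) / n
    memorization = sum(s in train_set for s in samples) / n
    return coherence, diversity, memorization

def noise_prob(T, alpha):
    """Noise schedule p_noise = min(0.9, alpha * T); T = 0 is greedy decoding."""
    return min(0.9, alpha * T)
```

For example, four two-character samples with one duplicate give D = 3/4, and at T = 0 the schedule returns 0, recovering deterministic decoding.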
-
2510.0055 – Quantifying the Trade-Offs in Policy Evaluation
This work presents a comprehensive framework for quantifying the trade-off between prediction accuracy and screening access in policy evaluation, where we address the challenge of identifying and targeting the worst-off individuals through the rigorous estimation of a policy value function defined as V(α, β, R²) = Φ₂(z_α, z_β; ρ) / β, with z_α = Φ⁻¹(α), z_β = Φ⁻¹(β), and ρ = √R². Our approach introduces the Prediction-Access Ratio (PAR) as a metric to quantify the relative impact of finite improvements in screening thresholds versus enhancements in predictive accuracy, thereby overcoming challenges associated with non-linear sensitivities such as ∂V/∂α ≈ 1.77513 and ∂V/∂R² ≈ 0.61282. We verify our framework using extensive simulation experiments on synthetic datasets in which a complex model's test R² improves from 0.16866 to 0.32661 through residual scaling with δ = 0.1 and the associated empirical policy value V(α, β) increases from 0.70000 to 0.80000; these results are further supported by capacity gap analyses demonstrating that a minimal additional screening increment, Δα* ≈ 0.0300, can yield gains comparable to those from complex model enhancements. This integrated strategy thereby provides actionable insights for policy interventions aimed at equalizing access while maintaining efficiency, a pertinent issue given the inherent difficulties arising from the interplay between prediction improvement and screening capacity in heterogeneous populations.
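Assuming ρ = √R² (so that ρ² = R², consistent with reading ρ as a correlation), the policy value V(α, β, R²) = Φ₂(z_α, z_β; ρ)/β can be evaluated with elementary quadrature. This is an illustrative reimplementation, not the authors' code; it requires |ρ| < 1 and uses a simple trapezoid rule:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi2(a, b, rho, n=4000):
    """Bivariate normal CDF via the identity
    Phi2(a, b; rho) = int_{-inf}^{a} phi(x) * Phi((b - rho*x)/sqrt(1 - rho^2)) dx,
    approximated by the trapezoid rule on [-8, a]. Requires |rho| < 1."""
    lo, s = -8.0, math.sqrt(1.0 - rho * rho)
    h = (a - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        w = 0.5 if i in (0, n) else 1.0  # trapezoid endpoint weights
        total += w * phi(x) * Phi((b - rho * x) / s)
    return total * h

def inv_Phi(p, tol=1e-10):
    """Inverse normal CDF by bisection (for illustration only)."""
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def policy_value(alpha, beta, R2):
    """V(alpha, beta, R^2) = Phi2(z_alpha, z_beta; rho)/beta with rho = sqrt(R^2)."""
    return Phi2(inv_Phi(alpha), inv_Phi(beta), math.sqrt(R2)) / beta
```

Two sanity checks follow from the definition: at R² = 0 the components are independent, so V collapses to α, and by Slepian's inequality V is increasing in ρ, so raising R² raises V.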
-
2510.0056 – Ensemble-Based Bayesian Aggregation with Uncertainty-Guided Clarifications for Multi-Turn Human-LLM Collaboration
Our work addresses the challenge of optimizing long-term multi-turn human-LLM collaboration by introducing an ensemble of Monte Carlo-based reward predictors, Bayesian meta-calibration, and an uncertainty-guided clarification module that dynamically triggers clarifying interactions. In particular, we estimate the conversation-level reward as R*(t|g) = R_ext(t, g) + R_int(t), where R_ext(t, g) quantifies task-specific success (e.g., BLEU scores reaching up to 80% in document editing and unit test pass rates near 70% in code generation) and R_int(t) incorporates an efficiency penalty defined as −min[λ · TokenCount(t), 1] with λ = 0.01, augmented by an LLM-based interactivity score. Our approach further employs Bayesian linear regression to aggregate the ensemble signals into a unified reward while simultaneously providing an uncertainty metric which, if it exceeds a predefined threshold (e.g., 0.15), triggers an auxiliary clarification round that improves the aggregated outcome. This mechanism is mathematically formulated and empirically validated through improvements such as an increase in accuracy from 73.9% to 79.9% in mathematical problem solving and a resolution of ambiguous dialogue from 80% to 100%. Challenges arise due to noisy reward estimations and the trade-off between immediate task performance and long-term conversational quality, which we address via extensive ablation studies on window sizes (w ∈ {1, 2, 3}) and Monte Carlo sample counts (e.g., S ∈ {3, 5}), as summarized in Table 1 (e.g., MediumDocEdit-Chat: BLEU 0.625 → 0.637, BigCodeBench-Chat: Unit Test Pass Rate 0.532 → 0.489, MATH-Chat: Accuracy 0.739 → 0.799, Abg-CoQA: Macro Accuracy/F1 0.8 → 1.0). Overall, this work contributes a robust framework that integrates ensemble learning, uncertainty estimation, and dynamic clarification to effectively enhance the collaborative potential between human users and language models in complex, multi-turn settings.
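The reward decomposition and the uncertainty-triggered clarification rule above can be sketched as follows. The function names, the use of the ensemble's standard deviation as the uncertainty metric, and the default values are our assumptions, not the paper's implementation:

```python
import statistics

def internal_reward(token_count, lam=0.01):
    """R_int efficiency penalty: -min(lam * TokenCount(t), 1)."""
    return -min(lam * token_count, 1.0)

def total_reward(r_ext, token_count, lam=0.01):
    """Conversation-level reward R*(t|g) = R_ext(t, g) + R_int(t)."""
    return r_ext + internal_reward(token_count, lam)

def needs_clarification(ensemble_preds, threshold=0.15):
    """Trigger a clarifying turn when ensemble disagreement
    (population standard deviation) exceeds the threshold."""
    return statistics.pstdev(ensemble_preds) > threshold
```

With λ = 0.01 the penalty saturates at 100 tokens, so very long turns are penalized by at most 1; widely disagreeing reward predictors (e.g., spread over [0.2, 0.8]) would trip the 0.15 threshold and request clarification.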
-
2510.0057 – Adaptive Prompt-Enhanced Score Matching for Partially Observed Data
Adaptive prompt-enhanced score matching for partially observed data addresses the challenging problem of recovering score functions from datasets with significant missing entries, where traditional imputation methods or naïve score estimators often fail to achieve reliable parameter recovery and structural inference. In our work, we consider both marginal Importance-Weighted (Marg-IW) and marginal Variational (Marg-Var) approaches to estimate the score function, using a surrogate mean squared error loss, where s_θ(x) is the estimated score, computed as −P(x − μ), and s_true(x) = −P_true(x − μ_true), with P_true representing the true precision matrix. This formulation inherently accounts for the missingness mechanism, typically modeled as MCAR with a missing rate of 30%, and is further stabilized via techniques such as log-sum-exp and gradient clipping. Our contributions include the integration of a meta-learning prompt generator, which dynamically selects key hyperparameters (e.g., sample size r ∈ {5, 10, 50}, number of inner-loop steps L, learning rates in {1×10⁻², 5×10⁻³, 1×10⁻³}, and truncation parameters) to optimize convergence behavior across a diverse set of synthetic datasets including multivariate Gaussians, ICA-inspired models, and sparse Gaussian graphical models (GGMs) with star graph structures. Experimental results demonstrate significant improvements: for instance, in the Gaussian experiment the loss decreased from 9.687 at iteration 50 to 0.094 at iteration 300 and the corresponding parameter error decreased from 3.033 to approximately 2.030, while in the GGM case the ROC AUC improved from 0.219 to 0.97, thereby confirming our method's efficacy in both parameter estimation and structure recovery under partial observations. These empirical validations underscore the relevance of adaptive score matching in high-dimensional and complex data regimes, set against the inherent difficulties of handling missing data and ensuring numerical stability in the estimation process, and pave the way for future extensions to accommodate MNAR scenarios and diffusion-based denoising score matching frameworks.
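For the Gaussian case described above, the model score, the true score, and their surrogate MSE can be written directly. This is a minimal sketch under our own naming, assuming batched rows and precision matrices P, P_true:

```python
import numpy as np

def gaussian_score(X, P, mu):
    """Model score s_theta(x) = -P (x - mu), applied row-wise:
    each row x of X maps to -P @ (x - mu), i.e. -(X - mu) @ P.T in batch form."""
    return -(X - mu) @ P.T

def score_mse(X, P, mu, P_true, mu_true):
    """Surrogate loss: mean squared error between the estimated score
    s_theta and the true score s_true over the batch X."""
    d = gaussian_score(X, P, mu) - gaussian_score(X, P_true, mu_true)
    return float(np.mean(np.sum(d * d, axis=1)))
```

By construction the loss is zero exactly when (P, μ) match (P_true, μ_true) on the sampled points, and strictly positive for a mis-scaled precision matrix, which is the quantity the Marg-IW and Marg-Var estimators approximate under missingness.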
-
2510.0058 – Adaptive Inference Strategies for Token-Ordering
Adaptive token-ordering strategies for masked diffusion models (MDMs) and autoregressive models (ARMs) are critical for addressing the inherent imbalance in subproblem difficulties during sequence generation, which becomes increasingly relevant as models scale to complex reasoning tasks. In this work, we tackle the challenge of dynamically adjusting the token generation order via a reinforcement learning framework that optimizes the cumulative predictive V-information, formally defined as I_V(X → Y) = H_V(Y | ∅) − H_V(Y | X), to preferentially solve easier subproblems first. Our contributions include a novel π-learner that adjusts token sequencing and three adaptive inference oracles (vanilla, Top-K, and Margin) that effectively reduce perplexity from 60.0 to 52.0 while preserving token diversity (entropy shifting from 4.8 to 4.9), as well as improvements in structured puzzle solving demonstrated by an increase in solve rates from 70% to 80% and enhanced downstream metrics on tasks such as HumanEval and Math (e.g., pass@1 scores improving from 60% to 66%). Experimental validation spans scaling-law analyses, where validation NLL drops from approximately +3.0 at 10⁹ FLOPs to −5.0 at 5 × 10⁹ FLOPs across multiple random seed runs, and error imbalance evaluations on L&O-NAE-SAT that reveal latent and observation position errors with means of 0.7976 and 0.9724, respectively. Collectively, these results confirm that adaptive token ordering not only mitigates computational intractability in hard token predictions but also enhances both likelihood-based metrics and generalization performance over fixed ordering strategies.
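Of the three adaptive inference oracles named above, the Margin oracle admits a particularly short sketch: decode the masked position whose top-1/top-2 probability gap is largest, i.e. the "easiest" subproblem first. The data layout (a dict of per-position probability lists) is our assumption for illustration:

```python
def margin_oracle(position_probs, masked):
    """Return the masked position with the largest margin between its
    top-1 and top-2 token probabilities (decode easiest positions first)."""
    best_pos, best_margin = None, -1.0
    for pos in masked:
        probs = sorted(position_probs[pos], reverse=True)
        margin = probs[0] - probs[1]  # confidence gap at this position
        if margin > best_margin:
            best_pos, best_margin = pos, margin
    return best_pos
```

A Top-K variant would instead restrict decoding to the K largest-margin positions per step; the vanilla oracle ignores margins and uses a fixed order.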
-
2510.0059 – Adaptive AI Governance: Mitigating Income Inequality through Predictive Analytics and Dynamic Policy Frameworks
The paper addresses the critical issue of AI-induced income inequality, focusing on developing an adaptive AI governance model that integrates real-time data analytics and local economic contexts to mitigate labor market disruptions. As AI technologies rapidly transform global labor markets, they pose a significant risk of job displacement and income disparity, necessitating adaptable governance frameworks. The challenge lies in creating a globally applicable model that accurately reflects diverse economic environments, predicts AI's long-term impacts, and balances innovation with worker protection. Our proposed solution is a sophisticated predictive analytics platform employing machine learning, Monte Carlo simulations, and agent-based modeling to simulate AI adoption scenarios and their effects on labor markets. Experiments utilizing a shallow MLP architecture on the ag_news dataset demonstrate consistent prediction accuracy, with Mean Absolute Error (MAE) values ranging from 0.2518 to 0.2849, although R-squared scores were negative, indicating limitations in data representation. The main contributions of this study include a novel governance model that anticipates and mitigates AI's socio-economic impacts, offering dynamic policy recommendations tailored to local conditions. This research provides a foundation for future work on enhancing model accuracy and applicability by incorporating more comprehensive datasets and complex architectures.
-
2510.0060 – Revolutionizing AI Conference Peer Review: A Bi-Directional Feedback and Rewards Framework
The rapid increase in submissions to AI conferences has led to a crisis in the peer review process, characterized by declining review quality and accountability. This position paper proposes a novel bi-directional feedback mechanism where authors can evaluate the quality of reviews while being safeguarded against retaliation. Coupled with a blockchain-enabled reviewer rewards system, this framework aims to incentivize high-quality reviewing and create an accountability structure that benefits all stakeholders. By allowing authors to provide feedback on reviews and rewarding reviewers with transparent digital credentials, this system fosters a culture of quality and responsibility in the peer review process. We call upon the AI community to engage in this vital conversation and explore these transformative reforms for sustainable peer review practices.
-
2510.0061 – Reimagining AI Safety: A Pro-Worker Framework for the Future of Work
-
2510.0062 – Reimagining AI Safety: A Pro-Worker Framework for the Future of Work
As artificial intelligence, particularly generative AI, continues to reshape labor markets, traditional AI safety frameworks prioritize existential and technical risks while overlooking critical human-centric challenges. This position paper advocates for a paradigm shift towards a pro-worker governance framework that addresses the systemic risks posed by AI to economic justice and labor rights. We identify six key risks, including the exacerbation of technical debt, disproportionate job displacement, and the monopolistic tendencies of AI firms. By proposing actionable interventions such as collective licensing for AI-generated content, mandatory AI watermarking, and robust retraining policies, we aim to enhance the resilience of labor markets. This paper calls for an inclusive dialogue among stakeholders, emphasizing the need for policies that not only safeguard against the adverse effects of AI but also promote shared prosperity. Our framework aims to establish a sustainable relationship between AI and labor that empowers workers and fosters equitable growth.
-
2510.0063 – Dynamic Intent Adaptation for Long-Term Dialogue Systems Using Reinforcement Learning
This paper addresses the challenge of enabling large language models (LLMs) to dynamically discover and adapt to user intents during long-term interactions. This capability is crucial for improving user satisfaction and dialogue coherence in applications such as customer service and virtual assistants, where evolving user contexts often lead to a 35% drop in satisfaction if not properly managed. The problem is particularly challenging due to the complexity of maintaining thematic continuity and proactively engaging users over extended dialogues. We propose a novel framework that integrates reinforcement learning to adapt to user intents, a context-aware dialogue management system to maintain thematic consistency, and a proactive engagement mechanism to predict and address user needs. Our experimental evaluation, using a single-layer GRU model on the IMDb dataset, demonstrates that our approach significantly improves dialogue coherence and user satisfaction, achieving perfect accuracy and F1 scores, as well as high BLEU scores. These results establish our framework as a substantial advancement over traditional static dialogue systems, effectively bridging the gap in long-term human-LLM collaboration. Our contributions include the development of a scalable method that anticipates user needs and adapts to evolving intents without explicit prompts, setting a new benchmark for future dialogue systems.