Papers
Event:
-
2510.0056 Ensemble-Based Bayesian Aggregation with Uncertainty-Guided Clarifications for Multi-Turn Human-LLM Collaboration
Our work addresses the challenge of optimizing long-term multi-turn human-LLM collaboration by introducing an ensemble of Monte Carlo-based reward predictors, Bayesian meta-calibration, and an uncertainty-guided clarification module that dynamically triggers clarifying interactions. In particular, we estimate the conversation-level reward as R*(t|g) = R_ext(t, g) + R_int(t), where R_ext(t, g) quantifies task-specific success (e.g., BLEU scores reaching up to 80% in document editing and unit-test pass rates near 70% in code generation) and R_int(t) incorporates an efficiency penalty defined as -min(λ · TokenCount(t), 1) with λ = 0.01, augmented by an LLM-based interactivity score. Our approach further employs Bayesian linear regression to aggregate the ensemble signals into a unified reward while simultaneously providing an uncertainty metric which, if it exceeds a predefined threshold (e.g., 0.15), triggers an auxiliary clarification round that improves the aggregated outcome. This mechanism is mathematically formulated and empirically validated through improvements such as an increase in accuracy from 73.9% to 79.9% in mathematical problem solving and a rise in ambiguous-dialogue resolution from 80% to 100% in our experiments. Challenges arise from noisy reward estimates and the trade-off between immediate task performance and long-term conversational quality, which we address via extensive ablation studies on window sizes (w ∈ {1, 2, 3}) and Monte Carlo sample counts (S ∈ {3, 5}), as summarized in Table 1 (MediumDocEdit-Chat: BLEU 0.625 → 0.637; BigCodeBench-Chat: unit-test pass rate 0.532 → 0.489; MATH-Chat: accuracy 0.739 → 0.799; Abg-CoQA: macro accuracy/F1 0.8 → 1.0). Overall, this work contributes a robust framework that integrates ensemble learning, uncertainty estimation, and dynamic clarification to enhance the collaborative potential between human users and language models in complex, multi-turn settings.
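The reward decomposition and uncertainty-triggered clarification described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the ensemble standard deviation stands in for the Bayesian-linear-regression posterior uncertainty, and the reward samples and token count are invented.

```python
import numpy as np

LAMBDA = 0.01                 # efficiency-penalty rate from the abstract
UNCERTAINTY_THRESHOLD = 0.15  # example clarification trigger from the abstract

def internal_reward(token_count: int) -> float:
    """Efficiency penalty R_int(t) = -min(lambda * TokenCount(t), 1)."""
    return -min(LAMBDA * token_count, 1.0)

def aggregate_rewards(ensemble_estimates):
    """Aggregate Monte Carlo reward predictors into a mean reward and an
    uncertainty score (here the sample standard deviation, standing in for
    the posterior predictive spread of Bayesian linear regression)."""
    est = np.asarray(ensemble_estimates, dtype=float)
    return est.mean(), est.std(ddof=1)

def should_clarify(uncertainty: float) -> bool:
    """Trigger an auxiliary clarification round when uncertainty is high."""
    return uncertainty > UNCERTAINTY_THRESHOLD

# Invented example: three Monte Carlo estimates of the external task reward,
# combined with the token-count penalty to give R* = R_ext + R_int.
r_ext_samples = [0.62, 0.70, 0.66]
mean_r, unc = aggregate_rewards(r_ext_samples)
total = mean_r + internal_reward(token_count=40)
```

In this sketch a low ensemble spread leaves the turn untouched, while a spread above 0.15 would insert a clarification round before committing to the aggregated reward.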
-
2510.0055 Quantifying the Trade-Offs in Policy Evaluation
This work presents a comprehensive framework for quantifying the trade-off between prediction accuracy and screening access in policy evaluation, where we address the challenge of identifying and targeting the worst-off individuals through rigorous estimation of a policy value function defined as V(α, β, R²) = Φ₂(z_α, z_β; ρ)/β, with z_α = Φ⁻¹(α), z_β = Φ⁻¹(β), and ρ = √R². Our approach introduces the Prediction-Access Ratio (PAR) as a metric to quantify the relative impact of finite improvements in screening thresholds versus enhancements in predictive accuracy, thereby overcoming challenges associated with non-linear sensitivities such as ∂V/∂α ≈ 1.77513 and ∂V/∂R² ≈ 0.61282. We verify our framework using extensive simulation experiments on synthetic datasets in which a complex model's test R² improves from 0.16866 to 0.32661 through residual scaling with δ = 0.1 and the associated empirical policy value V(α, β) increases from 0.70000 to 0.80000; these results are further supported by capacity-gap analyses demonstrating that a minimal additional screening increment, Δα* ≈ 0.0300, can yield gains comparable to those from complex-model enhancements. This integrated strategy provides actionable insights for policy interventions aimed at equalizing access while maintaining efficiency, a pertinent issue given the inherent difficulties arising from the interplay between prediction improvement and screening capacity in heterogeneous populations.
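The policy value function in this abstract can be evaluated numerically. The sketch below assumes ρ = √R² (the correlation implied by the stated R²) and uses plain Monte Carlo as a stand-in for the bivariate normal CDF Φ₂; an exact library CDF would serve equally well. The function name and sample count are ours, not the paper's.

```python
import numpy as np
from statistics import NormalDist

def policy_value(alpha, beta, r_squared, n_samples=400_000, seed=0):
    """Monte Carlo estimate of V(alpha, beta, R^2) =
    Phi_2(z_alpha, z_beta; rho) / beta, with z_alpha = Phi^{-1}(alpha),
    z_beta = Phi^{-1}(beta), and rho = sqrt(R^2): the fraction of the truly
    worst-off (bottom beta) reached when screening the bottom alpha ranked
    by a predictor with accuracy R^2."""
    z_a = NormalDist().inv_cdf(alpha)
    z_b = NormalDist().inv_cdf(beta)
    rho = float(np.sqrt(r_squared))
    rng = np.random.default_rng(seed)
    # Sample from the standard bivariate normal with correlation rho and
    # estimate the joint lower-orthant probability Phi_2(z_a, z_b; rho).
    xy = rng.multivariate_normal([0.0, 0.0],
                                 [[1.0, rho], [rho, 1.0]],
                                 size=n_samples)
    phi2 = np.mean((xy[:, 0] <= z_a) & (xy[:, 1] <= z_b))
    return float(phi2 / beta)
```

Two sanity checks: an uninformative predictor (R² = 0) gives Φ₂ = αβ and hence V = α, and V increases with R², consistent with the positive sensitivity ∂V/∂R² quoted above.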
-
2510.0054 Explorations in Algorithmic Creativity via Next-Token and Multi-Token Approaches
Algorithmic creativity in text generation poses significant challenges in balancing coherence, diversity, and memorization. Our study addresses these challenges by systematically comparing traditional next-token prediction (NTP) with multi-token teacherless prediction (MTP) and discrete diffusion methods (SEDD) across minimal yet representative combinatorial tasks such as Sibling Discovery, Triangle Discovery, Circle Construction, and Line Construction. Our primary objective is to maximize creative output, defined as the fraction of generated samples that satisfy task-specific validity criteria and quantified as ĉ_r = #coherent / #total outputs, and to minimize memorization, observed to drop from 100% under deterministic conditions to near 0% under controlled stochasticity, while diversity is measured by D = |{unique outputs}| / #total outputs, with values reaching up to 1.00 in optimized settings. To achieve these ends, we introduce seed-conditioning and temperature scaling, modeled by the parameter T where T = 0 corresponds to greedy decoding and T > 0 introduces controlled noise following the relation p_noise = min(0.9, α × T) with α varying by method, to guide the output generation process, and we formulate an alignment loss to ensure semantic consistency between the restrictive and adaptive prompts. Extensive experimentation and rigorous ablation studies, as summarized in Table 1 (detailing coherence rates between 50% and 80%, memorization rates dropping from 100% to nearly 0%, and diversity metrics peaking at 1.00), validate that both MTP and SEDD outperform NTP under non-deterministic settings and when augmented with seed-conditioning, demonstrating that our hybrid framework both pushes the boundaries of algorithmic creativity on minimal open-ended tasks and offers a scalable approach to more complex problem domains.
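The three quantities this abstract works with, the coherence rate ĉ_r, the diversity D, and the noise schedule p_noise, reduce to a few lines each; a sketch with function names of our choosing, not the paper's code:

```python
def coherence_rate(outputs, is_valid):
    """c_hat_r = #coherent / #total outputs: fraction of generated samples
    meeting the task-specific validity criteria."""
    return sum(1 for o in outputs if is_valid(o)) / len(outputs)

def diversity(outputs):
    """D = |{unique outputs}| / #total outputs; D = 1.00 means no repeats."""
    return len(set(outputs)) / len(outputs)

def noise_probability(temperature: float, alpha: float) -> float:
    """p_noise = min(0.9, alpha * T): T = 0 recovers greedy decoding,
    larger T injects more controlled noise (alpha is method-dependent)."""
    return min(0.9, alpha * temperature)
```

For example, four samples with one duplicate give D = 0.75, and the cap in the noise schedule keeps p_noise at 0.9 no matter how high T is pushed.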
-
2510.0053 ChatGPT Event Labor Impact Simulation via Two-Stage Dynamic Prompt Tuning
In this work, we propose a scalable framework to simulate the labor-market impacts of the ChatGPT event using a two-stage dynamic prompt-tuning mechanism combined with an LLM-based qualitative classifier; our objective is to operationalize labor-displacement signals (P1) alongside shared prosperity (P3) and detectability (P6) by addressing the inherent challenges of dynamic prompt adaptation and qualitative taxonomy mapping. We tackle the complexity of evolving labor-market signals through gradient-based meta-learning updates, modeled as Δs = α·s_{t-1} + ϵ, and employ a difference-in-differences regression of the form Y_it = β₀ + β₁·Treatment_it + γ_i + δ_t + ε_it to quantify the impact on employment metrics, obtaining a significant negative treatment coefficient of approximately -5.71 (p < 0.001). Our qualitative classifier achieves a robust accuracy of 74.97% in mapping job narratives to six predefined propositions, and supplemental analyses, such as a principal component analysis (PCA) yielding an AI Capacity Index and its near-zero correlation (r ≈ 0.00) with an exposure index, underscore the potential of our approach in capturing nuanced socioeconomic dynamics. Furthermore, experimental validation across an 8-week analytical window demonstrates consistent incremental improvements in prompt-quality scores, with average weekly gains of up to 5%, confirming that our integrated methodology not only enhances transparency and reproducibility but also provides concrete insights into AI-induced labor displacement.
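The two-way fixed-effects difference-in-differences specification quoted in this abstract can be checked on synthetic data. The panel dimensions, noise level, and fixed-effect scales below are invented; the treatment effect is set to the abstract's -5.71 only to show that the dummy-variable regression recovers whatever effect generated the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic panel: N units observed over T periods, with the first half of
# units treated from period T0 onward (Treatment_it = 1).
N, T, T0, true_effect = 200, 8, 4, -5.71
unit_fe = rng.normal(0.0, 2.0, size=N)   # gamma_i
time_fe = rng.normal(0.0, 1.0, size=T)   # delta_t
treated_unit = np.arange(N) < N // 2

rows = []
for i in range(N):
    for t in range(T):
        d = float(treated_unit[i] and t >= T0)
        y = 10.0 + true_effect * d + unit_fe[i] + time_fe[t] \
            + rng.normal(0.0, 0.5)       # epsilon_it
        rows.append((i, t, d, y))

i_idx, t_idx, treat, y = map(np.array, zip(*rows))

# Design matrix: intercept, treatment dummy, then unit and time dummies
# (one dummy dropped from each set to avoid collinearity).
X = np.column_stack(
    [np.ones_like(y), treat]
    + [(i_idx == i).astype(float) for i in range(1, N)]
    + [(t_idx == t).astype(float) for t in range(1, T)]
)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
did_estimate = beta[1]                   # beta_1: the treatment effect
```

Running this, `did_estimate` lands close to the planted -5.71; in the paper's setting the same coefficient is estimated from observed employment metrics rather than a simulated panel.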
-
2510.0052 Conformal Prediction as Bayesian Quadrature for Risk Control
In this paper, we present a novel framework that leverages Bayesian quadrature for conformal prediction to achieve rigorous, data-conditional, distribution-free risk guarantees, addressing the challenge of controlling predictive risk in high-stakes, black-box settings. Our approach constructs an upper bound on the expected loss by integrating over the quantile function of the loss distribution: given calibration losses ℓ_1, ..., ℓ_n, we define the aggregated loss L⁺ = Σ_{i=1}^{n+1} U_i ℓ_(i), with Dirichlet weights (U_1, ..., U_{n+1}) ∼ Dir(1, ..., 1) and ℓ_(n+1) = B, thereby ensuring that the condition Pr(L⁺ ≤ α) ≥ β is met. Our contributions include a principled derivation that recovers well-known conformal methods such as Split Conformal Prediction (SCP) and Conformal Risk Control (CRC) as special cases, while introducing a novel high-posterior-density (HPD) rule that exploits the full posterior of L⁺. We rigorously validate our method on synthetic binomial-loss and heteroskedastic-regression tasks, where experimental results indicate that methods based solely on the posterior mean (CRC) or uniform concentration bounds (RCPS) often yield either overly optimistic or overly conservative decisions, whereas our HPD rule achieves risk control with a zero empirical failure rate and improved utility. For example, in the binomial experiment, while SCP selects an average λ of 0.596 with a 61.6% failure rate, HPD selects λ ≈ 0.970 with a 0% failure rate, and a similar trend is observed in regression tasks, with test risks decreasing from 0.512 for SCP to 0.067 for HPD. These findings, summarized in Table 1, confirm that our Bayesian quadrature reformulation not only provides a more interpretable statistical characterization of conformal risk but also adapts effectively to calibration sample size and confidence-level tuning, offering a robust solution for high-stakes decision-making.
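The aggregated loss L⁺ and the condition Pr(L⁺ ≤ α) ≥ β lend themselves to a direct Monte Carlo sketch. The function name, draw count, and default bound below are our illustrative choices, not the paper's code:

```python
import numpy as np

def risk_control_probability(cal_losses, alpha, B=1.0,
                             n_draws=20_000, seed=0):
    """Monte Carlo estimate of Pr(L+ <= alpha), where
    L+ = sum_{i=1}^{n+1} U_i * l_(i), with (U_1, ..., U_{n+1}) ~ Dir(1,...,1),
    l_(1) <= ... <= l_(n) the sorted calibration losses, and l_(n+1) = B
    an a-priori upper bound on the loss."""
    losses = np.sort(np.asarray(cal_losses, dtype=float))
    losses = np.append(losses, B)                 # append l_(n+1) = B
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(losses.size), size=n_draws)
    L_plus = U @ losses                           # one draw of L+ per row
    return float(np.mean(L_plus <= alpha))
```

A candidate threshold λ would then be accepted when this probability exceeds the desired confidence β; since the Dirichlet weights sum to one, L⁺ always lies between the smallest calibration loss and B.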
-
2510.0051 COMD: Coherent Masked Diffusion
Masked language models (MLMs) have shown promise in natural language processing but struggle to generate coherent text. In this work, we present Coherent Masked Diffusion (CoMD), a novel framework that extends masked language diffusion to learn coherent and incoherent language more efficiently and more effectively. CoMD is built on Masked Language Diffusion (MLD), a recently proposed framework that models text generation as an inverse denoising diffusion process. Unlike MLD, CoMD uses a fixed mask matrix that is independent of the masked-out token and optimizes the probability of coherent generations with a novel coherence loss term, without requiring additional samples per training step. Additionally, CoMD uses a variable time parameter to guide the coherence probability toward the ground-truth coherence probability. Both inference and training computation are constant with respect to the length of the text. Empirically, CoMD outperforms previous methods on multiple coherence benchmarks. Furthermore, CoMD achieves inference speedups of 7.3x and 10.5x over MLD and MDLM, respectively, and is significantly more compute- and parameter-efficient than autoregressive models.
-
2510.0050 U-CAN: User-Guided Clarification for Asking Across Needs
It remains unclear whether and how methods developed in the academic community for asking clarification questions in retrieval or problem-solving can effectively address user needs during human-computer interaction (HCI). In this work, we first propose an Asking Across Needs (AAN) framework to explore the complexities of HCI, including user needs, interaction styles, and interaction types, by building an interaction graph (Pearl, 2009) containing user and LLM actions. We then create a new benchmark, User Clarification for Asking Needs (U-CAN), containing task-oriented and retrieval-related clarification questions that align with real-world HCI scenarios. Specifically, we design new interaction-graph designs and user-guided prompting techniques based on our AAN framework to address multiple user needs not met in existing HCI studies. We find that task-oriented needs are often left unmet and that existing methods show performance gaps between simulated and real-world (enrolled students) settings. We also demonstrate that HCI can be facilitated by interaction graphs for retrieval-related clarification using our proposed interactive graph model.
-
2510.0049 Learning Unnormalized Models with Missing Data via Adversarial Score Matching
Learning unnormalized model parameters is a challenging task that frequently arises in various scientific fields. Score matching is a promising method for learning unnormalized models by estimating the score function. However, score matching faces several practical challenges in real-world applications, including the need for an auxiliary network to estimate the score function, the requirement that the model support sampling, and the difficulty of estimating the score function for high-dimensional data. To address these challenges, we propose adversarial score matching (ASM), an adversarial learning algorithm for learning unnormalized models that does not require an auxiliary network and can be applied to high-dimensional data. We also propose a multilevel Monte Carlo estimator for the score discrepancy, which is computationally more efficient than the traditional importance-sampling estimator. In addition, we demonstrate that ASM is a mode-seeking algorithm, a property observed empirically across a variety of adversarial learning methods. We evaluate the performance of ASM on various unnormalized models and missing-data mechanisms, and demonstrate that ASM outperforms existing score matching methods.
-
2510.0048 Risk Control With Width-Sketching Algorithms
We introduce the notion of width-sketching algorithms, defined as algorithms with provably bounded width (that is, probability of containing the randomness) for the induced coverage set. For algorithms that sketch the width, we prove a novel uniform upper bound and provide an instance where the expected width is twice the optimal width. We then introduce the width-optimality notion and an approximate version termed mean-width optimality, which allows us to derive algorithms with the desired coverage while minimizing the mean width. We provide a high-level perspective on the relationship with depth-sketching algorithms, i.e., algorithms that sketch the depth of the induced sets with probability 1 − α, and show that they provide complementary forms of coverage. Finally, we demonstrate the application of the framework to conformal prediction with Bayesian quadrature.
-
2510.0047 Predictive Need Assessment, Public Service Providers and Inequalities of Labor Market Outcomes
Aid and assistance are key to reducing social inequalities. In the public sector, aid providers face the challenge of distributing resources according to need. In recent years, algorithms for need assessment have become an integral part of public institutions. While need assessment is crucial, its implications for the distribution of aid and assistance in the public sector are still poorly understood. In this work, we investigate how the use of predictive models for need assessment affects the distribution of aid and assistance in the German public employment service. To this end, we develop a synthetic dataset for the “first round” assignment in the German public employment service based on regional data from the State of Bavaria in 2019. Our dataset comprises labels for 85,299 out of 275,889 employed and unemployed, treated and untargeted individuals in 2019; the label indicates whether a person received prioritized status and thus privileged access to government services. We find that the use of predictive models leads to significant resource imbalances and deepens divides along the lines of migration background, education, and gender. These findings highlight important ethical implications for public service providers that rely on need assessment to allocate aid and assistance.
-
2510.0046 Economic Implications of Language Models and Copyright Law
How will language models (LMs) affect future economic progress? Inspired by The Lever of Riches by Mokyr (1992), we argue that the institutions governing LM content generation and usage patterns are critical to answering this question. We contend that, because LM creators have a strong incentive to collect data, train models, and deploy them under intellectual property protection, the all-you-can-consume access to knowledge and creativity they enable has led to rapid acceptance and widespread use, which in turn results in less low-skill employment creation but increased output and greater overall welfare. We provide a theoretical and analytical framework explaining this phenomenon and point to its long-term consequences using empirical evidence.
-
2510.0045 PST-AUTO-AGENT: A Multi-Agent Ensemble Framework for Paper Source Tracing
The escalating volume of scientific literature necessitates efficient methods for identifying the foundational works that significantly inform new research. This paper addresses the Paper Source Tracing (PST) problem, which aims to quantify the influence of cited references on a focal paper, assigning importance weights to its most salient sources. To this end, we propose a novel multi-agent ensemble architecture for PST, integrating Deepseek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro. Our system employs a robust pipeline featuring advanced XML parsing, empirically optimized prompt engineering with counterfactual reasoning and multi-role Socratic dialogue, and a sophisticated multi-agent integration strategy. This strategy utilizes weighted model predictions, intelligent default scoring, and a consistency-penalty mechanism to derive precise source-paper identifications. Our method serves as a strong tuning-free baseline for the PST problem that requires no feature engineering, and it achieves top-ranked results when combined with feature engineering techniques. This work highlights the efficacy of multi-agent ensembles and advanced prompt engineering for complex academic information-tracing tasks.
-
2510.0044 A Comprehensive Survey on Deep Learning
Machine learning and deep learning methodologies have revolutionized computational approaches to complex problem-solving across numerous domains, emerging as transformative technologies in artificial intelligence research [1,7]. This comprehensive review synthesizes current literature to examine the theoretical foundations, methodological advancements, and practical implementations of these techniques, highlighting their evolution from basic machine learning concepts to sophisticated deep neural architectures [2,9]. The analysis demonstrates remarkable success in applications ranging from computer vision and natural language processing to healthcare diagnostics and autonomous systems, with deep learning models achieving unprecedented performance in pattern recognition tasks [3,4,8]. However, significant challenges persist, including the need for massive labeled datasets, computational resource requirements, model interpretability issues, and inherent parameter redundancy in deep architectures [5,6]. The review identifies emerging opportunities in transfer learning, few-shot learning, and explainable AI as promising research directions [10]. By critically evaluating both current limitations and future potential, this analysis provides a structured framework for researchers to advance the field while addressing practical implementation barriers across diverse application domains.
-
2510.0043 Decoupling Openness and Connectivity: Non-Monotonic Effects in LLM-Based Cultural Dynamics
Cultural dynamics in multi-agent systems exhibit a counterintuitive phenomenon: local similarity-based interactions can lead to global fragmentation rather than convergence. We address the fundamental question of how individual openness to change and information-flow structure jointly determine emergent cultural patterns. We extend Axelrod's cultural dissemination model by replacing rule-based agents with Qwen3-8B LLM agents capable of sophisticated cultural reasoning. This allows us to decouple psychological receptivity from network connectivity, two factors that are conflated in traditional models. Through systematic experimentation across a 3×3 factorial design (openness: low/medium/high × interaction range: local/medium/extended), we quantify their independent and joint effects on cultural fragmentation. Our results demonstrate strong main effects: the Cultural Homogeneity Index increases from 0.279 to 0.437 with higher openness (1st-order interactions, +57%), while optimal information flow (3rd order) achieves the highest convergence at 0.489 for high-openness agents, a 75% improvement over the low-openness baseline (0.279). Critically, we uncover a non-monotonic relationship where 3rd-order interactions consistently outperform both 1st- and 5th-order across all openness levels, revealing an optimal balance between exploration and exploitation. Code can be found at https://anonymous.4open.science/r/YuLan-OneSim/.
-
2510.0042 ICIMBench: An In-Context Iterative Molecular Design Benchmark for Large Language Models
Large language models (LLMs) are rapidly transforming scientific discovery, showing promise in hypothesis generation, literature understanding, and symbolic reasoning. Yet their capacity to conduct iterative, feedback-driven molecular design, a hallmark of real-world drug and materials discovery, remains underexplored. Existing benchmarks typically cast molecular tasks as one-shot question answering or text-to-molecule translation, neglecting the iterative propose-evaluate-refine process central to scientific practice. We propose ICIMBench, an In-Context Iterative Molecular Design Benchmark that evaluates LLMs in multi-turn molecular design episodes. In each task, the model receives a natural-language specification, generates candidate molecules in SMILES format, and iteratively refines them based on deterministic oracle feedback from RDKit. We introduce the NumEval metric, the number of evaluations required to satisfy the target, which captures both performance efficiency and robustness under realistic evaluation budgets. Experiments on frontier models (GPT-5, DeepSeek-V3.2, Intern-S1) show that while single-property design is largely solved (NumEval = 1) by state-of-the-art LLMs like GPT-5, multi-property optimization remains a strong challenge, especially under coupled constraints such as lipophilicity and scaffold similarity. ICIMBench provides a principled framework for probing the in-context reasoning and adaptive optimization abilities of LLMs, paving the way toward autonomous, language-driven molecular discovery.
-
2510.0041 Graph neural network for colliding particles with an application to sea ice floe modeling
This paper introduces a novel approach to sea ice modeling using Graph Neural Networks (GNNs), exploiting the natural graph structure of sea ice, where nodes represent individual ice pieces and edges model the physical interactions, including collisions. The concept is developed within a one-dimensional framework as a foundational step. Traditional numerical methods, while effective, are computationally intensive and less scalable. By utilizing GNNs, the proposed model, termed the Collision-captured Network (CN), integrates data assimilation (DA) techniques to learn and predict sea ice dynamics under various conditions. The approach was validated on synthetic data, both with and without observed data points, and the model was found to accelerate the rendering of trajectories without compromising accuracy. This advancement offers a more efficient tool for forecasting in marginal ice zones (MIZ) and highlights the potential of combining machine learning with data assimilation for more effective and efficient modeling.
-
2510.0040 A Fuzzy-based Approach to Predict Human Interaction by Functional Near-Infrared Spectroscopy
In this article, we introduce the fuzzy-logic-based attention mechanism (Fuzzy Attention Layer), a novel computational approach designed to enhance the interpretability and efficacy of neural models in psychological research. The fuzzy attention layer is integrated into a transformer encoder model to analyze complex psychological phenomena from neural signals captured by functional near-infrared spectroscopy (fNIRS). By leveraging fuzzy logic, the fuzzy attention layer learns and identifies interpretable patterns of neural activity. This addresses a significant challenge in using transformers: the lack of transparency in determining which specific brain activities contribute most to particular predictions. Our experimental results, obtained from fNIRS data recorded during social interactions involving handholding, reveal that the fuzzy attention layer not only learns interpretable patterns of neural activity but also enhances model performance. In addition, these patterns provide deeper insights into the neural correlates of interpersonal touch and emotional exchange. The application of our model shows promising potential for understanding the complex aspects of human social behavior and for verifying psychological theories with machine learning algorithms, thereby contributing significantly to the fields of social neuroscience and AI. The presented version is based on the work published in IEEE TFS (2025).
-
2510.0039 Uncertainty Quantification in Machine Learning for Responsible AI
Machine learning and artificial intelligence will be deeply embedded in the intelligent systems humans use to automate tasking, optimize planning, and support decision-making. We present a critical review of uncertainty quantification (UQ) in large language models (LLMs), synthesizing insights from over 80 papers across leading venues (ACL, ASE, NeurIPS, ICML, AAAI, IJCAI, Nature, and others). We introduce UQ-Net, a unified probabilistic framework that combines Bayesian modeling, calibration, conformal prediction, and selective decision rules to disentangle epistemic and aleatoric uncertainty and to support reliable decision thresholds. UQ-Net integrates uncertainty estimates with calibration procedures and anomaly detection to enable safer selective deployment of LLM agents. Through case studies in medical diagnosis and code generation, we demonstrate that UQ-Net improves calibration and reduces predictive error by 15–20% relative to standard baselines. We survey existing evaluation practices and identify critical gaps: misalignment of consistency and entropy with factuality, lack of benchmarks for multi-episode interactions, and inconsistent metrics for calibration and tightness. We advocate for context-aware datasets, standardized metrics, and human-in-the-loop evaluations to better align UQ methods with deployment needs. Our review and proposed framework offer a principled foundation for operationalizing UQ in LLMs, advancing the development of trustworthy, responsible agentic AI for safety-sensitive, real-world applications.
-
2510.0038 The Hitchhiker's Guide to Autonomous Research: A Survey of Scientific Agents
The advancement of LLM-based agents is redefining AI for Science (AI4S) by enabling autonomous scientific research. Prominent LLMs exhibit expertise across multiple domains, catalysing the construction of domain-specialised scientific agents. Nevertheless, the profound epistemic and methodological gaps between AI and the natural sciences still impede the systematic design, training, and validation of these agents. This survey bridges that gap by presenting an exhaustive blueprint for scientific agents, spanning systematic construction methodologies, targeted capability enhancement, and rigorous evaluation. Anchored in the canonical scientific workflow, this paper (i) provides an overview of scientific agents, tracing the development from general-purpose agents to goal-oriented scientific agents and advancing a comprehensive taxonomy that organises existing agents by construction strategy and capability scope, and (ii) introduces a two-tier progressive framework, from constructing scientific agents from scratch to targeted capability enhancement, for realizing autonomous scientific research. It is our aspiration that this survey will serve as guidance for researchers across various domains, facilitating the systematic design of domain-specific scientific agents and stimulating further innovation in AI-driven scientific research. To support long-term progress, we curate a live repository (Awesome_Scientific_Agent: https://github.com/gudehhh666/Awesome_Scientific_Agent.git) that continuously aggregates emerging methods, benchmarks, and best practices.
-
2510.0037 State-Dependent Dynamics Among Apple Stock, Bitcoin, and Gold: Evidence from Rolling Correlations, Connectedness, and Tail Copulas
Mega-cap technology equities, cryptocurrencies, and gold increasingly co-determine modern portfolio outcomes, yet their joint dynamics are regime-dependent and incompletely understood. This paper studies the state-dependent comovement among Apple Inc. (AAPL), Bitcoin (BTC), and gold (XAU) using daily data from 2015 to 2025. We triangulate three complementary lenses: (i) rolling Pearson correlations to trace smooth co-movement and structural shifts; (ii) a Diebold–Yilmaz connectedness framework based on rolling 252-day VARs and generalized forecast-error variance decompositions to identify directional risk transmitters and receivers; and (iii) empirical copulas to estimate lower- and upper-tail dependence at the 5% threshold, contrasting crash versus rally dynamics. We document three core results. First, the AAPL–BTC correlation rose from near zero before 2020 to roughly 0.30 during the COVID-19 period and has remained persistently elevated through 2025, indicating a lasting post-pandemic regime. Second, gold emerges as a net transmitter of shocks while BTC is a net receiver, revising the canonical view of gold as a purely passive safe haven. Third, AAPL–BTC exhibits pronounced asymmetry in the tails, with downside dependence considerably stronger than upside (“crash correlation”). Robustness checks spanning window sizes, VAR lags, alternative dependence metrics, and tail thresholds corroborate these findings. The portfolio implication is that static diversification across tech, crypto, and gold underperforms precisely when insurance is most needed. We advocate regime-aware allocation that monitors connectedness and transmitter identity, budgets explicitly for joint-tail risk, and uses dynamic overlays.
The results also inform macro-prudential monitoring: supervisors should track transmitter rotations and connectedness spikes that presage cross-market stress transmission.
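Two of the three lenses in this abstract, the rolling Pearson correlation and the empirical lower-tail dependence coefficient, reduce to a few lines each. The sketch below uses our own function names and a rank-based empirical copula, not the authors' code; the 252-day window and 5% threshold follow the abstract.

```python
import numpy as np

def rolling_correlation(x, y, window=252):
    """Rolling Pearson correlation of two return series over a trailing
    window of `window` observations; earlier positions are NaN."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    out = np.full(x.size, np.nan)
    for end in range(window, x.size + 1):
        xs, ys = x[end - window:end], y[end - window:end]
        out[end - 1] = np.corrcoef(xs, ys)[0, 1]
    return out

def lower_tail_dependence(x, y, q=0.05):
    """Empirical lower-tail coefficient P(U <= q, V <= q) / q, where U, V
    are rank transforms (empirical copula) of the two series; values near
    1 indicate strong joint-crash behaviour, near q independence."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    u = np.argsort(np.argsort(x)) / (n - 1)   # ranks scaled to [0, 1]
    v = np.argsort(np.argsort(y)) / (n - 1)
    return float(np.mean((u <= q) & (v <= q)) / q)
```

Comparing `lower_tail_dependence(r_aapl, r_btc)` with its upper-tail analogue (apply it to `-x, -y`) is the asymmetry check behind the "crash correlation" finding.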