Papers
-
2603.0003 AI-Enabled Coordinated Digital-Green Transformation of Enterprises: Impact Effects, Mechanisms, and Heterogeneity Evidence
Under firm budget constraints, digital investment and green investment often compete for the same resources, and whether the two can form a synergy depends on whether technological gains can be realized across departments. Based on firm-level panel data, this paper examines the impact of artificial intelligence on coordinated digital-green transformation. Fixed-effects results show an AI coefficient of 0.0158061 (p < 0.01); the instrumental-variable 2SLS estimate is 0.0188387 (p < 0.01), with a first-stage F-statistic of 1864.52. Heterogeneity results indicate that the effect appears mainly in regions with a higher degree of fair competition and in non-leading firms. Mechanism tests show that AI promotes the synergy by alleviating information asymmetry, easing financing constraints, and improving organizational adaptability, while also raising digital risk exposure and the tendency toward financialization. Extended analyses show that AI further enhances green innovation, firm resilience, and total factor productivity. On this basis, the paper proposes a "technology diffusion - competition governance - risk constraint" collaborative governance framework.
-
2511.0013 Revolutionizing AI Conference Peer Review: A Bi-Directional Feedback and Rewards Framework
The rapid increase in submissions to AI conferences has led to a crisis in the peer review process, characterized by declining review quality and accountability. This position paper proposes a novel bi-directional feedback mechanism where authors can evaluate the quality of reviews while safeguarding against retaliation. Coupled with a blockchain-enabled reviewer rewards system, this framework aims to incentivize high-quality reviewing and create an accountability structure that benefits all stakeholders. By allowing authors to provide feedback on reviews and rewarding reviewers with transparent digital credentials, this system fosters a culture of quality and responsibility in the peer review process. We call upon the AI community to engage in this vital conversation and explore these transformative reforms for sustainable peer review practices.
-
2511.0009 A Pilot Study Evaluating Large Language Models as Reviewers at Academic Conferences
This paper presents a new system for academic peer review that is more objective, efficient, and community-guided. Our system incorporates author-assisted evaluation (Author-AAE) and community-guided review (CGR) into the peer review of AI conferences. This is in contrast to existing approaches that prioritize alternative systems that only address some of these challenges. Our evaluation uses data from three major AI conferences that used our system and from a survey of reviewers. Their feedback indicates that our system's reviews are superior to single-LLM-based reviews due to their reduced subjectivity and enhanced quality. The reviewers' scores for our system's reviews were significantly higher than for single-LLM-based reviews across multiple metrics: "Reproducibility and Quality" (by 0.427 ± 0.007), "Review Quality" (by 0.265 ± 0.09), and "Alignment between opinion and paper score" (by 0.503 ± 0.090). In addition, we discovered that single-LLM-based reviews are more likely to be rejected by the program committee after author major revisions (on average by 0.182 ± 0.103) and are much more likely to be rejected overall (on average by 0.300 ± 0.124), compared to our system's reviews. These results suggest that our system performs better in reducing the arbitrary nature of the current peer review system and can serve as an inspiration for the scientific community to explore new review systems.
-
2511.0007 Enhancing Small Language Models with Gradient Noise Injection
Training small language models is challenging due to their limited capacity to capture complex patterns and their susceptibility to overfitting. To address these issues, we investigate gradient noise injection as a regularization strategy, building on prior work while introducing a noise schedule that decays exponentially over training. Unlike existing techniques, our method explicitly controls the trade-off between exploration and stability during optimization. We compare the exponential decay schedule with linear and adaptive variants, demonstrating empirically that the exponential schedule yields superior convergence and generalization. Extensive experiments on diverse text corpora, including shakespeare_char, enwik8, text8, and larger benchmark datasets, show consistent improvements in training dynamics, validation loss, and final performance. We report error bars and statistical significance tests to ensure robustness of the results. Detailed implementation information, including model architectures, hyperparameter settings, dataset sizes, and optimization strategies, is provided to support reproducibility, and we release our code and trained models publicly. Furthermore, we compare gradient noise injection with other regularization methods such as dropout, weight decay, and data augmentation, both in isolation and in combination, revealing complementary effects on training stability and generalization. Finally, we analyze the computational cost of gradient noise injection relative to these baselines, highlighting its practical efficiency in resource-constrained environments. Together, these contributions position gradient noise injection as a theoretically grounded, empirically validated, and computationally practical method for improving the robustness of small language models.
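The core mechanism described above (Gaussian noise added to gradients, with a standard deviation that decays exponentially over training steps) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the hyperparameter values `sigma0` and `decay` are placeholders.

```python
import math
import random

def noise_scale(step, sigma0=0.01, decay=1e-3):
    # Exponentially decaying noise std: sigma_t = sigma0 * exp(-decay * step)
    return sigma0 * math.exp(-decay * step)

def inject_noise(grads, step, rng, sigma0=0.01, decay=1e-3):
    # Add zero-mean Gaussian noise to every gradient component before the
    # optimizer step: large early noise explores, small late noise stabilizes.
    s = noise_scale(step, sigma0, decay)
    return [g + rng.gauss(0.0, s) for g in grads]

rng = random.Random(0)
grads = [0.3, -0.1, 0.05]
noisy_early = inject_noise(grads, step=0, rng=rng)        # visibly perturbed
noisy_late = inject_noise(grads, step=100_000, rng=rng)   # nearly unchanged
```

The exponential schedule is what gives the claimed exploration/stability trade-off a single knob: `decay` sets how quickly the optimizer transitions from noisy exploration to deterministic fine-tuning.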
-
2510.0077 Trust-Enhanced Graph Neural Networks for Transparent Recommendations
In the evolving landscape of digital platforms, the demand for robust recommendation systems is paramount to manage the deluge of user-generated data. Graph Neural Networks (GNNs) have emerged as a potent strategy in recognizing intricate user-item interactions due to their ability to leverage structural data insights. However, existing GNN-based models often overlook trust dynamics, a critical factor in ensuring recommendation reliability and transparency. Despite recognition of trust's potential to address biases and enhance models' interpretability, its integration with sophisticated network-based techniques remains underexplored. Responding to this gap, we propose the Trust-Enhanced Graph-Based Recommendation Model (GTERM), which seamlessly incorporates trust metrics within the GNN framework. GTERM transforms raw interaction data into a trust-augmented graph, employing graph convolutional and attention mechanisms to emphasize trust-enriched interactions, thereby refining recommendation accuracy and transparency. The proposed model achieves notable improvements over baseline methods, as evidenced in diverse experimental evaluations, demonstrating its capacity to deliver more accurate, trustworthy, and interpretable recommendations. Through the integration of trust factors, GTERM fosters user acceptance and enhances system performance by resolving key challenges related to the lack of interpretability and trustworthiness in traditional GNN-based systems.
-
2510.0071 Evaluating the Trade-Off Between Predictive Accuracy and Screening Capacity in Social Welfare Programs
As machine learning becomes integral to government programs aimed at identifying and assisting the most vulnerable populations, this paper investigates whether improving predictive accuracy is more beneficial than expanding screening capacity. We hypothesize that in typical operational conditions, enhancing capacity to reach more individuals will provide greater benefits than marginal gains in prediction accuracy. We introduce the Prediction-Access Ratio (PAR) to quantify this trade-off, guiding policymakers on when to invest in better models versus expanding access. Utilizing both mathematical modeling and a case study on long-term unemployment among German jobseekers, we demonstrate that expanding screening capacity generally leads to improved identification of the worst-off. Our findings empower policymakers with actionable insights, enabling more effective allocation of resources in equity-driven contexts.
-
2510.0069 Exploring Creative Limits of Language Models through Multi-Token Prediction and Seed-Conditioning
This research introduces a controlled set of minimal algorithmic tasks that evaluate the creative limits of large language models (LLMs). These tasks require a stochastic planning step that either discovers novel connections in knowledge graphs or constructs new patterns, simulating open-ended real-world challenges. We propose that traditional next-token learning is myopic, whereas multi-token prediction (MTP) approaches, such as teacherless training and diffusion models, excel in producing diverse and original outputs. Our novel seed-conditioning technique, which introduces randomness at the input layer, is presented as an effective method to elicit creativity without sacrificing coherence, performing comparably to existing output-layer temperature sampling. This study aims to provide a principled framework for assessing the creative capabilities of LLMs and advocates for a shift away from conventional next-token learning paradigms.
-
2510.0065 Enhancing Creative Diversity in Large Language Models Through Structured Seed-Conditioning
This paper addresses the challenge of enhancing creative diversity and originality in large language model (LLM) outputs for open-ended tasks, a critical need in creative industries such as storytelling and content creation. Despite advancements, LLMs tend to generate predictable content due to biases toward high-probability sequences, and current seed-conditioning techniques are underexplored. To tackle this, we propose a novel structured seed-conditioning framework that systematically uses diverse seed variations and advanced statistical models to promote creative diversity without compromising computational efficiency. Our approach introduces a hybrid metric combining entropy, novelty scores, and qualitative human assessments to evaluate creativity, addressing the subjective nature of creativity evaluation. Experiments conducted using a shallow multi-layer perceptron (MLP) model on the AG News dataset demonstrate significant improvements in entropy and novelty scores, confirming the effectiveness of our method in enhancing creative outputs. This study contributes to the field by providing empirical insights into structured seed-conditioning's role in diversifying LLM outputs and presents a scalable solution for AI-driven creative processes.
-
2510.0053 ChatGPT Event Labor Impact Simulation via Two-Stage Dynamic Prompt Tuning
In this work, we propose a scalable framework to simulate the labor market impacts of the ChatGPT event using a two-stage dynamic prompt tuning mechanism combined with an LLM-based qualitative classifier; our objective is to operationalize labor displacement signals (P1) alongside shared prosperity (P3) and detectability (P6) by addressing the inherent challenges of dynamic prompt adaptation and qualitative taxonomy mapping. We tackle the complexity of evolving labor market signals through gradient-based meta-learning updates, modeled as Δs = α·s_{t−1} + ε, and employ a Difference-in-Differences regression of the form Y_it = β0 + β1·Treatment_it + γ_i + δ_t + ε_it to quantify the impact on employment metrics, notably obtaining a significant negative treatment coefficient of approximately −5.71 (with p < 0.001). Our qualitative classifier achieves a robust accuracy of 74.97% in mapping job narratives to six predefined propositions, and supplemental analyses—such as a principal component analysis (PCA) yielding an AI Capacity Index and its near-zero correlation (r ≈ 0.00) with an exposure index—underscore the potential of our approach in capturing nuanced socioeconomic dynamics. Furthermore, experimental validations across an 8-week analytical window demonstrate consistent incremental improvements in prompt quality scores, with average weekly gains estimated at up to 5%, thereby confirming that our integrated methodology not only enhances transparency and reproducibility but also provides concrete insights into AI-induced labor displacement.
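The Difference-in-Differences regression above identifies the treatment coefficient β1 as the change in the treated group minus the change in the control group, since common trends and group fixed effects cancel. A minimal numeric illustration (the group means below are invented; only the −5.71 effect mirrors the reported coefficient):

```python
# Illustrative 2x2 difference-in-differences with hypothetical group means.
true_effect = -5.71    # matches the reported treatment coefficient
common_trend = 1.5     # trend shared by both groups (made up)

pre  = {"treated": 10.0, "control": 8.0}
post = {"treated": pre["treated"] + common_trend + true_effect,
        "control": pre["control"] + common_trend}

# beta_1 = (change in treated) - (change in control): the shared trend and
# the group-level differences cancel, leaving only the treatment effect.
beta_1 = (post["treated"] - pre["treated"]) - (post["control"] - pre["control"])
```

In a full panel setting, the same β1 would come out of an OLS fit of Y_it on the treatment indicator plus unit (γ_i) and time (δ_t) dummies; the 2x2 version above is the special case with one unit per group and two periods.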
-
2510.0047 Predictive Need Assessment, Public Service Providers and Inequalities of Labor Market Outcomes
Aid and assistance are key to reducing social inequalities. In the public sector, aid providers face the challenge of distributing resources according to need. In recent years, algorithms for need assessment have become an integral part of public institutions. While need assessment is crucial, its implications for the distribution of aid and assistance in the public sector are still poorly understood. In this work, we investigate how the use of predictive models for need assessment impacts the distribution of aid and assistance in the German public employment service. To this end, we develop a synthetic dataset for the "first round" assignment in the German public employment service based on regional data from the State of Bavaria in 2019. Our dataset comprises labels for 85,299 out of 275,889 employed and unemployed, treated and untargeted individuals in 2019. The label indicates whether a person received a prioritized status and thus privileged access to government services. We find that the use of predictive models leads to significant resource imbalances and deepens the divides along the lines of migration background, education, and gender. These findings highlight important ethical implications for public service providers that rely on need assessment for allocating aid and assistance.
-
2510.0043 Decoupling Openness and Connectivity: Non-Monotonic Effects in LLM-Based Cultural Dynamics
Cultural dynamics in multi-agent systems exhibit a counterintuitive phenomenon: local similarity-based interactions can lead to global fragmentation rather than convergence. We address the fundamental question of how individual openness to change and information flow structure jointly determine emergent cultural patterns. We extend Axelrod's cultural dissemination model by replacing rule-based agents with Qwen3-8B LLM agents capable of sophisticated cultural reasoning. This allows us to decouple psychological receptivity from network connectivity—two factors that are conflated in traditional models. Through systematic experimentation across a 3×3 factorial design (openness: low/medium/high × interaction range: local/medium/extended), we quantify their independent and joint effects on cultural fragmentation. Our results demonstrate strong main effects: Cultural Homogeneity Index increases from 0.279 to 0.437 with higher openness (1st order interactions, +57%), while optimal information flow (3rd order) achieves the highest convergence at 0.489 for high openness agents—representing 75% improvement over low openness baseline (0.279). Critically, we uncover a non-monotonic relationship where 3rd-order interactions consistently outperform both 1st and 5th-order across all openness levels, revealing an optimal balance between exploration and exploitation. Code can be found at https://anonymous.4open.science/r/YuLan-OneSim/.
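For readers unfamiliar with the rule-based baseline being extended, one interaction step of Axelrod's classic cultural dissemination model can be sketched as follows. This is the standard model, not the paper's LLM-agent variant; the grid size, feature count, and trait count are illustrative.

```python
import random

def similarity(a, b):
    # Fraction of cultural features on which two agents agree.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def axelrod_step(agents, neighbors, rng):
    # One interaction: pick a random agent and one of its neighbors; with
    # probability equal to their similarity, the agent copies a trait on
    # which the two currently disagree.
    i = rng.randrange(len(agents))
    j = rng.choice(neighbors[i])
    s = similarity(agents[i], agents[j])
    if 0 < s < 1 and rng.random() < s:
        diffs = [k for k in range(len(agents[i])) if agents[i][k] != agents[j][k]]
        k = rng.choice(diffs)
        agents[i][k] = agents[j][k]

rng = random.Random(7)
# 4 agents on a ring, 3 cultural features, traits drawn from {0, 1, 2}
agents = [[rng.randrange(3) for _ in range(3)] for _ in range(4)]
neighbors = {i: [(i - 1) % 4, (i + 1) % 4] for i in range(4)}
for _ in range(500):
    axelrod_step(agents, neighbors, rng)
```

The similarity-gated copying rule is exactly what the abstract flags as conflating receptivity and connectivity: the same `neighbors` structure fixes both who can interact and how influence flows, which the LLM-agent design separates into openness and interaction range.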
-
2510.0042 ICIMBench: An In-Context Iterative Molecular Design Benchmark for Large Language Models
Large language models (LLMs) are rapidly transforming scientific discovery, showing promise in hypothesis generation, literature understanding, and symbolic reasoning. Yet, their capacity to conduct iterative, feedback-driven molecular design---a hallmark of real-world drug and materials discovery---remains underexplored. Existing benchmarks typically cast molecular tasks as one-shot question-answering or text-to-molecule translation, neglecting the iterative propose-evaluate-refine process central to scientific practice. We propose ICIMBench, an In-Context Iterative Molecular Design Benchmark that evaluates LLMs in multi-turn molecular design episodes. In each task, the model receives a natural-language specification, generates candidate molecules in SMILES format, and iteratively refines them based on deterministic oracle feedback from RDKit. We introduce the NumEval metric---the number of evaluations required to satisfy the target---which captures both performance efficiency and robustness under realistic evaluation budgets. Experiments on frontier models (GPT-5, DeepSeek-V3.2, Intern-S1) show that while single-property design is largely solved (NumEval = 1) by state-of-the-art LLMs like GPT-5, multi-property optimization remains a strong challenge, especially under coupled constraints such as lipophilicity and scaffold similarity. ICIMBench provides a principled framework for probing the in-context reasoning and adaptive optimization abilities of LLMs, paving the way toward autonomous, language-driven molecular discovery.
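The propose-evaluate-refine loop and the NumEval metric described above can be sketched generically: count oracle calls until the specification is met, within a fixed budget. The toy oracle and proposer below are hypothetical stand-ins (a real episode would propose SMILES strings and query RDKit).

```python
def num_eval(propose, oracle, budget=50):
    # NumEval: number of oracle evaluations until the target is satisfied;
    # None if the evaluation budget is exhausted first.
    history = []
    for n in range(1, budget + 1):
        candidate = propose(history)
        ok, feedback = oracle(candidate)
        if ok:
            return n
        history.append((candidate, feedback))
    return None

# Toy task standing in for a molecular property target (hypothetical):
target = 7

def oracle(c):
    return c == target, ("too low" if c < target else "too high")

def propose(history):
    # Naive enumerating proposer; a real system would condition on feedback.
    return len(history)
```

Here `num_eval(propose, oracle)` returns 8 (candidates 0 through 7), and a budget of 5 returns None, mirroring how the metric penalizes both inefficient proposers and tight evaluation budgets.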
-
2510.0037 State-Dependent Dynamics Among Apple Stock, Bitcoin, and Gold: Evidence from Rolling Correlations, Connectedness, and Tail Copulas
Mega-cap technology equities, cryptocurrencies, and gold increasingly co-determine modern portfolio outcomes, yet their joint dynamics are regime-dependent and incompletely understood. This paper studies the state-dependent comovement among Apple Inc. (AAPL), Bitcoin (BTC), and gold (XAU) using daily data from 2015 to 2025. We triangulate three complementary lenses: (i) rolling Pearson correlations to trace smooth co-movement and structural shifts; (ii) a Diebold–Yilmaz connectedness framework based on rolling 252-day VARs and generalized forecast-error variance decompositions to identify directional risk transmitters and receivers; and (iii) empirical copulas to estimate lower- and upper-tail dependence at the 5% threshold, contrasting crash versus rally dynamics. We document three core results. First, the AAPL–BTC correlation rose from near zero before 2020 to roughly 0.30 during the COVID-19 period and has remained persistently elevated through 2025, indicating a lasting post-pandemic regime. Second, gold emerges as a net transmitter of shocks while BTC is a net receiver, revising the canonical view of gold as a purely passive safe haven. Third, AAPL–BTC exhibits pronounced asymmetry in the tails, with downside dependence considerably stronger than upside ("crash correlation"). Robustness checks spanning window sizes, VAR lags, alternative dependence metrics, and tail thresholds corroborate these findings. The portfolio implication is that static diversification across tech, crypto, and gold underperforms precisely when insurance is most needed. We advocate regime-aware allocation that monitors connectedness and transmitter identity, budgets explicitly for joint-tail risk, and uses dynamic overlays.
The results also inform macro-prudential monitoring: supervisors should track transmitter rotations and connectedness spikes that presage cross-market stress transmission.
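The empirical tail-dependence estimator referenced in lens (iii) is simple to state: transform each series to ranks and measure how often both fall in the same extreme quantile, normalized by the threshold. A minimal sketch (the series here are synthetic; a real analysis would use daily returns):

```python
def ranks01(v):
    # Map each value to its normalized rank in (0, 1]: a cheap empirical CDF.
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0.0] * len(v)
    for rank, idx in enumerate(order, 1):
        r[idx] = rank / len(v)
    return r

def tail_dependence(x, y, q=0.05, lower=True):
    # Empirical tail-dependence estimate at threshold q on the rank scale:
    # lower tail ("crash correlation"): P(U <= q and V <= q) / q.
    rx, ry = ranks01(x), ranks01(y)
    if lower:
        joint = sum(u <= q and v <= q for u, v in zip(rx, ry))
    else:
        joint = sum(u > 1 - q and v > 1 - q for u, v in zip(rx, ry))
    return joint / (q * len(x))

x = [float(i) for i in range(200)]
lam_co = tail_dependence(x, x)                    # co-moving pair: crash together
lam_anti = tail_dependence(x, list(reversed(x)))  # opposite moves: no joint crashes
```

Comparing the lower-tail estimate against its upper-tail counterpart at the same 5% threshold is what reveals the asymmetry the paper reports for AAPL–BTC.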
-
2510.0023 Robust Zero-Shot NER for Crises via Iterative Knowledge Distillation and Confidence-Gated Induction
This research presents a comprehensive diagnostic study of confidence-gated iterative induction for zero-shot Named Entity Recognition (NER) in crisis scenarios. While existing approaches struggle to adapt to novel disaster lexicons without manually curated resources, we investigate whether iterative knowledge distillation can overcome these limitations. Our framework leverages a pretrained language model to extract high-recall entity candidates, then iteratively distills domain knowledge through a self-correcting loop that uses high-confidence seeds to induce micro-gazetteers and syntactic rules. Comprehensive evaluations on synthetic crisis data reveal that the framework maintains a constant zero-shot F1-score of approximately 0.295 across all experimental configurations, demonstrating that the iterative mechanism provides no measurable improvement over baseline approaches. This negative result offers valuable diagnostic insights into the fundamental challenges of adaptive NER in dynamic crisis domains, including confidence threshold calibration difficulties, clustering algorithm limitations, and error propagation risks. The findings provide a cautionary tale for researchers working on adaptive NER systems and establish a foundation for future research on more robust zero-shot approaches in crisis scenarios.
-
2510.0002 Enhancing Small Language Models with Gradient Noise Injection
Training small language models is challenging due to their limited capacity to capture complex patterns and their susceptibility to overfitting. To address these issues, we investigate gradient noise injection as a regularization strategy, building on prior work while introducing a noise schedule that decays exponentially over training. Unlike existing techniques, our method explicitly controls the trade-off between exploration and stability during optimization. We compare the exponential decay schedule with linear and adaptive variants, demonstrating empirically that the exponential schedule yields superior convergence and generalization. Extensive experiments on diverse text corpora, including shakespeare_char, enwik8, text8, and larger benchmark datasets, show consistent improvements in training dynamics, validation loss, and final performance. We report error bars and statistical significance tests to ensure robustness of the results. Detailed implementation information, including model architectures, hyperparameter settings, dataset sizes, and optimization strategies, is provided to support reproducibility, and we release our code and trained models publicly. Furthermore, we compare gradient noise injection with other regularization methods such as dropout, weight decay, and data augmentation, both in isolation and in combination, revealing complementary effects on training stability and generalization. Finally, we analyze the computational cost of gradient noise injection relative to these baselines, highlighting its practical efficiency in resource-constrained environments. Together, these contributions position gradient noise injection as a theoretically grounded, empirically validated, and computationally practical method for improving the robustness of small language models.
-
2509.0010 The Promotive Effect of 2,4-Epibrassinolide on Quinoa Seedling Growth under Saline-Alkali Stress
This study explores the mechanism by which exogenous 2,4-epibrassinolide (EBR) regulates the tolerance of quinoa seedlings to saline-alkali stress, providing a theoretical basis for improving quinoa's saline-alkali tolerance and yield. Using the cultivar "Longli No. 1" as the test material, we examined the effects of exogenous EBR on seedling growth, chlorophyll, osmotic adjustment, antioxidant enzymes, and BR biosynthesis and signal transduction genes under saline, alkaline, and mixed saline-alkali stress. The results show that under saline-alkali treatment, seedling leaves wilted and yellowed, and plant height, fresh weight, and chlorophyll (Chl) content decreased significantly, while malondialdehyde (MDA) content, relative electrical conductivity (REC), proline (Pro), and soluble sugar (SS) content increased significantly. Spraying EBR under stress alleviated leaf wilting and curling, with plant height and fresh weight increasing by an average of 10% and 29%, respectively. The alleviating effect was stronger under the alkaline and mixed saline-alkali treatments, significantly increasing Chl, Pro, and SS contents and SOD, POD, and CAT activities while reducing MDA content and REC; the BR signal transduction genes cqBAK1 and CYP90B1 were up-regulated. In summary, EBR improves the saline-alkali tolerance of quinoa through the coordinated action of osmotic adjustment, the antioxidant system, and BR signal transduction in quinoa seedlings under saline-alkali stress.
-
2509.0009 A Study on the Mechanism of Cultivating Undergraduate Students' Scientific and Technological Innovation Interests Driven by Artificial Intelligence from the Perspective of New Quality Productivity
Against the backdrop of the accelerating development of new quality productivity, cultivating high-caliber talent with innovative spirit and research ability has become a core mission of higher education. Drawing on the technology acceptance model, self-determination theory, and constructivist learning theory, this study builds a theoretical framework of "AI technology characteristics → learning experience → innovation interest" to examine in depth the mechanism by which AI technology cultivates undergraduates' interest in scientific and technological innovation. Stratified random sampling yielded 324 valid questionnaires, and structural equation modeling was used to test the theoretical hypotheses. The results show that: (1) AI technology characteristics have a significant positive effect on learning experience (β = 0.346, p < 0.001); (2) learning experience has a significant positive effect on innovation interest (β = 0.279, p < 0.001); (3) learning experience fully mediates the relationship between AI technology characteristics and innovation interest, with the mediation effect accounting for 69.2% of the total effect; (4) significant differences exist across disciplines, with the effects of AI application most pronounced among medical and STEM students. The findings reveal the underlying mechanism by which AI technology fosters interest in innovation and provide theoretical guidance and practical pathways for cultivating innovative talent in the context of new quality productivity.
-
2508.0002 AI-Generated Text is Non-Stationary: Detection via Temporal Tomography
The field of AI-generated text detection has evolved from supervised classification to zero-shot statistical analysis. However, current approaches share a fundamental limitation: they aggregate token-level measurements into scalar scores, discarding positional information about where anomalies occur. Our empirical analysis reveals that AI-generated text exhibits significant non-stationarity—statistical properties vary by 73.8% more between text segments compared to human writing. This discovery explains why existing detectors fail against localized adversarial perturbations that exploit this overlooked characteristic. We introduce Temporal Discrepancy Tomography (TDT), a novel detection paradigm that preserves positional information by reformulating detection as a signal processing task. TDT treats token-level discrepancies as a time-series signal and applies Continuous Wavelet Transform to generate a two-dimensional time-scale representation, capturing both the location and linguistic scale of statistical anomalies. On the RAID benchmark, TDT achieves 0.855 AUROC (7.1% improvement over the best baseline). More importantly, TDT demonstrates robust performance on adversarial tasks, with 14.1% AUROC improvement on HART Level paraphrasing attacks. Despite its sophisticated analysis, TDT maintains practical efficiency with only 13% computational overhead. Our work establishes non-stationarity as a fundamental characteristic of AI-generated text and demonstrates that preserving temporal dynamics is essential for robust detection.
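The non-stationarity claim above (statistical properties varying between text segments) can be made concrete with a much simpler statistic than the paper's wavelet pipeline: split a token-level score series into windows and measure the dispersion of per-window means. This sketch is not TDT itself; the window size and the toy series are placeholders.

```python
import math

def segment_variation(scores, window=16):
    # Between-segment dispersion of a token-level score series: the standard
    # deviation of per-window means. A stationary series scores near zero;
    # a localized anomaly inflates the value.
    means = [sum(scores[i:i + window]) / window
             for i in range(0, len(scores) - window + 1, window)]
    mu = sum(means) / len(means)
    return math.sqrt(sum((m - mu) ** 2 for m in means) / len(means))

flat = [0.5] * 64                  # stationary: identical statistics everywhere
bursty = [0.1] * 32 + [0.9] * 32   # anomaly localized in the second half
```

A scalar aggregate (the overall mean is 0.5 in both cases) cannot tell these two series apart, which is exactly the limitation the abstract attributes to existing detectors; a position-aware statistic can.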
-
2508.0001 The Other Side of Foundation Models for Reinforcement Learning: Hacking Rewards with Vision-Language Models
Recent studies have explored the integration of Vision Language Models (VLMs) and Reinforcement Learning (RL) to tackle complex decision-making tasks. By leveraging the zero-shot captioning capabilities of pre-trained VLMs, an agent can be trained to maximize rewards generated through text prompts. Despite the promise of these recent advances, we reveal a potentially significant limitation: generated rewards are susceptible to hacking. This means that an agent can exploit the generated rewards within its environment, inadvertently ending up with poor performance under the true rewards. To illustrate this, we conduct experiments across six distinct environments that span both visual and state inputs, as well as manipulation and navigation tasks. Notably, our findings demonstrate that reward hacking is prevalent in all these setups. Given the lack of prior research on hacking in the context of rewards generated by VLMs for RL agents, we provide a comprehensive analysis of the root cause of this phenomenon and discuss potential mitigation strategies. Our findings underscore the need for increased vigilance when deploying such methods in real-world applications.
-
2505.0002 World GPT: An Auto-Regressive World Model for Reinforcement Learning
Reinforcement learning (RL) agents can significantly benefit from learning an internal world model to predict future observations, which can then be used to train a policy more efficiently. We introduce World GPT, an auto-regressive world model that combines a semantic prior with a quantized latent space to capture complex environments more accurately and efficiently. In contrast to prior approaches, World GPT does not require any re-configuration of the model to generate multiple future frames. Instead, it can fully benefit from the latent space of a pre-trained VQ-GAN model, which can be trained independently of the RL task. Our experiments in the Atari 100K benchmark show that World GPT outperforms prior model-based approaches in terms of data efficiency and planning abilities in complex environments while reducing computational costs. Finally, we demonstrate that World GPT's generation capabilities open up exciting new possibilities for exploration and real-world applications such as training free-form interactive agents.