Papers

Spotlight Papers
  • 2602.0002
    A Survey on Evaluation of Large Language Models
    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie
    Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the societal level for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs.
    👤 Human Survey
  • 2510.0027
    From Knowledge Tree to Knowledge Forest: Harnessing Chemical Understanding with Machine Learning and Artificial Intelligence
    The 2024 Nobel Prizes in Physics and Chemistry, awarded for machine learning (ML) and artificial intelligence (AI) breakthroughs, marked “Year 1 of AI for Science,” underscoring their transformative role in physical sciences. Yet data are not the same as understanding, a distinction central to chemistry, which has long relied on concepts such as bond, aromaticity, and reactivity as scaffolds for understanding and explanation. Building on our recent perspectives (ACS Phys. Chem. Au 2024, 4, 135–142; J. Chem. Theory Comput. 2025, DOI: 10.1021/acs.jctc.5c01299), this article explores how ML/AI can become engines of chemical understanding. We introduce a quintet of chemical knowledge (ontology, epistemology, theory, concept, and understanding) and develop the metaphors of the Knowledge Tree and Knowledge Forest to show how diverse epistemologies interact and recursively enrich one another. Case studies on aromaticity, catalysis, orbital-free density functional theory, and protein folding illustrate how ML features, when interpreted as conceptual roots, yield fruits of understanding. Contrasting multiscale modeling with hierarchical modeling, we argue that ML enables emergent, concept-driven integration across levels. Cultivating this plural and hierarchical ecosystem may guide theoretical chemistry toward its next breakthroughs, resolving Dirac’s dilemma not by brute force but by forests of concepts that transform data into enduring understanding.
    👤 Human Position
    🎯 ICAIS2025 Accepted Paper
  • 2509.0012
    TADT-CSA: Temporal Advantage Decision Transformer with Contrastive State Abstraction for Generative Recommendation
    Xiang Gao, Tianyuan Liu, Yisha Li, Jingxin Liu, Lexi Gao, Xin Li, Haiyang Lu, Liyin Hong
    With the rapid advancement of Transformer-based Large Language Models (LLMs), generative recommendation has shown great potential in enhancing both the accuracy and semantic understanding of modern recommender systems. Compared to LLMs, the Decision Transformer (DT) is a lightweight generative model applied to sequential recommendation tasks. However, DT faces challenges in trajectory stitching, often producing suboptimal trajectories. Moreover, due to the high dimensionality of user states and the vast state space inherent in recommendation scenarios, DT can incur significant computational costs and struggle to learn effective state representations. To overcome these issues, we propose a novel Temporal Advantage Decision Transformer with Contrastive State Abstraction (TADT-CSA) model. Specifically, we combine the conventional Return-To-Go (RTG) signal with a novel temporal advantage (TA) signal that encourages the model to capture both long-term returns and their sequential trend. Furthermore, we integrate a contrastive state abstraction module into the DT framework to learn more effective and expressive state representations. Within this module, we introduce a TA–conditioned State Vector Quantization (TAC-SVQ) strategy, where the TA score guides the state codebooks to incorporate contextual token information. Additionally, a reward prediction network and a contrastive transition prediction (CTP) network are employed to ensure that the state codebook preserves both the reward information of the current state and the transition information between adjacent states. Empirical results on both public datasets and an online recommendation system demonstrate the effectiveness of the TADT-CSA model and its superiority over baseline methods.
    👤 Human Methodology
  • 2511.0006
    Multi-Agent Adaptive Variance Reduction Technique for Decentralized Nonsmooth Nonconvex Stochastic Optimization
    Decentralized stochastic optimization with nonsmooth objectives and only zeroth-order oracle access arises in federated learning and privacy-sensitive applications, yet existing methods suffer from high variance and dimension-dependent complexity. We propose MAAVRT (Multi-Agent Adaptive Variance Reduction Technique), a decentralized zeroth-order algorithm that integrates randomized smoothing, adaptive variance reduction, and topology-aware consensus. MAAVRT employs moving-average buffers to reduce estimator variance online and leverages network spectral properties for efficient consensus. Our theoretical analysis decomposes the convergence error into four components, yielding sample complexity $\mathcal{O}(d\delta^{-1}\epsilon^{-3})$ that matches known lower bounds. Empirically, on standard benchmarks (IJCNN, COVTYPE, A9A), MAAVRT achieves substantially lower gradient norms and higher test accuracy compared to baseline methods, demonstrating the effectiveness of adaptive variance reduction in the decentralized nonsmooth setting.
    🤖 AI Methodology
    🎯 ICAIS2025 Accepted Paper
  • 2510.0004
    A synergistic multi-specialist knowledge reasoning model for molecular science
    Pengfei Liu, Shuang Ge, Jun Tao, Zhixiang Ren
    The rapid evolution of artificial intelligence in molecular science necessitates a shift from data-driven predictions to knowledge-guided reasoning. Existing molecular models are predominantly proprietary, lacking general molecular intelligence and generalizability. To address this, we propose a task-adaptive large reasoning model that integrates molecular scientific logic to emulate the thinking of molecular scientists, with capabilities for reasoning and reflection. Our approach incorporates multi-specialist modules to provide versatile molecular expertise and a chain-of-thought (CoT) framework enhanced by reinforcement learning infused with molecular knowledge, enabling structured and reflective reasoning. The model outperforms over 20 state-of-the-art multi-task large language models (LLMs) across 10 molecular tasks on 47 metrics, including property prediction, molecule generation, and reaction prediction. It achieves a 50.3% improvement over the base model while ensuring interpretability. It can bridge data-driven and knowledge-integrated approaches for intelligent molecular design.
    👤 Human Methodology
  • 2511.0018
    From Virtual Cells to Programmable Humans: Advancing Digital Biology Through Hybrid AI Systems
    The convergence of artificial intelligence and systems biology is giving rise to a new paradigm in biomedical research—AI-powered virtual biological systems. From single-cell simulations to organ-level models and ultimately programmable virtual humans, this digital continuum holds transformative potential for disease modeling, personalized medicine, and therapeutic discovery. In this review, we critically examine the state of the art in AI-driven simulations, including the numerical foundations, multiscale integration strategies, and the emerging class of hybrid models that bridge mechanistic and data-driven approaches. We explore the challenges of validation, uncertainty quantification, and regulatory alignment across simulation scales, with particular focus on the development of simulation accountability frameworks such as SIM-CARDs. Ethical and privacy concerns, including algorithmic bias and data sovereignty in patient-specific models, are also addressed, alongside concrete proposals for governance and federated simulation workflows. Special attention is given to the technical complexity of multiscale modeling, including the integration of mechanistic solvers with neural architectures and the computational resources required for real-time, clinically actionable simulations. We conclude with a translational roadmap for virtual biology that projects validated virtual cells for drug screening by 2030, multi-organ simulations by 2040, and the emergence of programmable virtual humans by 2055. By unifying high-fidelity numerical models with explainable AI, and aligning simulation design with ethical, regulatory, and clinical needs, the field of digital biology is positioned to unlock scalable and trustworthy biomedical innovation.
    🤖 AI Survey
  • 2602.0003
    Hierarchical Scheduling of Aggregated TCL Flexibility for Transactive Energy in Power Systems
    Meng Song, Wei Sun, Yifei Wang, Mohammad Shahidehpour, Zhiyi Li, Ciwei Gao
    This paper investigates a hierarchical approach to the optimal scheduling of flexibility offered as transactive energy by thermostatically controlled loads (TCLs). The two-stage scheduling framework includes the lower stage in which TCLs are aggregated as a virtual battery. The aggregated TCL power can offer the required flexibility for the upper stage with significant impacts on power system scheduling as transactive energy. Comparisons are also made between the virtual battery model of TCLs and a conventional battery model. At the lower stage, a transactive control strategy is also employed to regulate TCLs for preserving the end-user's information privacy. At the upper stage, a transactive energy market is developed in which peer-to-peer trading of the available TCL flexibility is considered among aggregators. Accordingly, TCL scheduling at power system and device levels are coordinated to regulate TCLs in a distributed fashion. The simulation results demonstrate that the scalability concerns of traditionally centralized operations are addressed by the proposed distributed alternative solution. The upper stage transactive energy market allows aggregators to trade energy effectively without any significant concerns for maintaining the information privacy. The results also point out that the lower stage virtual battery model can accurately characterize the TCL flexibility where TCLs can be effectively regulated in the proposed energy trading model.
    👤 Human Application
  • 2510.0018
    Adaptive Evidential Meta-Learning with Hyper-Conditioned Priors for Calibrated ECG Personalisation
    This research addresses a fundamental gap in uncertainty calibration during electrocardiogram (ECG) model personalisation. We propose Adaptive Evidential Meta-Learning, a framework that attaches a lightweight evidential head with hyper-network-conditioned priors to a frozen ECG foundation model. The hyper-network dynamically sets the evidential prior using robust, class-conditional statistics computed from a few patient-specific ECG samples. Trained via a two-stage meta-curriculum, our approach enables rapid adaptation with well-calibrated uncertainty estimates, making it highly applicable for real-world clinical deployment where both prediction accuracy and uncertainty awareness are crucial.
    🤖 AI Methodology
    🎯 ICAIS2025 Accepted Paper
  • 2510.0089
    BasketVision: Benchmarking MLLMs' Grasp of Complex Dynamic Systems
    While Multimodal Large Language Models (MLLMs) excel on general visual tasks, their capacity to comprehend complex dynamic systems remains a critical open question. Such systems, governed by physical laws, explicit rules, and multi-agent interactions, form the fabric of the real world. To facilitate a systematic diagnosis of current MLLM limitations, we introduce BasketVision, a new benchmark that leverages professional basketball as a microcosm for these dynamic environments. BasketVision probes model capabilities across seven dimensions—spanning perception, reasoning, and prediction—through 6,000 curated, bilingual questions from professional game data. An automated data generation pipeline underpins the benchmark, ensuring both scalability and fine-grained precision. Our evaluation of 23 leading models reveals a chasm between machine and human cognition: human experts attain 96.34% accuracy, while the premier model, GPT-4o, achieves only 63.15%. The analysis pinpoints spatial reasoning as a persistent bottleneck and uncovers specific patterns of task specialization. BasketVision thus serves as a crucial apparatus for charting the frontiers of MLLMs and steering future work toward more robust reasoning in dynamic visual worlds.
    👤 Human Methodology
    🎯 ICAIS2025 Accepted Paper
  • 2604.0002
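    The Foundation of Medicine Can Only Be Holism: A Mathematical Proof Based on the Unified Metabolico-Causal Field and Empirical Evidence from Modern Medicine
    Jianbing Zhu
    Reductionist medicine decomposes the human body into isolated organs, cells, and molecules, and attempts to explain disease and construct treatment plans through local mechanisms; its fundamental flaw is that it ignores the holistic causal closure of life as a metabolic unit. Based on the Zhu-Liang unified metabolico-causal field framework, and starting from category theory and information theory, this paper proves that the human body is a multi-level nested system of metabolic units, in which health is the persistence of causal closure and disease is a projective rupture of the causal chain. Drawing on examples from the frontiers of modern medicine (the gut microbiome, tumor immunology, diabetes, heart failure, precision medicine, traditional Chinese medicine, digital twins, and others), it reveals the breakthrough contributions of holism to medicine: unifying Eastern and Western medicine, restructuring therapeutic logic, and guiding the advancement of precision medicine. The final conclusion is that the foundation of medicine can only be holism, and that future medicine must start from the whole, otherwise the causal chain will inevitably break.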
    👤 Human Theoretical
Page 1 of 10 (Total 194 papers)