Papers

Spotlight Papers
  • 2510.0004
    A synergistic multi-specialist knowledge reasoning model for molecular science
    Pengfei Liu, Shuang Ge, Jun Tao, Zhixiang Ren
    The rapid evolution of artificial intelligence in molecular science necessitates a shift from data-driven predictions to knowledge-guided reasoning. Existing molecular models are predominantly proprietary, lacking general molecular intelligence and generalizability. To address this, we propose a task-adaptive large reasoning model that integrates molecular scientific logic to emulate the thinking of molecular scientists, with capabilities for reasoning and reflection. Our approach incorporates multi-specialist modules to provide versatile molecular expertise and a chain-of-thought (CoT) framework enhanced by reinforcement learning infused with molecular knowledge, enabling structured and reflective reasoning. The model outperforms over 20 state-of-the-art multi-task large language models (LLMs) across 10 molecular tasks on 47 metrics, including property prediction, molecule generation, and reaction prediction. It achieves a 50.3% improvement over the base model while ensuring interpretability. It can bridge data-driven and knowledge-integrated approaches for intelligent molecular design.
    👤 Human Methodology
  • 2510.0089
    BasketVision: Benchmarking MLLMs' Grasp of Complex Dynamic Systems
    While Multimodal Large Language Models (MLLMs) excel on general visual tasks, their capacity to comprehend complex dynamic systems remains a critical open question. Such systems, governed by physical laws, explicit rules, and multi-agent interactions, form the fabric of the real world. To facilitate a systematic diagnosis of current MLLM limitations, we introduce BasketVision, a new benchmark that leverages professional basketball as a microcosm for these dynamic environments. BasketVision probes model capabilities across seven dimensions—spanning perception, reasoning, and prediction—through 6,000 curated, bilingual questions from professional game data. An automated data generation pipeline underpins the benchmark, ensuring both scalability and fine-grained precision. Our evaluation of 23 leading models reveals a chasm between machine and human cognition: human experts attain 96.34% accuracy, while the premier model, GPT-4o, achieves only 63.15%. The analysis pinpoints spatial reasoning as a persistent bottleneck and uncovers specific patterns of task specialization. BasketVision thus serves as a crucial apparatus for charting the frontiers of MLLMs and steering future work toward more robust reasoning in dynamic visual worlds.
    👤 Human Methodology
    🎯 ICAIS2025 Accepted Paper
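The entry above describes a benchmark built from curated bilingual questions scored by accuracy along seven capability dimensions. The sketch below shows what such an evaluation loop might look like; the dimension names, question schema, and `model_fn` interface are assumptions for illustration, not the benchmark's actual data format or API.

```python
# Illustrative sketch of a BasketVision-style evaluation loop.
# Dimension names, the question schema, and the model interface are
# assumptions for illustration; they are not taken from the paper.
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable

@dataclass
class Question:
    dimension: str      # e.g. one of the seven capability dimensions
    language: str       # "en" or "zh" (the benchmark is bilingual)
    prompt: str
    choices: list[str]
    answer: str         # ground-truth choice label

def evaluate(model_fn: Callable[[Question], str], questions: list[Question]) -> dict[str, float]:
    """Return per-dimension accuracy of a model's predicted choice labels."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        total[q.dimension] += 1
        if model_fn(q) == q.answer:
            correct[q.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}

# Toy usage: a "model" that always picks the first choice.
if __name__ == "__main__":
    qs = [Question("spatial_reasoning", "en", "Where is the screener?", ["A", "B"], "B")]
    print(evaluate(lambda q: q.choices[0], qs))  # {'spatial_reasoning': 0.0}
```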
  • 2603.0004
    Correcting hybrid density functionals to model Y6 and other non-fullerene acceptors
    Tom Ward, Isabel Creed, Tim Rein, Jarvist Moore Frost
    Recently developed fused-ring organic electron acceptors such as Y6 have strong oscillator strengths, good charge-carrier transport and low bandgaps. They therefore find extensive application in optoelectronic devices such as solar cells. Due to the large number of atoms in representative aggregates of these materials, an efficient electronic structure method is needed to model them. Standard density functionals describe charge-transfer states poorly and were developed for vacuum calculations on individual molecules. In this work we tune a range-separated hybrid functional for Y6. We characterise representative dimers of the solid state and show that the extensive solvatochromic effects in Y6 dimers are due, in part, to oscillator-strength borrowing. We provide an explanation for the short optimally tuned range-separation parameter, based on the Penn model for the frequency-dependent dielectric function of a semiconductor. We caution that standard range-separated hybrids are less accurate than global hybrids for these, and similar, materials. We show how reducing the range-separation length improves the accuracy of standard functionals, without an involved tuning process.
    👤 Human Theoretical
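For orientation on the tuning procedure this abstract refers to, the standard ionization-potential criterion for choosing the range-separation parameter, and the Penn-model estimate of a semiconductor's static dielectric constant, take the textbook forms below. These are generic expressions for context, not equations quoted from the paper.

```latex
% Standard IP-tuning criterion for the range-separation parameter \omega
% (textbook form, not quoted from the paper): minimise the mismatch between
% the HOMO eigenvalue and the ionization potential for the N- and (N+1)-electron systems.
J^{2}(\omega) = \sum_{i \in \{N,\,N+1\}}
    \left[ \varepsilon_{\mathrm{HOMO}}^{\,i}(\omega) + \mathrm{IP}^{\,i}(\omega) \right]^{2},
\qquad
\omega^{\ast} = \arg\min_{\omega} J^{2}(\omega).

% Penn model for the static dielectric constant of a semiconductor with
% plasma frequency \omega_p and average gap E_g:
\epsilon(0) \approx 1 + \left( \frac{\hbar\omega_p}{E_g} \right)^{2}.
```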
  • 2510.0013
    A Review of Intelligent Rock Mechanics: From Methods to Applications
    Artificial Intelligence (AI) has great potential to transform rock mechanics by tackling its inherent complexities, such as anisotropy, nonlinearity, discontinuity, and multiphase behavior. This review explores the evolution of AI, from basic neural networks like the BP model to advanced architectures such as Transformers, and their applications in areas like microstructure reconstruction, prediction of mechanical parameters, and addressing engineering challenges such as rockburst prediction and tunnel deformation. Machine learning techniques, particularly Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs), have been crucial in automating tasks like fracture detection and efficiently generating 3D digital rock models. However, the effectiveness of AI in rock mechanics is limited by data scarcity and the need for high-quality datasets. Hybrid approaches, such as combining physics-informed neural networks (PINNs) with traditional numerical methods, offer promising solutions for solving governing equations. Additionally, Large Language Models (LLMs) are emerging as valuable tools for code generation and decision-making support. Despite these advancements, challenges remain, including issues with reproducibility, model interpretability, and adapting AI models to specific domains. Future progress will hinge on the availability of improved datasets, greater interdisciplinary collaboration, and the integration of spatial intelligence frameworks to bridge the gap between AI’s theoretical potential and its practical application in rock engineering.
    🤖 AI Survey
    🎯 ICAIS2025 Accepted Paper
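The review above highlights hybrid approaches that pair physics-informed neural networks (PINNs) with traditional numerical methods. As a point of reference, here is a minimal PINN sketch: a small network fitted to sparse data while the residual of a 1D steady diffusion equation is penalised at collocation points. This is a generic PyTorch illustration of the technique, not code from any surveyed work; the equation, source term, and network sizes are arbitrary choices.

```python
# Minimal physics-informed neural network (PINN) sketch: fit u(x) to sparse data
# while penalising the residual of a 1D steady diffusion equation u''(x) = f(x),
# with an assumed source term f(x) = -sin(x). Generic illustration only.
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def pde_residual(x):
    """Residual u'' - f, i.e. u'' + sin(x), computed with autograd."""
    x = x.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return d2u + torch.sin(x)

x_data = torch.tensor([[0.0], [torch.pi / 2], [torch.pi]])
u_data = torch.sin(x_data)                                # sparse "measurements"
x_col = torch.linspace(0, torch.pi, 50).reshape(-1, 1)    # collocation points

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = ((net(x_data) - u_data) ** 2).mean() + (pde_residual(x_col) ** 2).mean()
    loss.backward()
    opt.step()
```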
  • 2511.0001
    PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
    Evaluating the scientific discovery capabilities of large language model-based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires specialized benchmarks currently lacking in the landscape. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent. This allows researchers to dissect agent performance along axes including problem complexity and prior-knowledge level. The benchmark comprises a suite of interactive simulations, where agents must actively probe environments, gather data sequentially under constraints, and formulate hypotheses about underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark's utility by presenting results from baseline LLMs, showcasing its ability to differentiate capabilities based on varying priors and task complexity.
    👤 Human Methodology
    🎯 ICAIS2025 Accepted Paper
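The interaction pattern described in the abstract above (an agent actively probes a hidden physical law under a query budget, then proposes a hypothesis that is scored against the ground truth) can be pictured with the toy loop below. The environment, budget, and scoring rule are illustrative assumptions and not PhysGym's actual API.

```python
# Illustrative sketch of an interactive discovery loop in the spirit of the
# benchmark above: an agent probes a hidden physical law under a query budget,
# then its hypothesis is scored for fidelity. Not PhysGym's actual interface.
import math, random

class HiddenPendulum:
    """Hidden law: period T = 2*pi*sqrt(L / g), observed with small noise."""
    def __init__(self, g=9.81, budget=10):
        self._g, self.budget = g, budget
    def probe(self, length_m: float) -> float:
        assert self.budget > 0, "query budget exhausted"
        self.budget -= 1
        return 2 * math.pi * math.sqrt(length_m / self._g) * random.gauss(1.0, 0.01)

def naive_agent(env: HiddenPendulum) -> float:
    """Estimate g from a handful of probes and return it as the hypothesis."""
    estimates = []
    for length in (0.5, 1.0, 2.0):
        period = env.probe(length)
        estimates.append(4 * math.pi**2 * length / period**2)  # invert T = 2*pi*sqrt(L/g)
    return sum(estimates) / len(estimates)

env = HiddenPendulum()
g_hat = naive_agent(env)
fidelity = abs(g_hat - 9.81) / 9.81          # simple relative-error "fidelity" score
print(f"hypothesis g = {g_hat:.3f}, relative error = {fidelity:.3%}")
```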
  • 2511.0009
    A Pilot Study Evaluating Large Language Models as Reviewers at Academic Conferences
    This paper presents a new system for academic peer review that is more objective, efficient, and community-guided. Our system incorporates author-assisted evaluation (Author-AAE) and community-guided review (CGR) into the peer review of AI conferences. This contrasts with existing approaches, which propose alternative systems that address only some of these challenges. Our evaluation uses data from three major AI conferences that used our system and from a survey of reviewers. Their feedback indicates that our system’s reviews are superior to single-LLM-based reviews due to their reduced subjectivity and enhanced quality. The reviewers’ scores for our system’s reviews were significantly higher than for single-LLM-based reviews across multiple metrics: “Reproducibility and Quality” (by 0.427 ± 0.007), “Review Quality” (by 0.265 ± 0.09), and “Alignment between opinion and paper score” (by 0.503 ± 0.090). In addition, we found that single-LLM-based reviews are more likely to be rejected by the program committee after major author revisions (on average by 0.182 ± 0.103) and are much more likely to be rejected overall (on average by 0.300 ± 0.124), compared to our system’s reviews. These results suggest that our system is better at reducing the arbitrariness of the current peer review system and can serve as an inspiration for the scientific community to explore new review systems.
    🤖 AI Empirical
    🎯 ICAIS2025 Accepted Paper
  • 2509.0011
    Reinforce Lifelong Interaction Value of User-Author Pairs for Large-Scale Recommendation Systems
    Yisha Li, Lexi Gao, Jingxin Liu, Xiang Gao, Xin Li, Haiyang Lu, Liyin Hong
    Recommendation systems (RS) help users find content of interest and connect authors with their target audience. Most research in RS focuses either on accurately predicting users’ immediate feedback (such as click-through rate) or on improving users’ long-term engagement. However, these approaches ignore the influence on authors and the lifelong interaction value (LIV) of user-author pairs, which is particularly crucial for improving the prosperity of social communities across platforms. Reinforcement learning (RL) can optimize long-term benefits and has been widely applied in RS. In this paper, we introduce RL to Reinforce Lifelong Interaction Value of User-Author pairs (RLIV-UA) based on each interaction of UA pairs. To address the long intervals between UA interactions and the large scale of the UA space, we propose a novel Sparse Cross-Request Interaction Markov Decision Process (SCRI-MDP) and introduce an Adjacent State Approximation (ASA) method to construct RL training samples. Additionally, we introduce Multi-Task Critic Learning (MTCL) to capture the progressive nature of UA interactions (click → follow → gift), where denser interaction signals are leveraged to compensate for the learning of sparse labels. Finally, an auxiliary supervised learning task is designed to enhance the convergence of the RLIV-UA model. In offline experiments and online A/B tests, the RLIV-UA model achieves both higher user satisfaction and higher platform profits than the compared methods.
    👤 Human Methodology
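A rough picture of two ideas named in the abstract above: building RL transitions from sparse, widely-spaced user-author interactions by treating the next observed interaction as the adjacent next state, and a multi-task critic with heads for progressively denser signals (click, follow, gift). Feature layout, rewards, and network sizes below are invented for illustration; the paper's SCRI-MDP and ASA formulations are richer than this sketch.

```python
# Hedged sketch: (i) adjacent-state construction of (s, r, s') samples from a
# sparse, time-ordered log of one user-author (UA) pair, and (ii) a multi-task
# critic with a shared trunk and per-signal heads trained with TD(0).
# All names, rewards, and sizes are illustrative placeholders.
import torch

def build_transitions(interactions):
    """interactions: time-ordered list of dicts with 'state' (feature tensor)
    and per-signal 'rewards' for one UA pair. Each consecutive pair of
    interactions becomes one (state, rewards, next_state) training sample."""
    return [(cur["state"], cur["rewards"], nxt["state"])
            for cur, nxt in zip(interactions, interactions[1:])]

class MultiTaskCritic(torch.nn.Module):
    def __init__(self, state_dim, tasks=("click", "follow", "gift")):
        super().__init__()
        self.trunk = torch.nn.Sequential(torch.nn.Linear(state_dim, 64), torch.nn.ReLU())
        self.heads = torch.nn.ModuleDict({t: torch.nn.Linear(64, 1) for t in tasks})
    def forward(self, state):
        h = self.trunk(state)
        return {t: head(h).squeeze(-1) for t, head in self.heads.items()}

def td_loss(critic, s, rewards, s_next, gamma=0.95):
    """One TD(0) step per head; denser heads (click) provide extra gradient
    signal through the shared trunk for the sparser ones (gift)."""
    values, next_values = critic(s), critic(s_next)
    loss = 0.0
    for task, v in values.items():
        target = rewards[task] + gamma * next_values[task].detach()
        loss = loss + torch.nn.functional.mse_loss(v, target)
    return loss

# Toy usage on a random batch of 4 transitions with 8-dimensional states.
critic = MultiTaskCritic(state_dim=8)
s, s_next = torch.randn(4, 8), torch.randn(4, 8)
r = {"click": torch.ones(4), "follow": torch.zeros(4), "gift": torch.zeros(4)}
print(td_loss(critic, s, r, s_next).item())
```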
  • 2602.0002
    A Survey on Evaluation of Large Language Models
    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie
    Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs.
    👤 Human Survey
  • 2511.0018
    From Virtual Cells to Programmable Humans: Advancing Digital Biology Through Hybrid AI Systems
    The convergence of artificial intelligence and systems biology is giving rise to a new paradigm in biomedical research—AI-powered virtual biological systems. From single-cell simulations to organ-level models and ultimately programmable virtual humans, this digital continuum holds transformative potential for disease modeling, personalized medicine, and therapeutic discovery. In this review, we critically examine the state of the art in AI-driven simulations, including the numerical foundations, multiscale integration strategies, and the emerging class of hybrid models that bridge mechanistic and data-driven approaches. We explore the challenges of validation, uncertainty quantification, and regulatory alignment across simulation scales, with particular focus on the development of simulation accountability frameworks such as SIM-CARDs. Ethical and privacy concerns, including algorithmic bias and data sovereignty in patient-specific models, are also addressed, alongside concrete proposals for governance and federated simulation workflows. Special attention is given to the technical complexity of multiscale modeling, including the integration of mechanistic solvers with neural architectures and the computational resources required for real-time, clinically actionable simulations. We conclude with a translational roadmap for virtual biology that projects validated virtual cells for drug screening by 2030, multi-organ simulations by 2040, and the emergence of programmable virtual humans by 2055. By unifying high-fidelity numerical models with explainable AI, and aligning simulation design with ethical, regulatory, and clinical needs, the field of digital biology is positioned to unlock scalable and trustworthy biomedical innovation.
    🤖 AI Survey
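The "hybrid model" pattern this review refers to, a mechanistic solver augmented by a learned correction, can be sketched in a few lines: a forward-Euler integrator whose right-hand side is a known first-order decay term plus a small neural residual. This is a generic illustration of the pattern, not a model from the review; the decay constant, step size, and network are arbitrary.

```python
# Generic sketch of a hybrid mechanistic/data-driven simulator: known decay
# dynamics plus a learned neural correction, integrated with forward Euler.
# Illustrative pattern only; not a model from the review.
import torch

residual_net = torch.nn.Sequential(
    torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1)
)

def hybrid_rhs(x, k=0.3):
    """dx/dt = mechanistic decay (-k*x) + learned correction term."""
    return -k * x + residual_net(x)

def simulate(x0, steps=100, dt=0.1):
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * hybrid_rhs(xs[-1]))
    return torch.stack(xs)

trajectory = simulate(torch.tensor([[1.0]]))
print(trajectory.shape)  # (101, 1, 1): time steps x batch x state dimension
```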
  • 2603.0008
    A Theoretical Framework for Memory Management in Large Language Models: A Cognitive-Adaptation and User-Participation Perspective
    DeepSeek
    In long-horizon interaction, large language models face a dual dilemma of memory overload and loss of user control: indiscriminate mass storage drives up cognitive load, while black-box forgetting mechanisms create privacy and trust concerns. This study proposes a theoretical framework for AI memory management that is both cognitively adaptive and open to user intervention (CAUM). First, a multi-dimensional memory-importance evaluation model is designed based on information entropy, interaction frequency, and conflict detection, and forward-association potential is introduced as a new dimension of information value, making memory retention more forward-looking. Second, a tiered storage architecture comprising a raw layer, a summary layer, and a skeleton layer is constructed, together with a threshold-triggered intelligent compression mechanism. Finally, a user-participation authorization mechanism is proposed: "memory curation proposals" are presented visually and reviewed and decided on by the user, realizing human-in-the-loop memory governance. The framework offers a systematic conceptual solution for alleviating LLM memory overload, extends information lifecycle theory to AI memory management, emphasizes user-centered control over information disposal, provides a new theoretical perspective on information lifecycle management in the AI era, and lays a conceptual foundation for building user-controllable intelligent memory systems.
    🤖 AI Theoretical
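The framework's main mechanisms described above (multi-factor importance scoring, threshold-triggered compression across raw/summary/skeleton tiers, and a user-approval step before anything is demoted) can be pictured with the following sketch. The weights, thresholds, and scoring terms are illustrative placeholders, not values or definitions from the paper.

```python
# Illustrative sketch of a CAUM-style memory manager: score each memory item on
# several factors, propose demoting low-scoring items to a more compressed tier,
# and apply the proposal only after user approval (human-in-the-loop).
# Weights, thresholds, and scoring terms are placeholders, not the paper's.
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    access_count: int = 0           # interaction frequency
    conflict: bool = False          # conflicts with newer information
    future_relevance: float = 0.5   # "forward-association potential", in [0, 1]
    tier: str = "raw"               # raw -> summary -> skeleton

def importance(item: MemoryItem, w=(0.4, 0.3, 0.3)) -> float:
    """Weighted combination of frequency, consistency, and future relevance."""
    freq = min(item.access_count / 10.0, 1.0)
    consistency = 0.0 if item.conflict else 1.0
    return w[0] * freq + w[1] * consistency + w[2] * item.future_relevance

def propose_compression(memory: list[MemoryItem], threshold=0.5) -> list[MemoryItem]:
    """Return the raw-tier items proposed for demotion to a compressed tier."""
    return [m for m in memory if m.tier == "raw" and importance(m) < threshold]

def apply_if_approved(proposal: list[MemoryItem], user_approves) -> None:
    """Human-in-the-loop step: demote only what the user explicitly approves."""
    for item in proposal:
        if user_approves(item):
            item.tier = "summary"   # a real system would also rewrite item.text

memory = [MemoryItem("user prefers metric units", access_count=7),
          MemoryItem("one-off typo correction", access_count=1, future_relevance=0.1)]
proposal = propose_compression(memory)
apply_if_approved(proposal, user_approves=lambda m: True)
print([(m.text, m.tier) for m in memory])
```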