Papers
-
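2604.0160: A Pure Cyclic Universe: Construction and Observational Verification of a Cosmological Model Based on the Geometric Axiom R ≡ 0
Starting from a single geometric axiom, that the spacetime scalar curvature vanishes identically, $$R \equiv 0$$, and taking the standard Einstein field equations without a cosmological constant as the base framework, this paper carries out a complete forward derivation of a cyclic universe model under the constraints of homogeneity, isotropy, and the dust approximation, without introducing ad hoc concepts such as dark matter, dark energy, or an inflaton field. The geometric energy-momentum tensor $$\Theta_{\mu\nu}$$ introduced here is a necessary consequence of the $$R \equiv 0$$ constraint rather than an additional ad hoc matter field; the core geometry-matter coupling coefficient $$\kappa$$ is the unique algebraic solution of the model's axiom system and has no adjustable free parameters. Through joint validation against Type Ia supernovae (the Pantheon+ sample), BAO large-scale structure, and the Planck 2018 CMB angular power spectrum, the model is found to agree fully with existing data on these core cosmological observables, to be statistically indistinguishable from the ΛCDM model, and to naturally alleviate the Hubble tension, without requiring any of the ad hoc physical mechanisms of the ΛCDM framework.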
-
2602.0002: A Survey on Evaluation of Large Language Models
Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level but also at the societal level, for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs.
-
2510.0039: Uncertainty Quantification in Machine Learning for Responsible AI
Machine learning and artificial intelligence will be deeply embedded in the intelligent systems humans use to automate tasking, optimize planning, and support decision-making. We present a critical review of uncertainty quantification (UQ) in large language models (LLMs), synthesizing insights from over 80 papers across leading venues (ACL, ASE, NeurIPS, ICML, AAAI, IJCAI, Nature, and others). We introduce UQ-Net, a unified probabilistic framework that combines Bayesian modeling, calibration, conformal prediction, and selective decision rules to disentangle epistemic and aleatoric uncertainty and to support reliable decision thresholds. UQ-Net integrates uncertainty estimates with calibration procedures and anomaly detection to enable safer selective deployment of LLM agents. Through case studies in medical diagnosis and code generation, we demonstrate that UQ-Net improves calibration and reduces predictive error by 15–20% relative to standard baselines. We survey existing evaluation practices and identify critical gaps: misalignment of consistency and entropy with factuality, lack of benchmarks for multi-episode interactions, and inconsistent metrics for calibration and tightness. We advocate for context-aware datasets, standardized metrics, and human-in-the-loop evaluations to better align UQ methods with deployment needs. Our review and proposed framework offer a principled foundation for operationalizing UQ in LLMs, advancing the development of trustworthy, responsible agentic AI for safety-sensitive, real-world applications.
-
2510.0038: The Hitchhiker's Guide to Autonomous Research: A Survey of Scientific Agents
The advancement of LLM-based agents is redefining AI for Science (AI4S) by enabling autonomous scientific research. Prominent LLMs have exhibited expertise across multiple domains, catalysing the construction of domain-specialised scientific agents. Nevertheless, the profound epistemic and methodological gaps between AI and the natural sciences still impede the systematic design, training, and validation of these agents. This survey bridges that gap by presenting an exhaustive blueprint for scientific agents, spanning systematic construction methodologies, targeted capability enhancement, and rigorous evaluations. Anchored in the canonical scientific workflow, this paper (i) gives an overview of scientific agents, tracing the development from general-purpose agents to scientific agents driven by articulated goal-orientation and then advancing a comprehensive taxonomy that organises existing agents by construction strategy and capability scope, and (ii) introduces a two-tier progressive framework, from constructing scientific agents from scratch to targeted capability enhancement, for realizing autonomous scientific research. It is our aspiration that this survey will serve as guidance for researchers across various domains, facilitating the systematic design of domain-specific scientific agents and stimulating further innovation in AI-driven scientific research. To support long-term progress, we curate a live repository, Awesome_Scientific_Agent (https://github.com/gudehhh666/Awesome_Scientific_Agent.git), that continuously aggregates emerging methods, benchmarks, and best practices.