Papers
- 2602.0002: A Survey on Evaluation of Large Language Models
  Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level but also at the societal level, for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. We then summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs.
- 2510.0039: Uncertainty Quantification in Machine Learning for Responsible AI
  Machine learning and artificial intelligence will be deeply embedded in the intelligent systems humans use to automate tasking, optimize planning, and support decision-making. We present a critical review of uncertainty quantification (UQ) in large language models (LLMs), synthesizing insights from over 80 papers across leading venues (ACL, ASE, NeurIPS, ICML, AAAI, IJCAI, Nature, and others). We introduce UQ-Net, a unified probabilistic framework that combines Bayesian modeling, calibration, conformal prediction, and selective decision rules to disentangle epistemic and aleatoric uncertainty and to support reliable decision thresholds. UQ-Net integrates uncertainty estimates with calibration procedures and anomaly detection to enable safer selective deployment of LLM agents. Through case studies in medical diagnosis and code generation, we demonstrate that UQ-Net improves calibration and reduces predictive error by 15-20% relative to standard baselines. We survey existing evaluation practices and identify critical gaps: misalignment of consistency and entropy with factuality, lack of benchmarks for multi-episode interactions, and inconsistent metrics for calibration and tightness. We advocate for context-aware datasets, standardized metrics, and human-in-the-loop evaluations to better align UQ methods with deployment needs. Our review and proposed framework offer a principled foundation for operationalizing UQ in LLMs, advancing the development of trustworthy, responsible agentic AI for safety-sensitive, real-world applications.
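  One of the ingredients this abstract names, conformal prediction, turns a model's raw scores into prediction sets with a coverage guarantee. A minimal sketch of generic split conformal prediction for classification (not the paper's UQ-Net framework; function names and the toy numbers are illustrative assumptions):

  ```python
  import math

  def split_conformal_threshold(cal_scores, alpha=0.1):
      """Compute the conformal quantile from calibration nonconformity scores.

      cal_scores: nonconformity scores on a held-out calibration set
      (e.g. 1 - model probability assigned to the true label).
      Returns the score threshold giving ~(1 - alpha) marginal coverage.
      """
      n = len(cal_scores)
      # Finite-sample-corrected quantile level: ceil((n + 1) * (1 - alpha)).
      k = min(n, math.ceil((n + 1) * (1 - alpha)))
      return sorted(cal_scores)[k - 1]

  def prediction_set(probs, qhat):
      """Keep every label whose nonconformity score (1 - prob) is <= qhat."""
      return {label for label, p in probs.items() if 1.0 - p <= qhat}

  # Toy example: 9 calibration scores, target ~80% coverage.
  qhat = split_conformal_threshold(
      [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], alpha=0.2)
  labels = prediction_set({"A": 0.5, "B": 0.3, "C": 0.1}, qhat)
  ```

  Selective deployment, as described in the abstract, would then abstain or defer whenever the prediction set is empty or too large to act on.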
- 2510.0038: The Hitchhiker's Guide to Autonomous Research: A Survey of Scientific Agents
  The advancement of LLM-based agents is redefining AI for Science (AI4S) by enabling autonomous scientific research. Prominent LLMs have exhibited expertise across multiple domains, catalysing the construction of domain-specialised scientific agents. Nevertheless, the profound epistemic and methodological gaps between AI and the natural sciences still impede the systematic design, training, and validation of these agents. This survey bridges the existing gap by presenting an exhaustive blueprint for scientific agents, spanning systematic construction methodologies, targeted capability enhancement, and rigorous evaluation. Anchored in the canonical scientific workflow, this paper (i) presents an overview of scientific agents, tracing the development from general-purpose agents to scientific agents driven by articulated goal-orientation, and then advances a comprehensive taxonomy that organises existing agents by construction strategy and capability scope, and (ii) introduces a two-tier progressive framework, from constructing scientific agents from scratch to targeted capability enhancement, for realising autonomous scientific research. It is our aspiration that this survey will serve as guidance for researchers across various domains, facilitating the systematic design of domain-specific scientific agents and stimulating further innovation in AI-driven scientific research. To support long-term progress, we curate a live repository (Awesome_Scientific_Agent: https://github.com/gudehhh666/Awesome_Scientific_Agent.git) that continuously aggregates emerging methods, benchmarks, and best practices.