📋 AI Review from DeepReviewer will be automatically processed
📋 AI Review from ZGCA will be automatically processed
The paper surveys LLM-based scientific agents and proposes a structured blueprint for building them. It distinguishes scientific agents from general-purpose agents (Section 2.1) and introduces a three-level taxonomy of agents (Assistant, Partner, Avatar; Fig. 2, Table 2) organized by construction strategy and capability scope. The survey presents a two-tier progressive framework: (i) constructing agents from scratch via knowledge organization (Section 3.1; Fig. 4), knowledge injection (Section 3.2; explicit vs. implicit; Table 3), and tool integration (Section 3.3; Table 4); and (ii) targeted capability enhancement along memory (Section 4.1; structures and systems), reasoning (Section 4.2; general and domain-specific; Fig. 5), and collaboration (Section 4.3; multi-agent and human-AI). It also aggregates benchmarks across knowledge-intensive and experiment-driven tasks (Table 6) and claims to maintain a live repository (AWESOME_SCIENTIFIC_AGENT). The stated contributions are a rigorous deconstruction of scientific agents in the natural sciences, practical guidance on construction and enhancement, and a linkage between the scientific lifecycle and agent-building strategies.
Cross‑Modal Consistency: 31/50
Textual Logical Soundness: 18/30
Visual Aesthetics & Clarity: 9/20
Overall Score: 58/100
Detailed Evaluation (≤500 words):
1. Cross‑Modal Consistency
• Visual ground truth: Fig.1 montage of the 6‑stage lifecycle (icons, chat bubbles, tiny captions); Fig.2 three‑column taxonomy (Assistant/Partner/Avatar) linking “Construction Strategy”↔“Capability Scope”; Fig.3 (a) “Construction from Scratch” pipeline; (b) “Abilities Enhancement” pipeline; Fig.4 knowledge organization flows (unstructured→structured/instructions/KG); Fig.5 reasoning enhancement blocks (CoT/self‑consistency/domain‑rules/symbolic).
• Major 1: Header/entry mismatch in Table 2: the "Application Stages" column is defined over L/H/D/V/A/E, yet cells contain strings like "DVAE/LDAE/DQA", mixing task types with stages. Evidence: "Application Stages is … L…H…D…V…A…E" vs. rows "DVAE", "DQA".
• Major 2: Table 5 title says memory, but contents describe collaboration paradigms (role‑specialised agents, dialog & debate, knowledge sharing). Evidence: “Table 5: Scientific agents’ memory enhancement methods.” with rows “Role‑specialised agents… Dialog & debate…”.
• Major 3: Table 6 title promises collaboration challenge/mitigations, but the table lists evaluation benchmarks (BioMaze, GPQA, LAB‑Bench…). Evidence: “Table 6: Scientific agent collaboration challenge and mitigations.” followed by benchmark list.
• Major 4: Stage name inconsistency between text and Fig.1 (“literature mining” vs “Literature Synthesis”), weakening figure–text alignment. Evidence: “six stages as shown in Fig. 1: literature mining…” while pane reads “Literature Synthesis”.
• Minor 1: Fig.3 caption calls panels “(a) agent construction methodology… (b) Agent enhancement methodology” while the panels read “Scientist Agent Construction From Scratch” and “Scientific Agent Abilities Enhancement”.
• Minor 2: Several truncated/garbled references inside Section 3.3 (e.g., “AI Scientist‑v2 and Robinwithin…”), causing ambiguity in tool‑learning discussion.
2. Text Logic
• Major 1: Broken sentence in §3.3 interrupts reasoning about tool‑learning frameworks and planning/backtracking. Evidence: “most systems employ… RL/MCTS‑style search…g., AI Scientist‑v2 and Robinwithin…”.
• Minor 1: Numerous typos and inconsistencies ("contrusion", "STURUCTURE", "he principal contributions"), but the overall narrative remains followable.
• Minor 2: The claim "we review existing benchmarks" is supported, but the mislabeling of Table 6 obscures the intended structure.
3. Figure Quality
• Major 1: Fig.1 is cluttered, and many panes/captions are illegible at ~100% print size; key labels (e.g., step titles, dialogues) cannot be read, which blocks the reader's understanding of the "lifecycle".
• Minor 1: Fig.5 text is dense and small; legends (CoT1–CoT5, rules) would benefit from clearer typography.
• Minor 2: Fig.2 relies on icons; adding explicit legends for T/R/M/C and examples per level would strengthen the “figure‑alone” test.
📋 AI Review from SafeReviewer will be automatically processed
In this paper, the authors present a comprehensive survey of the current landscape of LLM-based scientific agents, focusing on their construction methodologies, capability scopes, and practical applications. The paper introduces a three-level taxonomy of scientific agents—Agent as Assistant, Agent as Partner, and Agent as Avatar—based on their roles and functionalities. The authors provide a detailed review of the literature, categorizing agents by their knowledge organization, injection, and tool integration strategies. They also discuss the challenges and future directions in the field, emphasizing the need for more robust and specialized agents capable of handling complex scientific tasks. The paper includes a large table (Table 1) that compares recent surveys of LLM-based scientific research, and another table (Table 2) that lists various scientific agents, their methods, and application stages. While the paper offers a broad overview and is well-organized, it lacks in-depth analysis and detailed comparisons, which are crucial for a high-impact survey. The authors also promise a live repository (AWESOME_SCIENTIFIC_AGENT) to continuously aggregate emerging methods, benchmarks, and best practices, though this repository was not accessible at the time of review.
One of the key strengths of this paper is its comprehensive and well-organized structure. The authors have meticulously categorized a wide range of scientific agents, providing a clear taxonomy that helps readers understand the different levels of agent capabilities. The paper's focus on the construction methodologies, capability scopes, and practical applications of these agents is particularly valuable, as it offers a holistic view of the field. The authors also promise a live repository (AWESOME_SCIENTIFIC_AGENT) to continuously update the community on emerging methods, benchmarks, and best practices, which, if realized, could be a significant resource for researchers. The paper's breadth in covering various scientific domains and its detailed tables (Table 1 and Table 2) are also commendable, as they provide a useful reference for the current state of the field. The authors have done a good job in presenting the material in a clear and accessible manner, making it a valuable starting point for researchers new to the field of LLM-based scientific agents.
Despite its strengths, the paper has several significant weaknesses. First, it lacks the in-depth analysis and detailed comparisons that are crucial for a high-impact survey. For instance, in Section 2.2 the authors introduce a three-level taxonomy of scientific agents but do not provide a clear rationale for these specific levels. The paper states, 'By analyzing the prominent inductive features of representative studies, we introduce a three-tier taxonomy that encapsulates the progressive construction strategies and capability scope in Fig.2.' However, the criteria used for this analysis are not explicitly defined, making the taxonomy seem somewhat arbitrary. This lack of clarity undermines the paper's contribution and leaves readers questioning the validity of the proposed categories.

Additionally, the paper's tables, particularly Table 1 and Table 2, are overly detailed and difficult to interpret. Table 1, which compares recent surveys, is hard to parse because of unexplained abbreviations and undefined comparison criteria. Table 2, which lists various scientific agents, is so extensive that extracting meaningful insights is challenging; the sheer volume of entries and the inconsistent marking of components and application stages further diminish its practical value. For example, the 'Components' column in Table 2 uses the abbreviations T, R, M, and C without a legend in the table itself, and the 'Application Stages' column uses single letters that are not immediately interpretable. The paper also fails to analyze the components, application stages, and task descriptions of the agents in the detail needed for a thorough understanding of the field.
Furthermore, the paper's discussion of the limitations of current scientific agents is superficial. While the authors list challenges such as 'Factualism and rationality,' 'Framework adapted complex scientific tasks,' 'Self-improvement iteration,' 'Interaction optimization for scientific exploration,' and 'Multi-disciplinary agent,' the analysis is high-level and lacks concrete examples and detailed explanations. For instance, the paper states, 'Evaluating the rationality and factual accuracy of scientific experimental designs still remain challenge for existing scientific agents,' but does not delve into specific instances where agents have failed or the underlying reasons for those failures. The paper also lacks a granular breakdown of the types of reasoning errors (e.g., logical fallacies, incorrect application of scientific principles) and of the frequency of hallucinations across scientific domains. This omission leaves the reader without a clear understanding of the practical limitations of current agents.

Moreover, the paper does not adequately address the challenges of integrating specialized tools and workflows into scientific agents. The authors note that many existing systems are 'redundantly reinventing wheels across various domains and lacking cohesive, well-structured architectures,' but provide neither specific examples of these redundant efforts nor the technical challenges involved. The paper also fails to discuss the difficulty of ensuring the reliability and accuracy of these tools, which is crucial for scientific applications. Finally, the live repository (AWESOME_SCIENTIFIC_AGENT) was not accessible at the time of review; its inaccessibility undercuts the paper's claim of providing a continuously updated resource and raises questions about the authors' commitment to maintaining this asset.
Finally, the paper's discussion of evaluation metrics is inadequate. While the authors mention the use of metrics such as precision, recall, and F1-score, they do not provide a detailed analysis of the strengths and weaknesses of these metrics in the context of scientific research. The paper also lacks a discussion of the challenges in evaluating the creativity and innovation of scientific agents, which is essential for understanding their true potential and limitations. The absence of a dedicated 'Limitations' section further exacerbates these issues, as it leaves the reader without a comprehensive understanding of the challenges and future directions in the field. Overall, these weaknesses significantly impact the paper's ability to provide a thorough and insightful survey of LLM-based scientific agents.
To improve the paper, the authors should focus on deeper analysis and more detailed comparisons. For the taxonomy introduced in Section 2.2, they should clearly define the criteria used to select the three levels and explain how the levels differ in capabilities, limitations, and potential applications; this would help readers understand the rationale behind the taxonomy and its practical implications. The tables, particularly Table 1 and Table 2, should be condensed and clarified: Table 1 needs a legend for its abbreviations and a brief explanation of the comparison criteria, while Table 2 should be shortened to the most representative and influential agents, focusing on their key components, application stages, and task descriptions. The authors could move the full table to an appendix and include a summarized version in the main text.

The paper would also benefit from a more detailed discussion of the limitations of current scientific agents. The authors should provide concrete examples of where agents have failed or struggled, including specific instances of reasoning errors, hallucinations, and difficulties in integrating specialized tools, so that readers understand the practical limitations of current agents and the challenges that remain. They should likewise delve deeper into the challenges of tool and workflow integration, discussing how to ensure the reliability and accuracy of these tools and giving examples of redundant efforts across domains; this would highlight the need for more standardized and modular approaches to tool integration. The live repository (AWESOME_SCIENTIFIC_AGENT) should be made accessible, with a clear explanation of its structure and maintenance plan, which would enhance its value as a continuously updated community resource. Finally, the authors should add a dedicated 'Limitations' section that discusses the challenges of evaluating the creativity and innovation of scientific agents, exploring alternative evaluation metrics and methodologies such as expert reviews, case studies, and long-term impact assessments. This would provide a more balanced and comprehensive view of the field and guide future research efforts.
1. Could the authors provide a more detailed explanation of the criteria used to select the three levels in their taxonomy of scientific agents? How do these levels differ in terms of their capabilities, limitations, and potential applications?
2. What specific steps will the authors take to ensure the live repository (AWESOME_SCIENTIFIC_AGENT) is accessible and well-maintained? How will the repository be structured, and what types of information will it include?
3. Could the authors elaborate on the types of reasoning errors and hallucinations that current scientific agents are prone to? Are there specific domains or tasks where these issues are more prevalent, and what are the underlying causes?
4. What are the main challenges in integrating specialized tools and workflows into scientific agents, and how do these challenges vary across different scientific domains? Could the authors provide specific examples of redundant efforts and the technical difficulties involved?
5. How do the authors plan to address the limitations in evaluating the creativity and innovation of scientific agents? What alternative evaluation metrics and methodologies will they consider, and how will these be implemented in practice?