📋 AI Review from DeepReviewer will be automatically processed
📋 AI Review from ZGCA will be automatically processed
The paper addresses the Paper Source Tracing (PST) task: given a focal paper and its references, identify the most influential "source papers" and assign importance weights. The authors propose pst-auto-agent, a multi-agent LLM ensemble that integrates DeepSeek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro via a pipeline including XML preprocessing (Section 4.2), advanced prompt engineering (Counterfactual Reasoning, Idea DNA Matching, Multi-Role Socratic Dialogue; Section 4.3), and an intelligent ensemble with model-specific weights, an "Intelligent Default Scoring" fallback, and a Consistency Penalty Mechanism (Sections 4.4–4.5). Evaluated with MAP on PST-Bench (they cite Zhang et al., 2024), the ensemble achieves MAP 0.388, outperforming individual LLMs (Table 1). An ablation (Table 2) shows contributions from each component. The method is also reported to complement a top KDD Cup 2024 solution (Section 6.5).
Cross‑Modal Consistency: 28/50
Textual Logical Soundness: 22/30
Visual Aesthetics & Clarity: 7/20
Overall Score: 57/100
Detailed Evaluation (≤500 words):
Visual ground truth (image‑first):
• Figure 1: “Muti_Agent Framework” – image missing/garbled (“[ width=.8 ] figs/… figure.png”); no panels visible.
• Figure 2: “Leaderboard of KDD Cup 2024” – image placeholder “[width=.8]figs/pst-leaderboard.png”; not viewable.
• Table 1: Baseline performance (MAP). Trend: ensemble (0.388) > Gemini (0.318) ≈ GPT (0.315) > DeepSeek (0.246).
• Table 2: Ablations. Trend: removing ensemble parts notably hurts MAP; full model best.
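For reference, the MAP metric behind the Table 1 and Table 2 trends can be sketched as follows; the function names and toy data are illustrative, not taken from the paper.

```python
# Minimal sketch of Mean Average Precision (MAP), the metric reported in
# Tables 1-2: mean over focal papers of the average precision of each
# paper's ranked reference list against its ground-truth source set.

def average_precision(ranked_refs, relevant):
    """AP for one focal paper: mean of precision@k over the relevant hits."""
    hits, precisions = 0, []
    for k, ref in enumerate(ranked_refs, start=1):
        if ref in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(predictions, ground_truth):
    """MAP over all focal papers in the benchmark."""
    aps = [average_precision(predictions[p], ground_truth[p]) for p in ground_truth]
    return sum(aps) / len(aps)

# Toy example: two focal papers with ranked reference lists.
preds = {"p1": ["r3", "r1", "r2"], "p2": ["r5", "r4"]}
truth = {"p1": {"r1"}, "p2": {"r4", "r5"}}
print(mean_average_precision(preds, truth))
```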
1. Cross‑Modal Consistency
• Major 1: Core figures are missing/illegible, blocking verification of the framework and leaderboard claims. Evidence: “Figure 1: Muti_Agent Framework” and “[width = .8]figs/pst-leaderboard.png”.
• Major 2: Figure numbering mismatch (“Figure 1” vs “Figure4.1”), creating ambiguity about the architecture diagram. Evidence: “An illustrative diagram … Figure4.1.”
• Major 3: Consistency penalty is described as “maximum pairwise” but the equation sums all pairs and mixes C/D. Evidence: “A dynamic penalty … based on maximum pairwise” vs “P(i)= C(|s_deepseek−s_gpt|)+…; where 𝖣 is the … Factor”.
• Major 4: “Six‑component pipeline” lists only five components, confusing the system overview. Evidence: “six‑component pipeline: XML Preprocessing, Prompt Engineering, Multi-Agent Prediction, Intelligent Ensemble, and Prediction Method.”
• Minor 1: Model naming inconsistent: “Deepseek‑R1‑250528/0528/DeepSeek‑R1.” Evidence: Sec. Abstract vs Sec. 4.4 vs Table 1.
• Minor 2: JSON key formatting inconsistent (“confidence Scores” with space). Evidence: Sec. 4.4.1 example.
2. Text Logic
• Major 1: Penalty function not fully specified (C undefined mapping; no normalization; contradiction with “maximum”). Evidence: “penalty factors ranging from 0.1 to 1.0” + undefined C in 4.4.4.
• Minor 1: “Tuning‑free baseline” yet prompts are optimized on >1,000 labeled papers; scope of tuning not quantified. Evidence: “optimal prompt configuration achieving the highest F1‑score.”
• Minor 2: Subset selection of PST‑Bench not analyzed for bias; implications for generalization not discussed. Evidence: “use a subset of 1,576 papers … criteria (1)‑(3).”
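One way to reconcile the “maximum pairwise” description with a well-defined formula is sketched below. This is a hypothetical reconstruction under the review's reading (penalty factor 1.0 at full agreement, 0.1 at maximal disagreement), not the paper's actual, underspecified equation.

```python
# Hypothetical reconstruction of the consistency penalty: map the maximum
# pairwise score difference linearly onto a penalty factor in [0.1, 1.0],
# assuming each model's score lies in [0, 1].
from itertools import combinations

def consistency_penalty(scores, p_min=0.1, p_max=1.0):
    """scores: per-model scores for one reference, each in [0, 1]."""
    d_max = max(abs(a - b) for a, b in combinations(scores, 2))
    return p_max - (p_max - p_min) * d_max  # d_max = 0 -> p_max, d_max = 1 -> p_min

print(round(consistency_penalty([0.8, 0.8, 0.8]), 3))  # full agreement
print(round(consistency_penalty([1.0, 0.0, 0.5]), 3))  # maximal disagreement
```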
3. Figure Quality
• Major 1: Figures are absent/garbled; “Figure‑Alone” test fails; cannot infer messages without captions. Evidence: garbled LaTeX path in Sec. 3.2; placeholder in Sec. 6.5.
• Minor 1: Typo in figure title (“Muti_Agent”) and uneven typography in equations (spaced letters) hinder readability. Evidence: “Muti_Agent Framework”; “s _ {g p t}”.
Key strengths:
• Clear, useful ablation isolating major contributors.
• Simple, reproducible default‑scoring heuristic; tables report consistent MAP improvements.
• Practical pipeline with XML fallbacks and prompt discipline.
Key weaknesses:
• Missing/illegible figures and numbering errors block verification of core claims.
• Penalty function/notation inconsistencies; pipeline component count mismatch.
• Model/version naming inconsistent; limited reproducibility details for closed APIs and prompt tuning.
📋 AI Review from SafeReviewer will be automatically processed
This paper introduces a novel approach to Paper Source Tracing (PST), a task focused on identifying the most influential cited papers that contribute to a focal paper's central ideas or methodologies. The authors propose a multi-agent ensemble framework, termed `pst-auto-agent`, which leverages three state-of-the-art large language models (LLMs): DeepSeek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro. The core idea is to combine the strengths of these models through a structured pipeline involving XML preprocessing, advanced prompt engineering, and an intelligent ensemble strategy. The prompt engineering incorporates counterfactual reasoning, Idea DNA matching, and multi-role Socratic dialogue to elicit detailed responses from the LLMs. The ensemble strategy includes a consistency penalty mechanism to mitigate inconsistent predictions and an intelligent default scoring system for cases where explicit confidence scores are unavailable.

The authors also introduce PST-Bench, a dataset of 2,141 meticulously labeled computer science publications, which serves as a benchmark for evaluating PST methods. The experimental results show that `pst-auto-agent` achieves a Mean Average Precision (MAP) of 0.388 on PST-Bench, outperforming the individual baseline models. Furthermore, the authors show that integrating their method with the top-ranked English Hercules framework from the KDD Cup 2024 improves its performance, securing 4th place on the leaderboard and ranking 1st among all GPU-free methods. This work highlights the potential of combining multiple LLMs and advanced prompting techniques for complex academic information tracing, offering a tuning-free baseline for the PST problem that does not require feature engineering.
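The XML preprocessing step with fallbacks might look like the following minimal sketch; the function name, tag choices, and fallback regex are assumptions, since the paper does not specify them.

```python
# Hypothetical sketch of XML preprocessing with a fallback: parse the
# paper's XML for reference titles, and fall back to a plain regex scan
# of the raw text if the XML is malformed.
import re
import xml.etree.ElementTree as ET

def extract_reference_titles(xml_text):
    try:
        root = ET.fromstring(xml_text)
        titles = [t.text for t in root.iter("title") if t.text]
        if titles:
            return titles
    except ET.ParseError:
        pass
    # Fallback: naive regex scan over the raw text (illustrative only).
    return re.findall(r"<title>(.*?)</title>", xml_text)

ok = "<refs><ref><title>Paper A</title></ref></refs>"
broken = "<refs><ref><title>Paper B</title>"  # unclosed tags
print(extract_reference_titles(ok))
print(extract_reference_titles(broken))
```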
After a thorough examination of the paper, I've identified several key weaknesses that warrant attention. First, the paper's novelty is limited: it primarily ensembles existing LLMs rather than introducing a fundamentally new approach to the PST problem. While the authors claim a novel multi-agent ensemble architecture, the core components are established LLMs, and the specific implementation details of the ensemble strategy are not fully elaborated. The paper states, "We propose a novel multi-agent ensemble architecture designed to tackle the PST problem effectively. Our system leverages the combined strengths of state-of-the-art large language models (LLMs): Deepseek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro." This highlights the reliance on existing models, and the lack of a detailed explanation of the ensemble mechanism makes it difficult to assess the method's true novelty.

This is further compounded by the fact that the paper does not compare its method against the state-of-the-art solutions from the KDD Cup 2024 that are mentioned in the related work section. The paper notes, "Recent advances in natural language processing and graph neural networks have enabled more sophisticated approaches to literature analysis... And paper source tracing have witnessed significant methodological innovations, particularly within the context of the KDD Cup 2024 OAG Challenge." However, these methods are not used as direct baselines in the experiments, which limits the ability to gauge the true advancement of the proposed method. The experimental evaluation is also insufficient: the authors compare only against the individual LLMs used in their ensemble, rather than a wider range of existing PST methods. The paper states, "We implement several baseline models for comparison to validate the design of our approach: DeepSeek-R1-0528, GPT-5-2025-08-07, and Gemini-2.5-pro."
This narrow selection of baselines makes it difficult to assess the true performance of the proposed method relative to the broader landscape of PST techniques. The paper also lacks sufficient detail regarding the implementation of its key components. The prompt engineering strategy, described as a "unified prompt architecture incorporating multiple advanced reasoning frameworks," is not explained in sufficient detail. The paper states, "Our prompt engineering methodology is grounded in extensive empirical analysis... The unified prompt architecture incorporates multiple advanced reasoning frameworks derived from cutting-edge LLM research: Counterfactual Reasoning, Idea DNA Matching, Multi-Role Socratic Dialogue." However, the specific prompts and the implementation of these reasoning frameworks are not provided, making the method difficult to reproduce or fully understand.

Similarly, the consistency penalty mechanism is described only in general terms; the exact mathematical formulation and implementation details are missing. The paper states, "A dynamic penalty function is applied based on maximum pairwise score differences between models, with penalty factors ranging from 0.1 (maximal disagreement) to 1.0 (minimal difference) to mitigate inconsistent predictions." This lack of detail makes it hard to assess the effectiveness and potential limitations of the mechanism.

Furthermore, the paper's claim of a "tuning-free" method is misleading, since the ensemble weights are empirically set, which constitutes a form of tuning. The paper states, "We employ an advanced ensemble methodology that intelligently combines predictions from three complementary LLMs with optimized weight allocation. Empirically, we set the weight for each LLM as follows: DeepSeek-R1-0528 (0.3), GPT-5 (0.35), Gemini-2.5-pro (0.35)." This contradicts the claim of being "tuning-free" and raises questions about the generalizability of the chosen weights.
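A minimal sketch of the quoted weighted ensemble follows, assuming per-model scores in [0, 1] and a multiplicative consistency penalty; the combination rule is an assumption, since the paper does not state how the penalty enters the final score.

```python
# Sketch of the weighted ensemble with the quoted empirical weights
# (DeepSeek-R1-0528 0.3, GPT-5 0.35, Gemini-2.5-pro 0.35). The
# multiplicative penalty is a hypothetical choice, not the paper's stated
# combination rule.

WEIGHTS = {"deepseek": 0.30, "gpt5": 0.35, "gemini": 0.35}

def ensemble_score(scores, penalty=1.0):
    """scores: {model_name: importance score in [0, 1]} for one reference."""
    weighted = sum(WEIGHTS[m] * s for m, s in scores.items())
    return penalty * weighted

print(round(ensemble_score({"deepseek": 0.2, "gpt5": 0.8, "gemini": 0.6}), 3))
```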
Finally, the paper's reliance on a single dataset, PST-Bench, limits the generalizability of the results. The paper states, "To address the limitations of existing methods and provide a robust benchmark for the PST problem, we introduce PST-Bench (Zhang et al., 2024), a novel dataset comprising 2,141 meticulously labeled computer science publications." While the KDD Cup 2024 evaluation provides some evidence of broader applicability, the primary evaluation is on a single domain-specific dataset. These weaknesses collectively suggest that while the paper presents a promising approach, it requires further refinement and more rigorous evaluation to fully establish its contribution to the field.
To address the identified weaknesses, I recommend several concrete improvements.

First, the authors should significantly expand the experimental evaluation by including a wider range of baseline methods, particularly those that performed well in the KDD Cup 2024, such as the top-performing solutions mentioned in the related work section. This would provide a more robust assessment of the proposed method's performance relative to the state of the art. The paper should also include more traditional baselines, such as TF-IDF or simpler citation-based methods, for a more comprehensive comparison.

Second, the authors need to provide a more detailed explanation of the proposed method's implementation. This includes the specific prompts used in the prompt engineering strategy, as well as the mathematical formulation and implementation details of the consistency penalty mechanism. The paper should also clarify the roles of each agent in the ensemble and the criteria for selecting these specific models. This level of detail is crucial for reproducibility and for understanding the method's inner workings.

Third, the authors should re-evaluate their claim of a "tuning-free" method. The empirical setting of ensemble weights constitutes a form of tuning, and the paper should acknowledge this and explain in detail how these weights were determined. It would also be beneficial to explore the sensitivity of the method to different weight settings and to consider more principled approaches for weight optimization.

Fourth, the authors should expand the evaluation to include additional datasets to demonstrate the generalizability of the proposed method. This could involve datasets from different domains or different types of academic literature, and would help address the concern that the method is overfit to the PST-Bench dataset.
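The weight-sensitivity exploration recommended above could start from a simple grid search over the weight simplex; `evaluate_map` is a hypothetical stand-in for scoring a weight triple on a held-out split, not an API from the paper.

```python
# Grid-search sketch over three ensemble weights that sum to 1, for the
# sensitivity analysis the review recommends. `evaluate_map` is assumed
# to score a weight triple (e.g., dev-set MAP); higher is better.
from itertools import product

def weight_grid(n=20):
    """All (w1, w2, w3) with each weight a multiple of 1/n and summing to 1."""
    ticks = [i / n for i in range(n + 1)]
    for w1, w2 in product(ticks, repeat=2):
        w3 = 1.0 - w1 - w2
        if 0.0 <= w3 <= 1.0:
            yield (round(w1, 2), round(w2, 2), round(w3, 2))

def best_weights(evaluate_map, n=20):
    return max(weight_grid(n), key=evaluate_map)

# Toy objective peaking at the paper's reported weights (0.3, 0.35, 0.35):
toy = lambda w: -((w[0] - 0.3) ** 2 + (w[1] - 0.35) ** 2)
print(best_weights(toy))
```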
Fifth, the authors should provide a more detailed analysis of the results, including a breakdown of performance on different types of papers or citations. This would help identify the strengths and weaknesses of the proposed method and provide insight into its behavior. The paper should also analyze the computational cost of the proposed method and compare it to the baselines.

Finally, the authors should address the formatting issues in the paper, such as the excessive spacing between paragraphs and the placement of Figure 1. These issues detract from the readability and overall quality of the paper. By addressing these points, the authors can significantly strengthen their work and make a more substantial contribution to the field.
Several key questions arise from my analysis of this paper.

First, what specific prompts were used for each of the reasoning frameworks (counterfactual reasoning, Idea DNA matching, and multi-role Socratic dialogue)? Providing these prompts would allow for a better understanding of the prompt engineering strategy and its impact on the results.

Second, what is the precise mathematical formulation of the consistency penalty mechanism? The paper describes it conceptually but lacks the specific details needed for implementation and evaluation.

Third, how were the ensemble weights (0.3, 0.35, 0.35) determined? The paper states they were "empirically set," but it does not describe the process used to arrive at these specific values.

Fourth, what is the computational cost of the proposed method compared to the baseline models? The paper provides no analysis of the computational resources required by the ensemble approach.

Fifth, how does the proposed method perform on different types of papers or citations? A more detailed breakdown of the results could reveal potential biases or limitations of the method.

Sixth, how does the proposed method compare to the top-performing solutions from the KDD Cup 2024 on the same evaluation set? This comparison would better situate the method relative to the state of the art.

Finally, what are the limitations of the proposed method, and what are the potential avenues for future research? The paper does not explicitly discuss limitations or suggest directions for future work. Addressing these questions would provide a more complete and nuanced understanding of the proposed method and its potential impact on the field.