2510.0045 PST-AUTO-AGENT: A MULTI-AGENT ENSEMBLE FRAMEWORK FOR PAPER SOURCE TRACING v1

🎯 ICAIS2025 Accepted Paper

🎓 Meta Review & Human Decision

Decision: Accept

AI Review from DeepReviewer


📋 Summary

This paper introduces a multi-agent ensemble framework for the Paper Source Tracing (PST) problem: identifying and quantifying the influence of the primary source papers behind a given research paper. The authors propose a structured pipeline that integrates three state-of-the-art large language models (LLMs): Deepseek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro. The pipeline comprises XML preprocessing to extract relevant data from paper documents, advanced prompt engineering to elicit detailed responses from the LLMs, and a multi-agent integration strategy that combines the individual models' outputs.

The integration strategy is the core of the framework. It combines intelligent default scoring, which assigns base scores to candidate source papers and adjusts them according to each LLM's predictions; confidence score extraction, which refines the scores using the models' stated confidence levels; and a consistency penalty mechanism, which down-weights predictions on which the models disagree, reducing the impact of potentially unreliable outputs.

The authors evaluate the system on PST-Bench, a benchmark dataset for paper source tracing, and report a mean average precision (MAP) of 0.388, a notable improvement over the individual baseline models, suggesting that the ensemble effectively leverages each LLM's strengths. Further results on the KDD Cup 2024 dataset demonstrate practical utility by enhancing the performance of a leading method in that competition.

Overall, the paper presents a novel approach to the PST problem, combining multiple LLMs in a structured and intelligent manner and achieving promising results on benchmark datasets. However, as I discuss in the following sections, several areas could be improved to further strengthen its contributions.
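To make the ensemble idea concrete, the weighted combination at the heart of such a pipeline can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the per-model weights (0.30/0.35/0.35) are the ones quoted elsewhere in these reviews, while the function and variable names are assumptions.

```python
# Illustrative sketch (not the authors' released code) of a weighted
# three-model combination with a consistency damping factor.
WEIGHTS = {"deepseek": 0.30, "gpt": 0.35, "gemini": 0.35}  # per the reviews

def combine(scores, penalty=1.0):
    """Weighted average of per-model relevance scores for one candidate
    source paper, scaled by a consistency penalty in (0, 1]."""
    weighted = sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)
    return weighted * penalty
```

When the three models agree, the penalty is 1.0 and the weighted average passes through unchanged; disagreement would shrink the combined score toward zero.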

✅ Strengths

The paper's primary strength lies in its innovative approach to combining multiple large language models (LLMs) for paper source tracing. The proposed multi-agent ensemble framework, which integrates Deepseek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro, demonstrates a clear improvement over the individual models, and the structured pipeline (XML preprocessing, advanced prompt engineering, and the multi-agent integration strategy) is well defined and logically presented.

The multi-agent integration strategy is a notable contribution in its own right. The intelligent default scoring mechanism, which assigns base scores and adjusts them based on each model's predictions, is a clever way to leverage the strengths of each model. The confidence score extraction further refines the results by considering the confidence levels of the LLM predictions, and the consistency penalty mechanism, which adjusts scores based on inter-model agreement, is a valuable addition that reduces the impact of potentially unreliable predictions.

The empirical results support the effectiveness of the approach. The reported MAP of 0.388 on PST-Bench represents a significant improvement over the baseline models, and the additional results on the KDD Cup 2024 dataset show that the method can enhance the performance of an existing leading method in a real-world setting. These empirical achievements, combined with the innovative methodology, make the paper a valuable contribution to the field of paper source tracing. The paper also includes a dedicated 'RELATED WORK' section, which discusses relevant research areas and specific competing methods, providing context for the paper's contributions.

❌ Weaknesses

While the paper presents a promising approach to paper source tracing, several weaknesses need to be addressed to strengthen its contributions.

First, the paper suffers from issues in writing and presentation. In Section 6.5, the text refers to "Figure 2," which is not present in the provided document; the missing figure makes it difficult to fully understand the results in that section and undermines the paper's clarity. This is a high-confidence issue, as the absence of the figure is directly verifiable. The explanation of the consistency penalty mechanism in Section 4.4.3 and the formula in Section 4.4.4 could also be clearer: the paper states that "penalty factors range from 0.1 (maximal disagreement) to 1.0 (minimal difference)," but the exact method for calculating these penalty factors and applying them to the prediction scores is never explicitly defined. This is a medium-confidence issue, as the general idea is presented but the implementation details are lacking. (A statement about probability distribution normalization in Section 4.4.4 was also raised in review but is not present in the paper; this may reflect a misreading, but it underscores the need for clearer mathematical formulations.)

Second, while the paper presents an innovative multi-agent integration strategy, the core idea of combining the outputs of multiple LLMs through a weighted ensemble is not entirely novel. The paper emphasizes its "advanced ensemble methodology" and "intelligent ensemble strategies," but the specific mechanisms can be seen as extensions of existing ensemble techniques: the intelligent default scoring, while effective, is essentially a form of weighted contribution in which the weights are determined by the individual model predictions, and the consistency penalty, while adding complexity, rests on the familiar idea of adjusting predictions based on inter-model agreement. The paper would benefit from a more thorough discussion of how its approach differs from existing ensemble methods and what it contributes beyond combining established techniques. This is a medium-confidence issue, as the paper does introduce some novel elements.

Finally, the core experimental validation relies on a single dataset, PST-Bench; the "further results on KDD Cup 2024" in Section 6.5 are only supplementary. This limits the generalizability of the findings and raises concerns about robustness: it is unclear how the method would perform on datasets with different characteristics, such as other domains or other types of scientific literature. The paper should include evaluations on additional datasets. This is a high-confidence issue, as the experimental setup clearly uses PST-Bench as the main evaluation dataset, and the lack of evaluation elsewhere is a significant limitation.

💡 Suggestions

To address the identified weaknesses, I recommend several concrete improvements.

First, the authors should thoroughly revise the paper for clarity and presentation. The missing "Figure 2" in Section 6.5 must be included, and the consistency penalty mechanism in Section 4.4.3 and the formula in Section 4.4.4 need a more detailed explanation, including a clear definition of how the penalty factors are calculated and how they are applied to the prediction scores. Clarifying the mathematical formulations and implementation details would resolve the confusion surrounding the consistency penalty and make the paper easier to follow.

Second, the authors should discuss the novelty of their method more thoroughly. While combining the outputs of multiple LLMs is not entirely new, the specific mechanisms introduced here, the intelligent default scoring and the consistency penalty, are valuable contributions. The authors should articulate how these mechanisms differ from existing ensemble methods, and compare against such methods empirically to demonstrate the advantages of their approach.

Third, the authors should evaluate the method on additional datasets, including datasets from different domains and with different characteristics, with a detailed analysis of performance and any limitations or challenges encountered. This would address the concerns about robustness and show that the approach applies to a wider range of paper source tracing tasks.

Finally, the authors should report the computational resources required by their method: training time, inference time, and memory requirements. This would help assess the practicality of the approach and ensure it is accessible to other researchers. Addressing these points would significantly improve the quality and impact of the paper.

❓ Questions

Based on my analysis, I have several questions that are crucial for a deeper understanding of the paper's methodology and findings.

First, regarding the consistency penalty mechanism: what is the exact mathematical formulation used to calculate the penalty factors? The paper states that they range from 0.1 to 1.0 but provides no equation or algorithm. How is agreement between the LLMs quantified, and how are those quantities translated into penalty factors?

Second, given that the core idea is to combine the outputs of multiple LLMs, how does the proposed method compare to other existing ensemble methods? The paper mentions an "advanced ensemble methodology" but offers no detailed comparison to established ensemble techniques. What specific advantages does the method offer, and what are its limitations?

Third, what are the characteristics of the evaluation datasets, PST-Bench and KDD Cup 2024, such as the distribution of paper topics, the average length of papers, and the complexity of the citation networks? How might these characteristics affect the method's performance, and how might it behave on datasets with different characteristics?

Finally, what are the computational costs of the proposed method (training time, inference time, memory), and how do they compare to the baseline models? Addressing these questions would clarify key aspects of the paper's methodology and findings and significantly enhance its overall quality and impact.

📊 Scores

Soundness: 1.75
Presentation: 1.75
Contribution: 2.0
Rating: 3.0

AI Review from ZGCA


📋 Summary

The paper addresses the Paper Source Tracing (PST) task: given a focal paper and its references, identify the most influential "source papers" and assign importance weights. The authors propose pst-auto-agent, a multi-agent LLM ensemble that integrates DeepSeek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro via a pipeline including XML preprocessing (Section 4.2), advanced prompt engineering (Counterfactual Reasoning, Idea DNA Matching, Multi-Role Socratic Dialogue; Section 4.3), and an intelligent ensemble with model-specific weights, an "Intelligent Default Scoring" fallback, and a Consistency Penalty Mechanism (Sections 4.4–4.5). Evaluated with MAP on PST-Bench (they cite Zhang et al., 2024), the ensemble achieves MAP 0.388, outperforming individual LLMs (Table 1). An ablation (Table 2) shows contributions from each component. The method is also reported to complement a top KDD Cup 2024 solution (Section 6.5).
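Since MAP is the headline metric throughout these reviews, a short sketch of how it is typically computed for ranked reference lists may help; the function names and data below are illustrative, not taken from the paper.

```python
def average_precision(ranked, relevant):
    """AP for one focal paper: `ranked` is the predicted ordering of
    reference ids, `relevant` is the set of annotated source papers."""
    hits, ap = 0, 0.0
    for k, ref in enumerate(ranked, start=1):
        if ref in relevant:
            hits += 1
            ap += hits / k  # precision at each relevant rank
    return ap / len(relevant) if relevant else 0.0

def mean_average_precision(predictions, ground_truth):
    """MAP over all focal papers (keys shared by both dicts)."""
    aps = [average_precision(predictions[p], ground_truth[p])
           for p in predictions]
    return sum(aps) / len(aps)
```

A reported MAP of 0.388 therefore means that, averaged across focal papers, true source papers sit fairly high but far from the top of the predicted rankings.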

✅ Strengths

  • Clear problem framing and motivation for PST; use of MAP with a temporal split (Sections 3 and 5) is appropriate and realistic.
  • Concrete engineering contributions: multi-agent ensemble with consistency penalty and default scoring; carefully thought-out XML processing pipeline (Sections 4.2–4.5).
  • Empirical improvements over strong LLM baselines, with a reasonably thorough ablation that quantifies each component’s impact (Section 6.4, Table 2).
  • Prompt engineering strategy is well-motivated (counterfactual reasoning, multi-role dialogue) and linked to measurable gains (Table 2).
  • Potential practical impact, including reported complementary performance in a competitive setting (KDD Cup 2024; Section 6.5).

❌ Weaknesses

  • Reproducibility is limited: exact prompt templates, the full ensemble algorithm specification (e.g., the function C(·) in Section 4.4.4), hyperparameter selection protocol for weights (Section 4.4), and default scoring edge cases are not provided.
  • The central claim of being "tuning-free" (Abstract, Section 1) is debatable since extensive, empirically optimized prompt engineering (Section 4.3) and empirically set ensemble weights (Section 4.4) constitute optimization choices; this should be reframed more carefully.
  • Comparative evaluation lacks non-LLM baselines common in citation/source tracing (e.g., citation count, BM25/textual similarity, bibliographic coupling/co-citation variants, GNN/GCN baselines); only per-LLM baselines are reported (Section 5.2, Table 1).
  • Ablations lack statistical reporting (no variance/confidence intervals, number of runs), and there is no qualitative error/success analysis to illuminate failure modes (Section 6.4).
  • Ambiguity around PST-Bench: the paper both cites Zhang et al. (2024) and at times writes as if introducing the dataset; authorship/availability and inter-annotator agreement are unclear (Sections 1, 3.2, Contributions).
  • Some technical inconsistencies/ambiguities: the consistency penalty uses P(i) = sum of C(|score differences|) but the text refers to D as the factor (Section 4.4.4); XML processing is claimed to ensure "100% reliability" (Section 4.2) yet later only "high reliability" is reported; details of the confidence extraction and default scoring pipeline are underspecified (Section 4.4.1–4.4.2).
  • Heavy dependence on closed, evolving APIs (GPT-5-2025-08-07, Gemini-2.5-pro) without versioned prompts or seeds makes exact replication difficult; the use period and model versions should be documented precisely (Sections 4.1, 6.1).

❓ Questions

  • Reproducibility: Will you release the exact prompt templates (for each model), the parsing/normalization scripts, and a reference implementation of the ensemble (including the exact function C(·) in Section 4.4.4 and all constants used in default scoring/penalty)?
  • Weights and tuning: How were the model weights (0.30/0.35/0.35) selected? Grid search on validation? Please report the search space, selection criterion, and whether any parameters were adjusted after seeing test results.
  • Prompt optimization: You mention evaluating over 1,000 human-annotated papers to identify the optimal prompt configuration (Section 4.3). On which split(s) were these experiments done? How did you avoid test leakage? Please provide the procedure and the final prompt variants verbatim.
  • PST-Bench authorship/availability: Do you author PST-Bench or are you using an existing dataset (Zhang et al., 2024)? If you are the authors, please clarify the relationship to Zhang et al. (2024) and provide dataset access details, licensing, and inter-annotator agreement. If not, please adjust the contribution claims accordingly.
  • Baselines: Can you add non-LLM baselines (e.g., citation counts, BM25/similarity to title/abstract/full text, bibliographic coupling/co-citation, simple GCN/heterogeneous graph baselines) to contextualize the gains?
  • Statistics and variability: Please report means with standard deviations or confidence intervals over multiple runs for MAP and ablations. How stable are results across seeds/API runs?
  • Qualitative analysis: Provide case studies showing (a) correctly identified primary sources with the model’s reasoning, (b) typical failure modes, and (c) how the consistency penalty corrects disagreements.
  • Penalty function details: What is the explicit form and range mapping of C(·)? How are pairwise differences aggregated and normalized? Is P(i) guaranteed to be in [0,1]?
  • Default scoring: The base 0.3 + 0.2 per model + 0.1 bonus can exceed 1 before capping. How sensitive is performance to these constants? Did you tune them on validation?
  • API dependency: Which exact API versions/dates were used? Are prompts/outputs deterministic (e.g., temperature=0)? How do you mitigate API drift over time?
  • KDD Cup integration: Can you report numbers (not only the leaderboard figure) quantifying the lift when integrating pst-auto-agent into English Hercules, and describe the integration details?
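The default-scoring arithmetic questioned above can be made concrete with a small sketch. The constants (base 0.3, +0.2 per agreeing model, +0.1 bonus, cap at 1.0) come from the review; the function shape and the condition that triggers the bonus are assumptions.

```python
def raw_default_score(model_flags, bonus=False):
    """Uncapped fallback score: base 0.3, plus 0.2 for each model that
    flags the reference as a source, plus an optional 0.1 bonus.
    (Constants from the review; structure is an assumption.)"""
    return 0.3 + 0.2 * sum(model_flags) + (0.1 if bonus else 0.0)

def default_score(model_flags, bonus=False):
    """Capped version: anything above 1.0 is clipped, so distinctions
    between high-agreement cases are lost at the ceiling."""
    return min(raw_default_score(model_flags, bonus), 1.0)
```

With all three models agreeing plus the bonus, the raw sum is 0.3 + 3 × 0.2 + 0.1 = 1.0, so the cap binds exactly at the top of the range; any further additive term would push it past 1, which appears to be the saturation the reviewer is probing.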

⚠️ Limitations

  • Reproducibility: Heavy reliance on closed-source LLM APIs and missing prompt/implementation details impede exact replication.
  • Generalization: The dataset domain is computer science; performance in other domains with different citation practices is unknown.
  • Attribution risks: Automating identification of "primary sources" may over-attribute or mis-attribute credit, potentially impacting evaluations of scholarly impact if used without human oversight.
  • Cost/latency: Multi-LLM querying can be expensive and slow at scale; no cost/throughput analysis is provided.
  • Data leakage/memorization: Closed LLMs trained on scholarly corpora may implicitly encode citation knowledge, conflating reasoning with memorization; no analysis is provided.
  • XML robustness: Despite claims of high reliability, malformed or ambiguous XML still requires manual intervention (Section 4.2).

🖼️ Image Evaluation

Cross‑Modal Consistency: 28/50

Textual Logical Soundness: 22/30

Visual Aesthetics & Clarity: 7/20

Overall Score: 57/100

Detailed Evaluation (≤500 words):

Visual ground truth (image‑first):

• Figure 1: “Muti_Agent Framework” – image missing/garbled (“[ width=.8 ] figs/… figure.png”); no panes visible.

• Figure 2: “Leaderboard of KDD Cup 2024” – image placeholder “[width=.8]figs/pst-leaderboard.png”; not viewable.

• Table 1: Baseline performance (MAP). Trend: ensemble (0.388) > Gemini (0.318) ≈ GPT (0.315) > DeepSeek (0.246).

• Table 2: Ablations. Trend: removing ensemble parts notably hurts MAP; full model best.

1. Cross‑Modal Consistency

• Major 1: Core figures are missing/illegible, blocking verification of the framework and leaderboard claims. Evidence: “Figure 1: Muti_Agent Framework” and “[width = .8]figs/pst-leaderboard.png”.

• Major 2: Figure numbering mismatch (“Figure 1” vs “Figure4.1”), creating ambiguity about the architecture diagram. Evidence: “An illustrative diagram … Figure4.1.”

• Major 3: Consistency penalty is described as “maximum pairwise” but the equation sums all pairs and mixes C/D. Evidence: “A dynamic penalty … based on maximum pairwise” vs “P(i)= C(|s_deepseek−s_gpt|)+…; where 𝖣 is the … Factor”.

• Major 4: “Six‑component pipeline” lists only five components, confusing the system overview. Evidence: “six‑component pipeline: XML Preprocessing, Prompt Engineering, Multi-Agent Prediction, Intelligent Ensemble, and Prediction Method.”

• Minor 1: Model naming inconsistent: “Deepseek‑R1‑250528/0528/DeepSeek‑R1.” Evidence: Sec. Abstract vs Sec. 4.4 vs Table 1.

• Minor 2: JSON key formatting inconsistent (“confidence Scores” with space). Evidence: Sec. 4.4.1 example.

2. Text Logic

• Major 1: Penalty function not fully specified (C undefined mapping; no normalization; contradiction with “maximum”). Evidence: “penalty factors ranging from 0.1 to 1.0” + undefined C in 4.4.4.

• Minor 1: “Tuning‑free baseline” yet prompts are optimized on >1,000 labeled papers; scope of tuning not quantified. Evidence: “optimal prompt configuration achieving the highest F1‑score.”

• Minor 2: Subset selection of PST‑Bench not analyzed for bias; implications for generalization not discussed. Evidence: “use a subset of 1,576 papers … criteria (1)‑(3).”

3. Figure Quality

• Major 1: Figures are absent/garbled; “Figure‑Alone” test fails; cannot infer messages without captions. Evidence: garbled LaTeX path in Sec. 3.2; placeholder in Sec. 6.5.

• Minor 1: Typo in figure title (“Muti_Agent”) and uneven typography in equations (spaced letters) hinder readability. Evidence: “Muti_Agent Framework”; “s _ {g p t}”.

Key strengths:

• Clear, useful ablation isolating major contributors.

• Simple, reproducible default‑scoring heuristic; tables report consistent MAP improvements.

• Practical pipeline with XML fallbacks and prompt discipline.

Key weaknesses:

• Missing/illegible figures and numbering errors block verification of core claims.

• Penalty function/notation inconsistencies; pipeline component count mismatch.

• Model/version naming inconsistent; limited reproducibility details for closed APIs and prompt tuning.

📊 Scores

Originality: 2
Quality: 2
Clarity: 2
Significance: 2
Soundness: 2
Presentation: 2
Contribution: 2
Rating: 4

AI Review from SafeReviewer


📋 Summary

This paper introduces a novel approach to Paper Source Tracing (PST), a task focused on identifying the most influential cited papers that contribute to a focal paper's central ideas or methodologies. The authors propose a multi-agent ensemble framework, termed `pst-auto-agent`, which leverages three state-of-the-art large language models (LLMs): Deepseek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro. The core idea is to combine the strengths of these models through a structured pipeline involving XML preprocessing, advanced prompt engineering, and an intelligent ensemble strategy. The prompt engineering incorporates counterfactual reasoning, Idea DNA matching, and multi-role Socratic dialogue to elicit detailed responses from the LLMs. The ensemble strategy includes a consistency penalty mechanism to mitigate inconsistent predictions and an intelligent default scoring system for when explicit confidence scores are unavailable.

The authors also introduce PST-Bench, a dataset of 2,141 meticulously labeled computer science publications, which serves as a benchmark for evaluating PST methods. The experimental results demonstrate that `pst-auto-agent` achieves a Mean Average Precision (MAP) of 0.388 on PST-Bench, outperforming the individual baseline models. Furthermore, the authors show that integrating their method with the top-ranked English Hercules framework in the KDD Cup 2024 enhances its performance, securing 4th place on the leaderboard and ranking 1st among all GPU-free methods.

This work highlights the potential of combining multiple LLMs and advanced prompting techniques for complex academic information tracing tasks, offering a tuning-free baseline for the PST problem that does not require feature engineering.

❌ Weaknesses

After a thorough examination of the paper, I have identified several key weaknesses that warrant attention.

First, the novelty is limited: the paper primarily ensembles existing LLMs rather than introducing a fundamentally new approach to the PST problem. The authors claim a novel multi-agent ensemble architecture, but the core components are established LLMs, and the specific implementation of the ensemble strategy is not fully elaborated. The paper states, "We propose a novel multi-agent ensemble architecture designed to tackle the PST problem effectively. Our system leverages the combined strengths of state-of-the-art large language models (LLMs): Deepseek-R1-250528, GPT-5-2025-08-07, and Gemini-2.5-pro." This highlights the reliance on existing models, and the lack of detail about the ensemble mechanism makes its true novelty hard to assess. Compounding this, the paper does not compare against the state-of-the-art KDD Cup 2024 solutions mentioned in its related work section ("...paper source tracing have witnessed significant methodological innovations, particularly within the context of the KDD Cup 2024 OAG Challenge"); these methods are never used as direct baselines, which limits the ability to gauge the true advancement of the proposed method.

Second, the experimental evaluation is insufficient: the only baselines are the individual LLMs used in the ensemble ("We implement several baseline models for comparison to validate the design of our approach: DeepSeek-R1-0528, GPT-5-2025-08-07, and Gemini-2.5-pro"), rather than a wider range of existing PST methods. This narrow selection makes it difficult to place the proposed method within the broader landscape of PST techniques.

Third, key components are underspecified. The prompt engineering strategy, described as a "unified prompt architecture incorporating multiple advanced reasoning frameworks" (Counterfactual Reasoning, Idea DNA Matching, Multi-Role Socratic Dialogue), is not accompanied by the actual prompts or implementation details, making the method difficult to reproduce or fully understand. Likewise, the consistency penalty mechanism is described only in general terms: "A dynamic penalty function is applied based on maximum pairwise score differences between models, with penalty factors ranging from 0.1 (maximal disagreement) to 1.0 (minimal difference) to mitigate inconsistent predictions." Without the exact mathematical formulation, its effectiveness and limitations cannot be assessed.

Fourth, the claim of a "tuning-free" method is misleading, as the ensemble weights are empirically set, which is itself a form of tuning: "Empirically, we set the weight for each LLM as follows: DeepSeek-R1-0528 (0.3), GPT-5 (0.35), Gemini-2.5-pro (0.35)." This contradicts the tuning-free claim and raises questions about the generalizability of the chosen weights.

Finally, the reliance on a single dataset limits the generalizability of the results. The paper states, "To address the limitations of existing methods and provide a robust benchmark for the PST problem, we introduce PST-Bench (Zhang et al., 2024), a novel dataset comprising 2,141 meticulously labeled computer science publications." While the KDD Cup 2024 evaluation provides some evidence of broader applicability, the primary evaluation is on a single domain-specific dataset. Collectively, these weaknesses suggest that while the paper presents a promising approach, it requires further refinement and more rigorous evaluation to fully establish its contribution to the field.
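The quoted penalty description can be read as a linear map from the largest pairwise score gap onto [0.1, 1.0]. The sketch below is one plausible interpretation, not the paper's actual formula, which is precisely what the reviews fault the paper for leaving unspecified.

```python
from itertools import combinations

def consistency_penalty(scores, lo=0.1, hi=1.0):
    """One possible reading of the paper's penalty (an assumption): map
    the maximum pairwise gap between model scores (each assumed in
    [0, 1]) linearly onto [lo, hi], so identical scores give hi and
    maximal disagreement gives lo."""
    max_gap = max(abs(a - b) for a, b in combinations(scores, 2))
    return hi - (hi - lo) * max_gap
```

Even under this reading, open questions remain (e.g., whether gaps are aggregated by maximum or by summation, and whether the result is normalized), matching the inconsistencies the ZGCA review flags in Section 4.4.4.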

💡 Suggestions

To address the identified weaknesses, I recommend several concrete improvements.

First, the authors should significantly expand the experimental evaluation by including a wider range of baseline methods, particularly the top-performing KDD Cup 2024 solutions mentioned in the related work section, along with more traditional baselines such as TF-IDF or simpler citation-based methods, to provide a more robust assessment relative to the state of the art.

Second, the authors need to explain the method's implementation in more detail: the specific prompts used in the prompt engineering strategy, the mathematical formulation and implementation of the consistency penalty mechanism, the roles of each agent in the ensemble, and the criteria for selecting these particular models. This level of detail is crucial for reproducibility and for understanding the method's inner workings.

Third, the authors should re-evaluate the "tuning-free" claim. Empirically setting ensemble weights constitutes a form of tuning; the paper should acknowledge this, explain how the weights were determined, explore the method's sensitivity to different weight settings, and consider more principled approaches to weight optimization.

Fourth, the evaluation should be expanded to additional datasets, from different domains or different types of academic literature, to address the concern that the method is overfit to PST-Bench.

Fifth, the authors should analyze the results in more depth, including a breakdown of performance by paper or citation type, to identify the method's strengths and weaknesses, and should report the computational cost of the approach relative to the baselines.

Finally, the authors should fix the formatting issues in the paper, such as the excessive spacing between paragraphs and the placement of Figure 1, which detract from readability and overall quality. Addressing these points would significantly strengthen the work and make it a more substantial contribution to the field.

❓ Questions

Several key questions arise from my analysis of this paper.

First, what specific prompts were used for each of the reasoning frameworks (counterfactual reasoning, Idea DNA matching, and multi-role Socratic dialogue)? Providing them would clarify the prompt engineering strategy and its impact on the results.

Second, what is the precise mathematical formulation of the consistency penalty mechanism? The paper describes it conceptually but omits the details needed for implementation and evaluation.

Third, how were the ensemble weights (0.3, 0.35, 0.35) determined? The paper says they were "empirically set" without describing the process used to arrive at these values.

Fourth, what is the computational cost of the proposed method compared to the baseline models? The paper provides no analysis of the resources required by the ensemble approach.

Fifth, how does the method perform on different types of papers or citations? A finer-grained analysis could reveal biases or limitations.

Sixth, how does the method compare to the top-performing KDD Cup 2024 solutions on the same evaluation set? This comparison would better situate its performance relative to the state of the art.

Finally, what are the limitations of the proposed method, and what are the avenues for future research? The paper does not explicitly discuss either. Addressing these questions would provide a more complete and nuanced understanding of the method and its potential impact on the field.

Rating: 3.67
