📋 AI Review from DeepReviewer will be automatically processed
📋 AI Review from ZGCA will be automatically processed
The paper proposes a hybrid peer-review system that integrates Author-Assisted Evaluation (AAE) and Community-Guided Review (CGR) within an LLM-mediated pipeline. In this system, authors provide self-scores and justifications (AAE), community members provide pre-review ratings (CGR), and the LLM synthesizes these signals to produce a review. Notably, the LLM is not provided the paper’s full text; instead, it "picks a random ordinate" and generates a caption and rating (Section 2: "Our work, however, provides a rating only by asking the LLM to pick a random ordinate without providing it with the full text of the paper."). The paper compares against two rule-based LLM baselines: a "single-LLM" reviewer and ReviewerGPT (Liu & Shah, 2023). Using data from AAAI 2024 and CoRLL 2024 (and ICML 2023 PC data for correlation analyses), the authors report that reviewers preferred reviews produced by their system on certain metrics (e.g., "Reproducibility and Quality," "Review Quality," and alignment), and that single-LLM reviews were more often rejected by the PC than the proposed system’s reviews. The evaluation includes survey data from 9 reviewers and various acceptance-rate and correlation analyses.
Cross‑Modal Consistency: 20/50
Textual Logical Soundness: 12/30
Visual Aesthetics & Clarity: 8/20
Overall Score: 40/100
Detailed Evaluation (≤500 words):
Visual ground truth (tables only; no figures found)
• Table 1/2 blocks (HTML): multi-panel acceptance/score summaries for AAAI 2024, CoRLL 2024, ICML (2023/2024), with columns REJECTED/ACCEPTED, MEAN/STDEV/p‑value; rows AAE+CGR (duplicated), HUMAN; legends unclear.
• Table with REJ/ACC/REJ+/ACC‑ (two versions): means, stdevs, p‑values; dataset blocks AAAI/CoRLL; no definitions in-table.
• Tables 3–4: labeled “opinion-based” and “correlations,” but show similar numeric blocks; inconsistent labels/years.
• Tables 5–8: “Correlations” and “Rebuttal Alignment,” small two-column correlation tables with p‑values and R/τ; repeated/duplicated content, year mismatch.
1. Cross‑Modal Consistency
• Major 1: Acceptance‑rate claim vs table contents mismatch; tables show means/p‑values, not acceptance rates. Evidence: “Table 2: Acceptance rates …” and “singleellm … 3% and 3.6%” (Sec 3.1).
• Major 2: Numerically inconsistent correlations (near‑1 correlations with non‑significant p‑values). Evidence: Table 5 “p-value 0.365, R 0.990”.
• Major 3: Year/dataset mismatch (ICML 2023 in text vs ICML 2024 in tables). Evidence: Table caption block shows “ICML 2024” while Sec 2.1 uses “ICML 2023”.
• Major 4: Survey precision implausible for n=9 (very small ±0.007 SE). Evidence: Sec 3.2 “9 reviewers” and “0.427 ± 0.007”.
• Minor 1: Table numbering/order inconsistent (Table 2 described before Table 1).
• Minor 2: Repeated unlabeled “AAE+CGR” rows under the same dataset in a single table.
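Major 2 and Major 4 above follow from simple arithmetic. A minimal sketch of both checks, assuming the standard t-test for a Pearson correlation and that the reported “±” denotes a standard error:

```python
import math

# Check 1: is Pearson R = 0.990 compatible with p = 0.365?
# Under the usual t-test for a correlation, t = r*sqrt(n-2)/sqrt(1-r^2).
r = 0.990
for n in (3, 5, 10):
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    print(f"n={n}: t = {t:.2f}")
# Even at n=3 (df=1), t ≈ 7.02, which corresponds to a two-sided
# p ≈ 0.09, not 0.365; for any larger n the p-value only shrinks,
# so R = 0.990 and p = 0.365 cannot both be correct.

# Check 2: what spread would "0.427 ± 0.007" imply for n = 9 raters?
n, se = 9, 0.007
implied_sd = se * math.sqrt(n)  # SE = SD / sqrt(n)
print(f"implied SD = {implied_sd:.3f}")
# ≈ 0.021 — far tighter than is plausible for 9 human ratings on any
# ordinary review scale.
```

If the paper’s “±” denotes a standard deviation rather than a standard error, the authors should say so; the value is implausible either way for n = 9.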
2. Textual Logical Soundness
• Major 1: Core terms “owner” vs “author” used inconsistently, altering meaning of analyses. Evidence: Sec 3.4 “owner’s rating is higher than the author’s own score”.
• Major 2: Methodological claim lacks supporting experiment: “Pick Random Ordinates … better than randomly selecting a sentence” without evidence. Evidence: Sec 2 “This approach is better …”.
• Minor 1: “two‑enzyme analysis” is unclear/likely typo for a statistical method. Evidence: Sec 3.3 “two‑enzyme analysis”.
• Minor 2: Algorithms contain stray “=0” and broken step numbering. Evidence: Algorithm 1 “author name =0”; lines 10–12 misnumbered.
3. Visual Aesthetics & Clarity
• Major 1: Tables fail figure‑alone test; undefined abbreviations (REJ/ACC/REJ+/ACC‑), missing method labels, duplicated rows. Evidence: Multi‑panel HTML tables with no in‑table legends.
• Minor 1: Dataset year labels inconsistent within tables (ICML 2023/2024).
• Minor 2: Captions do not explain what MEAN represents (scores? rates?).
Key strengths:
• Ambitious system proposal (AAE+CGR) with multi‑conference datasets.
• Attempt to triangulate with PC scores, reviewer surveys, and correlations.
Key weaknesses:
• Severe figure–text mismatches (acceptance rates vs means; year labels).
• Statistical inconsistencies and implausible precision for small n.
• Terminology ambiguity (“owner” vs “author”) undermines analyses.
• Tables lack self‑contained legends/definitions; duplicated/misnumbered content.
Recommended fixes (highest impact):
• Recompute and clearly separate acceptance rates vs score means; fix captions.
• Standardize dataset naming (ICML 2023) and correct correlation tables.
• Define REJ/ACC/REJ+/ACC‑ in-table; label all rows (AAE+CGR, single‑LLM, HUMAN).
• Clarify “owner” terminology; detail “random ordinate” experiments or remove the claim.
📋 AI Review from SafeReviewer will be automatically processed
This paper introduces a novel system for academic peer review, aiming to enhance objectivity, efficiency, and community involvement. The proposed system integrates author-assisted evaluation (AAE) and community-guided review (CGR) into the traditional review process for AI conferences. The core idea is to leverage large language models (LLMs) to synthesize reviews, incorporating input from both authors and the broader community. Specifically, authors provide a self-assessment of their work, including a score and justification, which is then fed into the LLM reviewer. Additionally, the system gathers ratings from a community of reviewers, which are also incorporated into the LLM's review generation process.

The authors evaluated their system using data from three major AI conferences, comparing it against a baseline single-LLM review system. Their results suggest that the proposed system produces reviews of higher quality, with improved reproducibility and better alignment with reviewer opinions, leading to a lower rejection rate after major revisions. The authors also conducted a survey where human reviewers preferred the system-generated reviews over single-LLM reviews.

The paper's central contribution lies in the specific combination of AAE and CGR within an LLM-based review system, aiming to address the limitations of traditional peer review, such as subjectivity and time consumption. The authors argue that their system offers a more robust and reliable approach to evaluating academic papers. However, the paper's presentation and experimental design have several limitations that need to be addressed to fully validate the proposed approach. The paper's findings suggest that incorporating author input and community feedback can enhance the quality of LLM-generated reviews, but further work is needed to address the identified weaknesses and to explore the generalizability of the system to other domains.
The paper's primary strength lies in its exploration of a novel approach to academic peer review: integrating author-assisted evaluation and community-guided review within an LLM-based system. This is a timely and relevant area of research, given the increasing interest in using AI to augment or replace traditional processes, and the specific combination of AAE and CGR is a distinctive contribution, leveraging the insights of both authors and the broader community to improve review quality. The paper also presents empirical evidence that the proposed system generates reviews of higher quality, with improved reproducibility and better alignment with reviewer opinions, than a single-LLM baseline; the use of real conference data from AAAI, CoRLL, and ICML adds credibility to these findings, and the survey of human reviewers, which indicates a preference for the system-generated reviews, further supports the claims. Finally, the paper's focus on addressing the limitations of traditional peer review, such as subjectivity and time consumption, and on applying LLMs to the review process in a real-world setting, is commendable and points to a promising direction for future research.
After a thorough examination of the paper, I've identified several significant weaknesses that warrant careful consideration.

First, the paper's presentation is notably poor, with numerous formatting errors, typos, and inconsistencies that detract from its readability and credibility. For instance, the title on page 2 is obscured, and there are instances of incorrect formatting, such as the extra space before a citation on page 2. The use of terms like "al" instead of "et al." and the presence of unusual characters further contribute to the impression of a lack of attention to detail. These issues, while seemingly minor, collectively undermine the paper's professional appearance and make it harder to follow the arguments.

Second, the paper lacks a clear and detailed explanation of the proposed system. While the authors introduce the concepts of Author-Assisted Evaluation (AAE) and Community-Guided Review (CGR), the precise mechanisms of how these components interact with the LLM reviewer are not fully elaborated. The paper does not provide a comprehensive system diagram or a step-by-step description of the process, making it difficult to understand the exact flow of information and the roles of each component. This lack of clarity hinders the reader's ability to fully grasp the proposed approach.

Third, the paper's experimental design has several limitations. The paper does not provide sufficient details about the LLM used, the prompts employed, or the specific instructions given to human participants. This lack of transparency makes it difficult to assess the validity and generalizability of the findings. The paper also lacks a clear description of the metrics used, including how they are computed and what constitutes a good value. The absence of a baseline for the "owner" scores and the lack of a detailed explanation of the Z-test further limit the interpretability of the results. The paper's reliance on a single baseline (single-LLM) also limits the scope of the evaluation: the paper does not compare its system against other existing AI-assisted review systems or traditional human review, making it difficult to assess its relative performance.

Fourth, the paper's claims about the generalizability of the system to other domains are not supported by any empirical evidence. The paper is exclusively evaluated in the context of AI conferences, and there is no discussion of how the system might be adapted to other fields with different review standards and practices. This lack of evidence limits the paper's impact and raises questions about the system's broader applicability.

Fifth, the paper does not adequately address the potential for bias in the system. The paper does not discuss how the system mitigates biases that may be present in the training data of the LLM or how it handles situations where the community review is itself biased. The absence of a discussion on ethical considerations further limits the paper's practical relevance.

Finally, the paper's writing style is often unclear and difficult to follow. The use of undefined jargon, such as "owner score," and the lack of clear explanations for key concepts make it challenging for the reader to fully understand the paper's contributions. The paper's overall structure and organization also contribute to this difficulty. These weaknesses, taken together, significantly limit the paper's impact and raise questions about the validity and generalizability of its findings. The lack of clarity, the limitations in the experimental design, and the absence of a thorough discussion of potential biases all need to be addressed to strengthen the paper's contribution.
To address the identified weaknesses, I recommend several concrete improvements.

First, the authors should thoroughly revise the paper to correct all formatting errors, typos, and inconsistencies: ensure the title is not obscured, format citations correctly, remove stray characters, and proofread the paper for grammatical and spelling errors.

Second, the authors should explain the proposed system in more detail. This should include a comprehensive system diagram illustrating the flow of information and the roles of each component, a step-by-step description of how the AAE and CGR components interact with the LLM reviewer, and clear definitions, with examples, of key terms such as "owner score."

Third, the authors should document the experimental design more fully: the LLM used, the prompts employed, the specific instructions given to human participants, and a clear description of the metrics, including how they are computed and what constitutes a good value. They should also provide a baseline for the "owner" scores and a detailed specification of the Z-test, and consider additional baselines, such as other existing AI-assisted review systems or traditional human review.

Fourth, the authors should provide empirical evidence for their claims about the generalizability of the system, for example by conducting experiments in other fields with different review standards and practices, and should discuss how the system might be adapted to other domains and what challenges might arise.

Fifth, the authors should address the potential for bias in the system: how it mitigates biases that may be present in the LLM's training data, how it handles situations where the community review is itself biased, and the ethical considerations associated with using LLMs in the peer review process.

Finally, the authors should improve the writing style and overall organization, using clearer and more concise language, avoiding undefined jargon, explaining all key concepts, and ensuring the paper is well-structured and easy to follow. Addressing these weaknesses would significantly strengthen the paper and enhance its contribution to the field.
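On the Z-test recommendation above: assuming the test in question compares rejection proportions between review systems (the paper does not say), the standard two-proportion form the authors could specify is sketched below. The counts are illustrative only, not taken from the paper:

```python
import math

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pooled two-proportion z statistic for rates x1/n1 vs x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Illustrative counts only: 12 of 200 reviews rejected vs 30 of 200.
z = two_proportion_z(12, 200, 30, 200)
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # normal approximation
print(f"z = {z:.2f}, p = {p_two_sided:.4f}")
```

Reporting the per-system counts alongside such a test would make the paper's p-values directly checkable by readers.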
Several key questions arise from my analysis of this paper.

1. What specific mechanisms are in place to ensure that the author-assisted evaluation (AAE) does not introduce bias into the review process? Given that authors are evaluating their own work, how does the system mitigate the potential for inflated scores or overly positive assessments?
2. How does the community-guided review (CGR) component handle situations where the community review is itself biased or influenced by factors unrelated to the quality of the paper? What steps are taken to ensure that the community feedback is representative and reliable?
3. What specific LLM was used in the experiments, and what were the exact prompts employed to generate the reviews? Providing these details is crucial for reproducibility and for understanding the limitations of the system.
4. How was the survey of human reviewers conducted, and what were the specific instructions given to the participants? Understanding the survey methodology is essential for interpreting the results and assessing the validity of the findings.
5. What are the limitations of the system, and under what conditions might it fail to produce high-quality reviews? The paper should acknowledge these limitations and discuss the scenarios where the approach might not be effective.
6. How does the system handle situations where the author's self-assessment and the community feedback are in conflict? What mechanisms are in place to resolve such disagreements and ensure that the final review is fair and objective?
7. What are the ethical implications of using LLMs in the peer review process, and how does the system address issues such as transparency, accountability, and fairness? The paper should include a thorough discussion of these considerations.