2511.0009 A PILOT STUDY EVALUATING LARGE LANGUAGE MODELS AS REVIEWERS AT ACADEMIC CONFERENCES v1

🎯 ICAIS2025 Accepted Paper

🎓 Meta Review & Human Decision

Decision:

Accept

Meta Review:

AI Review from DeepReviewer

📋 Summary

This paper introduces a novel peer review system designed to enhance objectivity, efficiency, and community involvement in the evaluation of academic papers, particularly within the context of AI conferences. The core of the proposed system lies in the integration of Author-Assisted Evaluation (AAE) and Community-Guided Review (CGR). In AAE, authors provide an initial assessment of their own work, including a score and justification, which is then fed into the review process. CGR leverages the broader community by collecting ratings from a pool of reviewers, which are combined with a score generated by a Large Language Model (LLM). The final review score is the average of the author's score, the community ratings, and the LLM's score.

The authors evaluate their system using data from three major AI conferences and a survey of reviewers, finding that its reviews are superior to single-LLM-based reviews, with reduced subjectivity and higher quality. The paper also suggests that single-LLM-based reviews are more likely to be rejected by the program committee after major author revisions. The authors conclude that their system offers a promising way to mitigate the arbitrariness of current peer review and can serve as a catalyst for further exploration of new review systems.

The study's significance lies in its attempt to address the inherent limitations of traditional peer review by incorporating author input and community feedback while also leveraging the capabilities of LLMs. The paper acknowledges certain limitations, such as insufficient resources to fully implement all components of the system and the need for further investigation into potential biases. The work is a step toward a more objective, community-driven approach to academic peer review, but it also highlights the need for further research and refinement of such systems.
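The averaging rule described in this summary can be sketched in a few lines. The function name, the equal weighting, and the example scores are illustrative assumptions; the summary does not state explicit weights:

```python
# Hypothetical sketch of the aggregation described above: the final review
# score is the plain average of the author's self-score, the mean community
# rating, and the LLM's score. Equal weighting is an assumption.
def final_review_score(author_score, community_ratings, llm_score):
    community_mean = sum(community_ratings) / len(community_ratings)
    return (author_score + community_mean + llm_score) / 3

# e.g. an author self-score of 7, community ratings of 5 and 6, LLM score of 4
print(final_review_score(7.0, [5.0, 6.0], 4.0))  # 5.5
```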

✅ Strengths

The paper's primary strength lies in its innovative approach to academic peer review, which combines author-assisted evaluation with community-guided review and LLM-generated scores. This hybrid model is a significant departure from traditional methods, aiming to enhance objectivity and reduce the subjectivity often associated with peer review. The integration of author input, while potentially problematic, is presented as a way to provide context and reduce the arbitrariness of the review process, and the community ratings introduce a broader perspective that may mitigate individual biases and incorporate diverse viewpoints.

The decision to evaluate the system using data from three major AI conferences and a survey of reviewers provides a strong empirical basis for the claims. This real-world data, coupled with statistical analysis, lends credibility to the findings: the authors provide evidence that their system outperforms single-LLM-based reviews in quality and in alignment with program committee and author opinions.

The paper is well structured and easy to follow, with clear definitions of the proposed system and detailed explanations of its components; the experimental setup is well described, and the results are presented clearly and concisely. By proposing a review system that is more objective, efficient, and community-guided, the work has the potential to inspire exploration of new review systems and to reduce the arbitrariness of the current process. The authors also acknowledge the limitations of their work, a sign of intellectual honesty and a commitment to rigorous research.

The paper's focus on addressing the challenges of traditional peer review, such as time consumption and subjectivity, is a further strength, as is the attempt to leverage LLMs in a nuanced way rather than relying on them alone.

❌ Weaknesses

After a thorough examination of the paper, I've identified several key weaknesses that warrant careful consideration.

First, the paper lacks a clear and detailed explanation of how the Author-Assisted Evaluation (AAE) and Community-Guided Review (CGR) components are integrated to improve the review process. While the paper describes the sequential nature of these components, with the author's input being provided to the LLM before community ratings are considered, the exact mechanism of how the LLM uses the author's score and justification remains unclear. The paper states that the LLM uses the author's input to generate "more informed and less subjective reviews," but it does not provide a detailed algorithmic description of this process. Furthermore, the paper does not specify how the author's input is weighted relative to the community's ratings. The final score is presented as an average of the author's score, community ratings, and the LLM's score, implying equal weighting, but this is not explicitly stated or justified. This lack of clarity raises concerns about the potential for bias introduced by the author's self-assessment: the paper does not fully explore the potential for authors to inflate their own scores or how this might skew the overall review process. The absence of a detailed explanation of the integration process and the lack of bias mitigation strategies are significant limitations. My confidence in this weakness is high, as the paper's description of the integration process is high-level and lacks the detail needed to fully understand the system's inner workings.

Second, the paper does not provide a detailed comparison of the proposed system with existing peer review systems. While the paper mentions the limitations of traditional peer review, it lacks a direct quantitative or qualitative comparison against it or against alternative peer review systems, such as open review or post-publication review. The paper focuses primarily on comparing its system to a single-LLM baseline and a specific prior work (ReviewerGPT), but it does not provide a comprehensive analysis of how the proposed system compares to other approaches in terms of effectiveness, efficiency, and objectivity. The paper does not address key metrics such as time to review, reviewer satisfaction, or the overall quality of the reviews produced by different systems. This lack of comparative analysis makes it difficult to assess the true value and potential impact of the proposed system. My confidence in this weakness is high, as the experimental section focuses on comparisons with LLM-based systems, and the Related Work section lacks a detailed comparative analysis with traditional and other alternative peer review systems.

Third, the paper's experimental design is limited by the fact that the authors did not have the opportunity to analyze the impact of allowing reviewers to see the author's name versus withholding this information. This is a crucial aspect of the peer review process, as it can significantly affect the objectivity of the reviews. The paper acknowledges the possibility of single-blind review within its algorithm, but the experimental design does not explicitly test the impact of revealing author names on review objectivity. The paper also does not explore different reviewer assignment strategies, such as assigning reviewers based on their expertise versus assigning them randomly. This lack of experimentation limits the paper's ability to draw strong conclusions about the effectiveness of the proposed system. My confidence in this weakness is high, as the paper's algorithm includes a conditional for single-blind review, but the experiments do not explicitly test the impact of blinding or different assignment strategies.

Finally, the paper's writing style is not very engaging, and the authors could do a better job of highlighting the significance of their work and its potential impact on the academic community. The paper lacks a clear and compelling narrative that would motivate the reader to engage with the proposed system. It also does not provide a thorough discussion of the related work or clearly articulate how the system differs from and improves upon existing approaches, and it would benefit from a more nuanced discussion of the trade-offs between the proposed system and the traditional peer review process. My confidence in this weakness is medium, as the paper follows a standard academic structure, but the narrative could be more compelling, and the comparison with existing work could be more nuanced.

💡 Suggestions

To address the identified weaknesses, I recommend several concrete improvements.

First, the authors should provide a detailed algorithmic description of how the Author-Assisted Evaluation (AAE) and Community-Guided Review (CGR) components interact. This should include a clear explanation of how the author's self-assessment is incorporated into the overall review process, including any weighting or normalization applied. The authors should also discuss potential biases introduced by the author's input and how the system mitigates these biases. For example, the paper could describe a mechanism that down-weights author input if it significantly deviates from the community's assessment, or a calibration method that ensures consistency across different authors. A concrete example of how the system handles a specific review scenario would also help illustrate the process.

Second, the authors should include a more comprehensive analysis of the proposed system's performance relative to other methods. This should include a quantitative comparison of key metrics such as review quality, time to review, and reviewer satisfaction, along with a discussion of the advantages and disadvantages of the proposed system compared to traditional peer review and other automated systems. For example, the paper could compare the proposed system to a traditional peer review process in terms of the number of reviews required, the time taken to complete the review process, and the quality of the reviews produced; a table summarizing these comparisons would be beneficial. The authors should also discuss the limitations of the proposed system and potential areas for future improvement.

Third, the authors should conduct experiments to analyze the impact of allowing reviewers to see the author's name versus withholding it. This could involve comparing the objectivity of reviews under single-blind and double-blind conditions. The authors should also analyze the impact of different reviewer assignment strategies on review quality, for example by comparing the system's performance when reviewers are assigned based on expertise versus randomly. This would provide valuable insight into the robustness of the proposed system under different conditions.

Fourth, the authors should enhance the writing style to create a more engaging narrative and clearly articulate the significance and impact of their work. The paper should also provide a more thorough discussion of the related work, clearly articulate how the system differs from and improves upon existing approaches, and offer a more nuanced discussion of the trade-offs relative to the traditional peer review process.

Finally, the authors should consider a more in-depth analysis of the potential biases introduced by their system. While the paper mentions author-assisted evaluation (AAE) and community-guided review (CGR), it does not fully explore the potential for these components to introduce bias. For example, authors might overestimate the quality of their own work, leading to inflated scores in the AAE phase; similarly, the community might be biased towards certain institutions or researchers, which could skew the results of the CGR phase. To address these concerns, the authors should conduct a thorough analysis of the potential biases and propose methods to mitigate them. This could involve comparing the scores assigned by authors and the community, analyzing the distribution of scores across different institutions, and exploring techniques such as blind review to reduce bias. Furthermore, the authors should consider the impact of the community's composition on the review process: if the community is not diverse or representative of the broader research community, reviews could be biased or unfair. The authors should discuss how they would ensure that the community is diverse and representative, and how they would address any biases arising from its composition.
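The down-weighting mechanism suggested above could take many forms. Here is one minimal, hypothetical sketch in which the author's weight shrinks as their self-score drifts from the community mean; the function name and the `tolerance` and `floor` parameters are invented for illustration, not taken from the paper:

```python
def weighted_author_score(author_score, community_mean, tolerance=1.0, floor=0.2):
    """Blend author and community scores, shrinking the author's weight as
    their self-score deviates from the community mean (illustrative only)."""
    deviation = abs(author_score - community_mean)
    # weight decays from 1.0 (no deviation) toward a fixed floor
    weight = max(floor, 1.0 - deviation / (deviation + tolerance))
    return weight * author_score + (1.0 - weight) * community_mean

print(weighted_author_score(5.0, 5.0))  # no deviation: author keeps full weight
print(weighted_author_score(9.0, 5.0))  # large deviation: pulled toward 5.0
```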

❓ Questions

Based on my analysis, I have several questions that I believe are crucial for a deeper understanding of the proposed system.

First, could the authors provide more details on the resources needed to build the components of their system that they were unable to implement during the conference? What are the potential benefits of these components, and how would they improve the system's performance? This information would help to assess the feasibility of fully implementing the proposed system in the future.

Second, how would the authors address the issue of reviewer availability? What strategies would they use to ensure that there are enough reviewers available to review all submissions? This is a critical practical challenge that needs to be addressed for the system to be viable.

Third, have the authors considered the potential for bias in the author-assisted evaluation (AAE) and community-guided review (CGR) components of their system? How would they mitigate these biases? This is a crucial question, as the system's effectiveness depends on its ability to minimize bias.

Fourth, what is the impact of allowing reviewers to see the author's name on the objectivity of the reviews? Have the authors conducted any experiments to explore this aspect of their system? This is a critical factor that needs to be investigated to ensure the fairness and objectivity of the review process.

Fifth, how does the proposed system compare to other alternative peer review systems? What are its strengths and weaknesses relative to other approaches? This comparison is essential to understand the unique contributions of the proposed system and its potential impact on the academic community.

Finally, what are the limitations of the proposed system, and what are the potential areas for future improvement? This question is important to acknowledge the challenges of implementing the system in practice and to encourage further research in this area.

📊 Scores

Soundness: 2.0
Presentation: 1.5
Contribution: 2.0
Rating: 4.0

AI Review from ZGCA

📋 Summary

The paper proposes a hybrid peer-review system that integrates Author-Assisted Evaluation (AAE) and Community-Guided Review (CGR) within an LLM-mediated pipeline. In this system, authors provide self-scores and justifications (AAE), community members provide pre-review ratings (CGR), and the LLM synthesizes these signals to produce a review. Notably, the LLM is not provided the paper’s full text; instead, it "picks a random ordinate" and generates a caption and rating (Section 2: "Our work, however, provides a rating only by asking the LLM to pick a random ordinate without providing it with the full text of the paper."). The paper compares against two rule-based LLM baselines: a "single-LLM" reviewer and ReviewerGPT (Liu & Shah, 2023). Using data from AAAI 2024 and CoRLL 2024 (and ICML 2023 PC data for correlation analyses), the authors report that reviewers preferred reviews produced by their system on certain metrics (e.g., "Reproducibility and Quality," "Review Quality," and alignment), and that single-LLM reviews were more often rejected by the PC than the proposed system’s reviews. The evaluation includes survey data from 9 reviewers and various acceptance-rate and correlation analyses.

✅ Strengths

  • Addresses an important and timely problem: improving peer review quality and consistency in large AI conferences (Introduction).
  • Proposes a hybrid framework combining known signals (author self-assessment, community ratings) with an LLM to structure reviews (Algorithm 1; Definition 1).
  • Attempts multi-faceted evaluation across acceptance outcomes, reviewer preferences, and score correlations (Sections 3.1–3.4).
  • Engages with relevant related work on AAE and LLMs in review (Section 7), situating the contribution within current debates.

❌ Weaknesses

  • Core methodological limitation: the LLM does not process the paper’s content. The system "provides a rating only by asking the LLM to pick a random ordinate without providing it with the full text of the paper" (Section 2). This undermines claims about improved review quality from an LLM reviewer and conflates signal aggregation with reviewing.
  • Inadequate baselines and comparisons. The paper explicitly states it does not compare to the traditional review process (Introduction: "we do not compare our approach to the traditional peer review process"), yet some tables include HUMAN, and there is no strong baseline with LLMs that read the full paper text (ReviewerGPT is mentioned but the experimental setup is unclear and inconsistent).
  • Small-scale and underpowered human evaluation. The primary positive results on “Reproducibility and Quality” and “Review Quality” come from a survey of only 9 reviewers (Section 3.2), which raises serious questions about statistical power and generalizability.
  • Statistical reporting issues and inconsistencies. For example, a p-value of 0.495 is described as "marginally significantly lower" (Section 3.1); p-values and effect sizes are inconsistently reported; and some table entries and labels are confusing or contradictory.
  • Operationalization mismatch: metrics like "Reproducibility and Quality" (which the paper defines as potentially requiring running code) are assessed by a 9-person preference study and by an LLM that does not read the full text, which does not convincingly measure reproducibility or deep quality.
  • Clarity and presentation problems: algorithms contain typographical artifacts (e.g., "Ensure: ... author name =0"; "Output: ... Author Name =0"), duplicated or garbled tables, and ambiguous procedures (what exactly is a "random ordinate," how captions are generated without text). Overall organization and precision are lacking.
  • Ethical and procedural concerns: The system provides author identity to the LLM reviewer (Algorithm 1 steps 4–8; Definition 1 discussion), which conflicts with double-blind norms and may introduce bias. The paper does not adequately justify or mitigate this. The “Community-Guided Review” step risks popularity/brigading effects; recruitment, qualifications, and consent of community participants are not described in sufficient detail.
  • Causal claims about acceptance and rejection are weakly supported. Section 3.1 oscillates between claims of better performance and statements that the null cannot be rejected; the narrative does not cleanly reconcile these findings.

❓ Questions

  • Content access: How exactly does the LLM generate a caption (c = LLM_CAPTION(p)) without being given the full text? What inputs are provided to the LLM in practice? Please specify token-level inputs for AAE+CGR vs. baselines.
  • Random ordinate: What is a "random ordinate" operationally? Is it a figure, a sentence, a section header? How is it sampled? How many ordinates are used per paper? Is there any ablation on the number/selection of ordinates?
  • Baselines: Why not include a strong baseline where the LLM reads the full paper (as in ReviewerGPT or other LLM-as-a-judge settings)? Without this, how can we attribute gains to AAE/CGR rather than the specific (limited) LLM setup?
  • Human study power: The reviewer preference study uses n=9 raters. What is the number of paper-review pairs evaluated? What is the inter-rater reliability? Please provide power analyses or confidence intervals that appropriately reflect the small sample and multiple comparisons.
  • Metric definitions: How are "Reproducibility and Quality" and "Review Quality" concretely defined and measured in the survey? What exact rubrics did raters use? How does a 9-person survey capture reproducibility, which usually requires code execution or deep methodological assessment?
  • Statistical methodology: Several p-values in Tables 2–4 seem inconsistent with the narrative (e.g., calling p≈0.5 marginally significant). Please clarify your hypothesis tests, corrections for multiple comparisons, and report effect sizes with CIs consistently.
  • Use of author identity: Given double-blind norms, how do you justify revealing author names to the LLM? Did any of the involved conferences approve this? What safeguards were in place against identity-based bias? Please provide an ethics/IRB review or approval if applicable.
  • Community-Guided Review (CGR): Who are the "community members"? How were they recruited, screened for expertise, and incentivized? How do you prevent brigading or conflicts of interest, and how many ratings per paper were obtained?
  • Acceptance analyses: The text claims single-LLM reviews are more likely to be rejected, but Section 3.1 often finds "the null hypothesis cannot be rejected." Please reconcile these statements and provide a clear, consistent analysis, including base rates and confidence intervals.
  • Reproducibility: Will you release code, prompts, and de-identified data to allow others to reproduce your study? Can you provide an ablation to isolate the contributions of AAE vs. CGR vs. any LLM component?
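Several of the statistical questions above (consistent hypothesis tests, effect sizes, confidence intervals) can be made concrete with a standard two-proportion z-test on acceptance rates. The following is a generic sketch with made-up counts, not the paper's actual data:

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-proportion z-test on rates x1/n1 vs x2/n2, plus a 95% Wald
    confidence interval on the rate difference (the effect size)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_pooled
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    half = 1.96 * se_diff
    return z, (p1 - p2 - half, p1 - p2 + half)

# made-up example: 30/100 rejections for one system vs 45/100 for another
z, ci = two_proportion_ztest(30, 100, 45, 100)
print(round(z, 2), tuple(round(v, 3) for v in ci))
```

Reporting the z statistic together with the interval on the rate difference is the kind of consistent, effect-size-aware presentation the questions request.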

⚠️ Limitations

  • The system’s LLM component does not access the full paper content, limiting its ability to assess technical novelty, correctness, or empirical rigor.
  • Very small-scale human study (n=9) drives key claims; results may not generalize to broader reviewer populations or different subfields.
  • Unclear definitions and measurement of key metrics (e.g., reproducibility) relative to the experimental procedures.
  • Potential bias introduction by revealing author identity to the LLM reviewer, directly at odds with double-blind evaluation norms.
  • CGR risks gaming/brigading effects if community membership, incentives, and conflict-of-interest screening are not well controlled.
  • Statistical analysis and reporting inconsistencies reduce confidence (e.g., p-values misinterpreted; lack of effect sizes and multiple-comparison control).
  • No strong baseline where LLMs read full papers, limiting claims about improvements over standard LLM reviewing approaches.

🖼️ Image Evaluation

Cross‑Modal Consistency: 20/50

Textual Logical Soundness: 12/30

Visual Aesthetics & Clarity: 8/20

Overall Score: 40/100

Detailed Evaluation (≤500 words):

Visual ground truth (tables only; no figures found)

• Table 1/2 blocks (HTML): multi-panel acceptance/score summaries for AAAI 2024, CoRLL 2024, ICML(2023/2024), with columns REJECTED/ACCEPTED, MEAN/STDEV/p‑value; rows AAE+CGR (duplicated), HUMAN; legends unclear.

• Table with REJ/ACC/REJ+/ACC‑ (two versions): means, stdevs, p‑values; dataset blocks AAAI/CoRLL; no definitions in-table.

• Tables 3–4: labeled “opinion-based” and “correlations,” but show similar numeric blocks; inconsistent labels/years.

• Table 5–8: “Correlations” and “Rebuttal Alignment,” small two-column correlation tables with p‑values and R/τ; repeated/duplicated content, year mismatch.

1. Cross‑Modal Consistency

• Major 1: Acceptance‑rate claim vs table contents mismatch; tables show means/p‑values, not acceptance rates. Evidence: “Table 2: Acceptance rates …” and “singleellm … 3% and 3.6%” (Sec 3.1).

• Major 2: Numerically inconsistent correlations (near‑1 correlations with non‑significant p‑values). Evidence: Table 5 “p-value 0.365, R 0.990”.

• Major 3: Year/dataset mismatch (ICML 2023 in text vs ICML 2024 in tables). Evidence: Table caption block shows “ICML 2024” while Sec 2.1 uses “ICML 2023”.

• Major 4: Survey precision implausible for n=9 (very small ±0.007 SE). Evidence: Sec 3.2 “9 reviewers” and “0.427 ± 0.007”.

• Minor 1: Table numbering/order inconsistent (Table 2 described before Table 1).

• Minor 2: Repeated unlabeled “AAE+CGR” rows under the same dataset in a single table.
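Major 4 above can be checked with simple arithmetic: for n independent raters, the standard error of a mean is sd / sqrt(n), so the quoted ±0.007 for n = 9 implies an implausibly small spread across raters. The numbers come from the review's quotation; the calculation itself is a back-of-envelope sketch:

```python
import math

n = 9              # number of survey reviewers quoted in the review
reported_se = 0.007
# SE = sd / sqrt(n)  =>  sd = SE * sqrt(n)
implied_sd = reported_se * math.sqrt(n)
print(round(implied_sd, 3))  # a standard deviation of only ~0.021 across raters
```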

2. Text Logic

• Major 1: Core terms “owner” vs “author” used inconsistently, altering meaning of analyses. Evidence: Sec 3.4 “owner’s rating is higher than the author’s own score”.

• Major 2: Methodological claim lacks supporting experiment: “Pick Random Ordinates … better than randomly selecting a sentence” without evidence. Evidence: Sec 2 “This approach is better …”.

• Minor 1: “two‑enzyme analysis” is unclear/likely typo for a statistical method. Evidence: Sec 3.3 “two‑enzyme analysis”.

• Minor 2: Algorithms contain stray “=0” and broken step numbering. Evidence: Algorithm 1 “author name =0”; lines 10–12 misnumbered.

3. Figure Quality

• Major 1: Tables fail figure‑alone test; undefined abbreviations (REJ/ACC/REJ+/ACC‑), missing method labels, duplicated rows. Evidence: Multi‑panel HTML tables with no in‑table legends.

• Minor 1: Dataset year labels inconsistent within tables (ICML 2023/2024).

• Minor 2: Captions do not explain what MEAN represents (scores? rates?).

Key strengths:

• Ambitious system proposal (AAE+CGR) with multi‑conference datasets.

• Attempt to triangulate with PC scores, reviewer surveys, and correlations.

Key weaknesses:

• Severe figure–text mismatches (acceptance rates vs means; year labels).

• Statistical inconsistencies and implausible precision for small n.

• Terminology ambiguity (“owner” vs “author”) undermines analyses.

• Tables lack self‑contained legends/definitions; duplicated/misnumbered content.

Recommended fixes (highest impact):

• Recompute and clearly separate acceptance rates vs score means; fix captions.

• Standardize dataset naming (ICML 2023) and correct correlation tables.

• Define REJ/ACC/REJ+/ACC‑ in-table; label all rows (AAE+CGR, single‑LLM, HUMAN).

• Clarify “owner” terminology; detail “random ordinate” experiments or remove the claim.

📊 Scores

Originality: 2
Quality: 1
Clarity: 1
Significance: 1
Soundness: 1
Presentation: 1
Contribution: 1
Rating: 3

AI Review from SafeReviewer

📋 Summary

This paper introduces a novel system for academic peer review, aiming to enhance objectivity, efficiency, and community involvement. The proposed system integrates author-assisted evaluation (AAE) and community-guided review (CGR) into the traditional review process for AI conferences. The core idea is to leverage large language models (LLMs) to synthesize reviews, incorporating input from both authors and the broader community. Specifically, authors provide a self-assessment of their work, including a score and justification, which is then fed into the LLM reviewer; the system additionally gathers ratings from a community of reviewers, which are also incorporated into the LLM's review generation process.

The authors evaluated their system using data from three major AI conferences, comparing it against a baseline single-LLM review system. Their results suggest that the proposed system produces reviews of higher quality, with improved reproducibility and better alignment with reviewer opinions, leading to a lower rejection rate after major revisions. The authors also conducted a survey in which human reviewers preferred the system-generated reviews over single-LLM reviews.

The paper's central contribution lies in the specific combination of AAE and CGR within an LLM-based review system, aiming to address the limitations of traditional peer review, such as subjectivity and time consumption. The authors argue that their system offers a more robust and reliable approach to evaluating academic papers. However, the paper's presentation and experimental design have several limitations that need to be addressed to fully validate the proposed approach. The findings suggest that incorporating author input and community feedback can enhance the quality of LLM-generated reviews, but further work is needed to address the identified weaknesses and to explore the generalizability of the system to other domains.

✅ Strengths

The paper's primary strength lies in its exploration of a novel approach to academic peer review, integrating author-assisted evaluation and community-guided review within an LLM-based system. This is a timely and relevant area of research, given the increasing interest in using AI to augment or replace traditional processes, and the specific combination of AAE and CGR is a unique contribution, aiming to leverage the insights of both authors and the broader community to improve review quality.

The paper also presents empirical evidence suggesting that the proposed system generates reviews of higher quality, with improved reproducibility and better alignment with reviewer opinions, compared to a single-LLM baseline. The use of real conference data from AAAI, CoRLL, and ICML adds credibility to these findings, and the survey of human reviewers, which indicates a preference for the system-generated reviews, further supports the paper's claims.

Finally, the paper's focus on addressing the limitations of traditional peer review, such as subjectivity and time consumption, is commendable. The attempt to create a more objective and efficient review process is a valuable contribution to the field, and the focus on the practical application of LLMs in a real-world setting highlights the potential for AI to transform the way academic papers are reviewed, a promising direction for future research.

❌ Weaknesses

After a thorough examination of the paper, I have identified several significant weaknesses that warrant careful consideration.

First, the presentation is notably poor, with numerous formatting errors, typos, and inconsistencies that detract from readability and credibility. For instance, the title on page 2 is obscured, there is an extra space before a citation on page 2, "al" is used in place of "et al.", and unusual characters appear throughout. These issues, while individually minor, collectively undermine the paper's professional appearance and make its arguments harder to follow.

Second, the paper lacks a clear and detailed explanation of the proposed system. While the authors introduce Author-Assisted Evaluation (AAE) and Community-Guided Review (CGR), the precise mechanisms by which these components interact with the LLM reviewer are not elaborated. Without a comprehensive system diagram or a step-by-step description of the process, it is difficult to understand the flow of information and the role of each component.

Third, the experimental design has several limitations. The paper does not specify the LLM used, the prompts employed, or the instructions given to human participants, which makes it difficult to assess the validity and generalizability of the findings. It also lacks a clear description of the evaluation metrics, including how they are computed and what constitutes a good value; the absence of a baseline for the "owner" scores and of a detailed explanation of the Z-test further limits the interpretability of the results. Moreover, the reliance on a single baseline (a single LLM) narrows the scope of the evaluation: the system is never compared against other AI-assisted review systems or traditional human review, so its relative performance cannot be judged.

Fourth, the claims about generalizability to other domains are unsupported by empirical evidence. The system is evaluated exclusively in the context of AI conferences, and there is no discussion of how it might be adapted to fields with different review standards and practices.

Fifth, the paper does not adequately address potential bias. It does not discuss how the system mitigates biases present in the LLM's training data, how it handles situations where the community review is itself biased, or the ethical considerations of deploying such a system.

Finally, the writing is often unclear. Undefined jargon such as "owner score" and the lack of explanations for key concepts make the contributions hard to follow, and the paper's structure and organization compound the difficulty. Taken together, these weaknesses significantly limit the paper's impact and raise questions about the validity and generalizability of its findings.
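To illustrate the ambiguity being criticized: the paper's stated aggregation, averaging the author's score, the community ratings, and the LLM's score, admits a straightforward reading like the sketch below. The function name, the equal weighting, and the use of the community mean are all assumptions on my part; the paper never confirms these details, which is precisely the problem.

```python
from statistics import mean

def aggregate_review_score(author_score, community_scores, llm_score):
    """One plausible reading of the paper's aggregation rule.

    Equal weighting of the three components and a simple mean over
    the community pool are assumptions -- the paper only says the
    three inputs are 'averaged' without specifying weights.
    """
    community_mean = mean(community_scores)  # pool of community ratings
    return mean([author_score, community_mean, llm_score])

# Hypothetical example: author self-rates 8, five community reviewers
# average 6, and the LLM gives 7
score = aggregate_review_score(8.0, [5, 6, 7, 6, 6], 7.0)
```

Even this minimal sketch raises questions the paper leaves open, such as whether outlier community ratings are trimmed or whether the components are weighted unequally.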

💡 Suggestions

To address the identified weaknesses, I recommend several concrete improvements.

First, the authors should thoroughly revise the paper to correct all formatting errors, typos, and inconsistencies: the title should not be obscured, citations should be formatted correctly, and the text should be free of unusual characters and of grammatical or spelling errors.

Second, the authors should explain the proposed system in more detail, including a comprehensive system diagram illustrating the flow of information, a step-by-step description of how the AAE and CGR components interact with the LLM reviewer, and clear definitions, with examples, of key terms such as "owner score".

Third, the authors should document the experimental design: the LLM used, the prompts employed, the instructions given to human participants, and the evaluation metrics, including how each is computed and what constitutes a good value. They should also provide a baseline for the "owner" scores, explain the Z-test in detail, and consider additional baselines such as other AI-assisted review systems or traditional human review.

Fourth, the authors should support their generalizability claims empirically, for example by running experiments in fields with different review standards and practices, and should discuss how the system might be adapted to such domains and what challenges would arise.

Fifth, the authors should address the potential for bias: how the system mitigates biases present in the LLM's training data, how it handles situations where the community review is itself biased, and the ethical considerations of using LLMs in the peer review process.

Finally, the authors should improve the writing and organization, using clearer and more concise language, avoiding undefined jargon, and ensuring the paper is well structured and easy to follow. Addressing these weaknesses would significantly strengthen the paper and its contribution to the field.
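To illustrate the level of detail being requested for the statistical analysis, a generic two-sample Z statistic on mean review scores can be sketched as follows. This is only one possible formulation, with unpooled sample variances; the function name, the variance treatment, and the example scores are assumptions, not the authors' actual procedure, and the paper should state which variant it used.

```python
from math import sqrt
from statistics import mean, stdev

def two_sample_z(scores_a, scores_b):
    """Two-sample Z statistic for a difference in mean review scores.

    Uses unpooled sample variances; whether the authors pooled
    variances, assumed known variances, or used another variant is
    exactly the detail the paper should specify.
    """
    n_a, n_b = len(scores_a), len(scores_b)
    var_a, var_b = stdev(scores_a) ** 2, stdev(scores_b) ** 2
    return (mean(scores_a) - mean(scores_b)) / sqrt(var_a / n_a + var_b / n_b)

# Hypothetical example: system-generated vs. single-LLM review scores
z = two_sample_z([7, 8, 6, 7, 8], [5, 6, 5, 6, 5])
```

Reporting the statistic alongside sample sizes and the chosen significance threshold would resolve the interpretability concern raised above.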

❓ Questions

Several key questions arise from my analysis of this paper.

First, what specific mechanisms ensure that Author-Assisted Evaluation (AAE) does not introduce bias into the review process? Given that authors are evaluating their own work, how does the system mitigate inflated scores or overly positive self-assessments?

Second, how does Community-Guided Review (CGR) handle situations where the community feedback is itself biased or influenced by factors unrelated to the paper's quality? What steps ensure that the feedback is representative and reliable?

Third, which LLM was used in the experiments, and what exact prompts were employed to generate the reviews? These details are essential for reproducibility and for understanding the system's limitations.

Fourth, how was the survey of human reviewers conducted, and what instructions were given to participants? The survey methodology is needed to interpret the results and assess their validity.

Fifth, what are the limitations of the system, and under what conditions might it fail to produce high-quality reviews? The paper should acknowledge such scenarios explicitly.

Sixth, how does the system handle conflicts between the author's self-assessment and the community feedback, and what mechanisms resolve such disagreements so that the final review remains fair and objective?

Finally, what are the ethical implications of using LLMs in the peer review process, and how does the system address transparency, accountability, and fairness? The paper should include a thorough discussion of these considerations.

📊 Scores

Soundness: 2.0
Presentation: 1.0
Contribution: 1.5
Confidence: 3.25
Rating: 2.5
