📋 AI Review from DeepReviewer will be automatically processed
📋 AI Review from ZGCA will be automatically processed
The paper proposes RAG-MCP, a retrieval-augmented framework for LLM tool selection within the Model Context Protocol (MCP) ecosystem. Instead of prompting the LLM with all available MCP tool schemas (which causes prompt bloat and selection errors), RAG-MCP maintains an external vector index of MCP metadata and retrieves the top-k relevant tools for a given user query. Only the selected tool's schema is injected into the LLM prompt, reducing context size and simplifying decisions (Section 3.2). The authors also design an 'MCP stress test' varying the number of candidate tools N up to 11,100 (Section 4.1) and evaluate on a web-search subset of MCPBench (Luo et al., 2025) against two baselines: Blank Conditioning and a simple keyword pre-filter ('Actual Match') (Sections 4.2.2–4.2.4). They report that RAG-MCP reduces prompt tokens (e.g., 1084 vs 2133.84) and improves tool selection accuracy (43.13% vs 13.62% for Blank Conditioning), claiming improved scalability and extensibility.
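The retrieval step summarized above can be sketched minimally. Everything below is an illustrative assumption, not the paper's implementation: the bag-of-words embedding stands in for the paper's Qwen-based dense encoder, and the registry contents and function names are hypothetical.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words vector; the paper uses a Qwen-based dense encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query, tool_index, k=1):
    # Rank every registered MCP description against the query;
    # only the top-k schemas would then be injected into the LLM prompt.
    q = embed(query)
    ranked = sorted(tool_index, key=lambda name: cosine(q, tool_index[name]), reverse=True)
    return ranked[:k]

# Hypothetical MCP registry: tool name -> embedded description.
registry = {
    "web_search": embed("search the web for pages matching a query"),
    "calculator": embed("evaluate arithmetic expressions"),
    "file_reader": embed("read local files from disk"),
}
```

Under this toy scoring, `retrieve_top_k("find web pages about climate policy", registry)` returns `["web_search"]`, and only that tool's schema would reach the prompt; the key design point is that the index lives outside the LLM, so new tools can be registered without retraining.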
Cross‑Modal Consistency: 32/50
Textual Logical Soundness: 22/30
Visual Aesthetics & Clarity: 13/20
Overall Score: 67/100
Detailed Evaluation (≤500 words):
1. Cross‑Modal Consistency
Visual ground truth
• Figure 1: two panels. (a) MCP: stacked blocks “Prompt/LLM/Query/(MCP) System/MCP Result/LLM Response”. (b) RAG‑MCP: adds “RAG‑MCP Tool MCP” before invocation; similar blocks after.
• Figure 2: three‑stage pipeline with icons: Query Encoding (Qwen Retriever) → Vector Search & Validation (Top‑k MCPs) → LLM Invocation (f()).
• Figure 3: heatmap; x‑axis “MCP Number”, y‑axis “Key MCP Position”; yellow=success, purple=failure; many axis ticks illegible.
• Major 1: Fig. 3 does not “plot selection accuracy vs N”; it’s a success heatmap by position, not an accuracy curve, conflicting with Sec. 4.1.2’s description and the claim of a “non‑monotonic trend” with N. Evidence: “Figure 3 plots selection accuracy and task success as N increases.” (Sec 4.1.2)
• Major 2: Evaluator inconsistency between sections. Evidence: “we employ Deepseek‑v3 as our evaluator” (Sec 4.2.1) vs “Judgment… Llama‑based verifier (‘Llama as Judge’)” (Sec 4.2.3).
• Minor 1: Naming flip between “RAG‑MCP” and “MCP‑RAG” across text and table (Table 1 header shows “MCP‑RAG”; method elsewhere “RAG‑MCP”).
• Minor 2: Fig. 2 caption says “Qwen‑max” while figure text shows “Qwen Retriever”.
• Minor 3: The Abstract/Conclusion claim an “over 50%” token reduction, but Table 1’s 1084 vs 2133.84 tokens is a ≈49.2% reduction.
2. Text Logic
• Major 1: Stress‑test setup (Sec 4.1.1) presents all N schemas in‑prompt (no retrieval), yet Sec 4.1.2 attributes mitigation to MCP/RAG in that same test, blurring whether retrieval was used in the stress test. Evidence: “present the model with N MCP schemas… ask it to select…,” (Sec 4.1.1) vs “MCP‑RAG greatly mitigates prompt bloat.” (Sec 4.1.2)
• Minor 1: Claims of “Resource Efficiency… instantiate only the selected MCP” and “Multi‑Turn Robustness” lack quantitative evidence in Experiments.
• Minor 2: Baseline naming (“Blank” vs “Blank Conditioning”) and dataset description (MCPBench subset) are terse, limiting reproducibility details.
3. Figure Quality
• Major 1: Fig. 3 is illegible at print size: axis ticks/numbers for N up to 11,100 are too small, colorbar text is thin, and values are hard to read. Evidence: Figure 3.
• Minor 1: Fig. 3 lacks a clear legend entry for “success rate per N”; consider adding aggregated accuracy curve or marginal plots.
• Minor 2: Fig. 1/2 are readable but would benefit from call‑outs linking boxes to section names (e.g., “Retrieval (Sec 3.2)”).
📋 AI Review from SafeReviewer will be automatically processed
This paper introduces RAG-MCP, a framework designed to enhance the efficiency of large language models (LLMs) when utilizing external tools, specifically within the Model Context Protocol (MCP). The core idea is to employ a retrieval-augmented generation approach to address the issue of prompt bloat, which occurs when the LLM is presented with descriptions of all available tools. Instead of providing the LLM with all tool descriptions, RAG-MCP uses a retriever model to select the most relevant tools based on the user's query, thereby reducing the prompt size and simplifying the tool selection process. The method involves three key steps: retrieval, where a lightweight LLM encodes the user's task and performs a semantic search over an index of MCP metadata; validation, where a few-shot example query is generated and tested for compatibility; and invocation, where the selected tool's schema is injected into the LLM's prompt. The authors evaluate RAG-MCP on a web search benchmark, comparing it against two baselines: 'Blank Conditioning,' where the LLM is provided with all tool descriptions, and 'Actual Match,' which uses keyword matching to filter the tool set. The results demonstrate that RAG-MCP significantly reduces prompt token usage and improves tool selection accuracy compared to the baselines. The paper highlights the potential of RAG-MCP to enable more scalable and accurate tool integration for LLMs, particularly in scenarios with a large number of available tools. However, the paper also acknowledges limitations, such as the performance degradation of the retrieval mechanism when the tool registry scales to thousands of MCPs. Overall, the paper presents a practical approach to addressing prompt bloat in tool-using LLMs, but it also reveals areas where further research and development are needed to enhance its robustness and generalizability.
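The three-step flow described above (retrieval → validation → invocation) might be skeletonized as follows. All bodies are placeholder assumptions for illustration: a word-overlap scorer stands in for semantic search, and the validator is an accept-all stub, whereas the paper's validation generates and tests a few-shot example query.

```python
def retrieve(query, index, k=3):
    # Stage 1: semantic search over MCP metadata (stand-in: word overlap).
    words = set(query.lower().split())
    ranked = sorted(index, key=lambda n: len(words & set(index[n].lower().split())),
                    reverse=True)
    return ranked[:k]

def validate(tool_name):
    # Stage 2: the paper probes each candidate with a generated few-shot
    # example query; this stub simply accepts every candidate.
    return True

def invoke(query, tool_name):
    # Stage 3: only the selected tool's schema enters the LLM prompt.
    return f"[prompt containing only the {tool_name} schema] {query}"

def rag_mcp(query, index):
    # End-to-end flow: first validated candidate wins.
    for candidate in retrieve(query, index):
        if validate(candidate):
            return invoke(query, candidate)
    raise LookupError("no compatible MCP found")

index = {
    "web_search": "search the web for a query",
    "calculator": "evaluate arithmetic expressions",
}
```

The fallback loop over ranked candidates is one way to soften the single-point-of-failure concern raised later: if validation rejects the top candidate, the next one is tried rather than failing outright.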
I found the paper's core idea of applying retrieval-augmented generation to the problem of tool selection in LLMs to be quite intuitive and well-motivated. The issue of prompt bloat, especially when dealing with a large number of tools, is a significant challenge, and the proposed RAG-MCP framework offers a practical solution. The paper clearly articulates the problem and the proposed approach, making it easy to follow. The experimental results, while limited in scope, do demonstrate the effectiveness of RAG-MCP in reducing prompt token usage and improving tool selection accuracy compared to the baselines. The authors' decision to use an external vector index for tool metadata allows for the addition of new tools without retraining the LLM, which is a significant advantage in dynamic environments. The paper also acknowledges the limitations of the approach, such as the performance degradation of the retrieval mechanism at scale, which I appreciate. The inclusion of a validation step, where a few-shot example query is generated and tested for compatibility, is a good addition that enhances the robustness of the framework. The paper's focus on the Model Context Protocol (MCP) is also a strength, as it addresses a specific and relevant standard for tool integration in LLMs. The paper's clear articulation of the problem, the proposed solution, and the experimental results makes it a valuable contribution to the field.
After a thorough examination of the paper, I've identified several weaknesses that warrant careful consideration. Firstly, the paper's technical contribution is somewhat limited. The core idea of using retrieval to select relevant tools is a straightforward application of standard RAG techniques. As Reviewer 1 pointed out, the method essentially involves embedding the tool descriptions and performing a maximum inner product search against the user query embedding. The paper does not explore more sophisticated retrieval methods, such as those incorporating tool functionality or user context, which could lead to more robust and accurate tool selection. The validation step, while a good addition, is also a relatively simple approach and does not fully address the complexities of tool validation. The paper's reliance on a single retriever model (Qwen-max) without exploring alternatives or providing a detailed analysis of its impact on the overall performance is another weakness.

Secondly, the paper's experimental evaluation is limited in several aspects. The comparison is only made against two baselines, 'Blank Conditioning' and 'Actual Match,' which are relatively simple. The absence of comparisons with more sophisticated tool-selection methods, such as fine-tuned models or other retrieval-based approaches, makes it difficult to assess the true effectiveness of RAG-MCP. As Reviewer 2 noted, the 'Blank Conditioning' baseline, which involves providing the LLM with all tool descriptions, is not a realistic or practical approach, and its poor performance is somewhat expected. The lack of ablation studies to analyze the impact of different components of the RAG-MCP framework, such as the retriever model or the validation mechanism, is also a significant limitation. The paper also lacks a detailed analysis of the retrieval performance itself, such as precision and recall, which would provide more insight into the strengths and weaknesses of the approach. The evaluation is also limited to a single benchmark (MCPBench) and a single base LLM (Qwen-max), which raises concerns about the generalizability of the findings. As Reviewer 3 pointed out, the paper does not address the potential for error propagation due to the reliance on the retriever for correct tool selection. If the retriever fails, the entire process is likely to fail. The paper also lacks a detailed analysis of the computational cost of the proposed method, especially the overhead introduced by the retrieval step.

The paper's discussion of the limitations of the approach is also somewhat brief. While the paper acknowledges the performance degradation of the retrieval mechanism at scale, it does not delve into the specific reasons for this degradation or propose concrete solutions. The paper also does not discuss the potential for bias in the retrieval process or how this might affect the fairness and reliability of the tool selection process.

Finally, the paper's writing could be improved in several areas. The description of the validation step is brief and lacks detail, and the paper could benefit from a more in-depth discussion of the limitations of the approach and potential avenues for future research. The paper also lacks a clear explanation of the metrics used in the evaluation, and the analysis of the experimental results could be more detailed. The typo in the y-axis label of Figure 3, as pointed out by Reviewer 2, is also a minor but noticeable issue.
Based on the identified weaknesses, I recommend several concrete improvements for this paper. First, the authors should significantly expand their experimental evaluation. This includes comparing RAG-MCP against a wider range of baselines, including more sophisticated tool selection methods like fine-tuned models or other retrieval-based approaches. It would also be beneficial to include ablation studies to analyze the impact of different components of the RAG-MCP framework, such as the retriever model, the validation mechanism, and the size of the retrieved tool set. The authors should also provide a more detailed analysis of the retrieval performance itself, including metrics like precision and recall.

Second, the authors should explore more sophisticated retrieval techniques beyond simple semantic similarity. This could involve incorporating tool functionality, user context, or task-specific information into the retrieval process. For example, they could explore methods that learn tool embeddings based on their usage patterns or that incorporate user feedback to refine tool selection over time. The authors should also consider the computational cost of the retrieval process and explore methods to optimize it for real-time applications.

Third, the authors should investigate the robustness of RAG-MCP to noisy or ambiguous queries. This could involve testing the framework with a wider range of queries and analyzing its performance under different conditions. The authors should also explore methods to improve the validation step, such as using more sophisticated validation techniques or incorporating user feedback to refine the validation process.

Fourth, the authors should address the potential for error propagation due to the reliance on the retriever for correct tool selection. This could involve exploring methods to mitigate the impact of retrieval errors, such as using a fallback mechanism or incorporating uncertainty into the tool selection process. The authors should also provide a more detailed analysis of the computational cost of the proposed method, especially the overhead introduced by the retrieval step.

Fifth, the authors should expand the scope of their evaluation beyond the MCP framework. While MCP is a relevant context, the core idea of tool selection via retrieval could be applied to other tool-using LLM frameworks. Discussing the generalizability of RAG-MCP and how it could be adapted to different tool interfaces would significantly broaden the paper's impact. For example, the paper could explore how the retrieval mechanism could be modified to handle tools with different input and output specifications, or how the validation step could be adapted to different types of tool functionality.

Sixth, the authors should provide a more detailed analysis of the limitations of the approach and potential avenues for future research. This includes discussing the potential for bias in the retrieval process and how this might affect the fairness and reliability of the tool selection process.

Finally, the authors should improve the clarity and completeness of their writing. The description of the validation step should be expanded, and the paper should provide a more detailed explanation of the metrics used in the evaluation. The analysis of the experimental results should also be more detailed, and the paper should be carefully proofread to eliminate typos and other errors.
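The recommendation to report retrieval precision and recall could be operationalized with something as simple as the following sketch; the function name and the gold-label format (a set of relevant tools per query) are assumptions for illustration, not something the paper defines.

```python
def precision_recall_at_k(retrieved, relevant, k):
    # `retrieved` is the retriever's ranked list of tool names for one query;
    # `relevant` is the set of gold tools known to solve that query.
    top = retrieved[:k]
    hits = sum(1 for tool in top if tool in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, with a single gold tool ranked first out of k=3 retrieved candidates, precision@3 is 1/3 and recall@3 is 1.0; averaging these over the benchmark queries would isolate retriever quality from downstream LLM errors.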
After reviewing the paper, I have several questions that I believe would benefit from further clarification. First, I'm curious about the specific details of the validation step. The paper mentions generating a few-shot example query and testing its response, but I'd like to understand more about how this is done in practice. What kind of examples are generated, and how is the response evaluated? How does the system handle cases where the example query is not representative of the actual task?

Second, I'm interested in the choice of the retriever model. The paper uses Qwen-max, but it doesn't provide a detailed justification for this choice. What other retriever models were considered, and why was Qwen-max selected? How sensitive is the performance of RAG-MCP to the choice of the retriever model?

Third, I'd like to know more about the computational cost of the proposed method. The paper mentions that RAG-MCP reduces prompt token usage, but it doesn't provide a detailed analysis of the computational overhead introduced by the retrieval step. How does the computational cost of RAG-MCP compare to other tool selection methods? Is the retrieval step fast enough for real-time applications?

Fourth, I'm curious about the generalizability of RAG-MCP to other tool-using LLM frameworks. The paper focuses on the MCP framework, but how could the proposed method be adapted to different tool interfaces? What modifications would be needed to handle tools with different input and output specifications?

Fifth, I'd like to understand more about the limitations of the approach. The paper mentions the performance degradation of the retrieval mechanism at scale, but what are the specific reasons for this degradation? What other limitations does the approach have, and what are the potential avenues for future research?

Finally, I'm interested in the potential for bias in the retrieval process. How might bias in the tool descriptions or the retriever model affect the fairness and reliability of the tool selection process? What steps could be taken to mitigate this bias?