2510.0001 RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation v1

🎯 ICAIS2025 Accepted Paper

🎓 Meta Review & Human Decision

Decision: Accept

Meta Review:

AI Review from DeepReviewer

📋 Summary

This paper introduces RAG-MCP, a novel framework designed to enhance the tool selection capabilities of large language models (LLMs) when dealing with a growing number of external tools. The core problem addressed is the issue of 'prompt bloat,' where the prompt size increases significantly as more tools are added, leading to decreased performance and efficiency. To mitigate this, RAG-MCP integrates Retrieval-Augmented Generation (RAG) with the Model Context Protocol (MCP). The framework operates in three key stages: retrieval, validation, and invocation. First, a retriever, based on a lightweight LLM, identifies the most relevant tools for a given user query by performing a semantic search over an external index of tool descriptions. Second, a validation step is employed to ensure the compatibility of the selected tools by generating a few-shot example query and testing the tool's response. Finally, only the best-matching tool description is provided to the LLM for execution. The authors present an 'MCP stress test' to evaluate the framework's performance under varying tool loads. The experimental results, primarily focused on a web search task, demonstrate that RAG-MCP significantly improves tool selection accuracy and reduces prompt token usage compared to baseline methods like 'Blank Conditioning' and 'Actual Match.' The paper argues that RAG-MCP offers a scalable and efficient solution for managing extensive toolsets in LLMs, ensuring that these models can effectively leverage external tools without suffering from performance degradation due to prompt bloat. While the paper presents a promising approach, my analysis reveals several areas where further investigation and refinement are needed to fully realize the potential of RAG-MCP.
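The three-stage flow described above (retrieve over an external tool index, validate the candidate, inject only the winning schema) can be sketched in a few lines. Everything here is a hypothetical illustration, not the paper's implementation: `embed` is a toy hashed bag-of-words stand-in for a real sentence-embedding model, and `validate` is a placeholder for the few-shot compatibility check.

```python
import zlib
import numpy as np

DIM = 256

def embed(text):
    """Toy stand-in for a sentence embedding: hash words into a
    fixed-size bag-of-words vector and L2-normalize it."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class RagMcpPipeline:
    """Illustrative retrieval -> validation -> invocation flow
    (hypothetical names, not the paper's code)."""

    def __init__(self, tools):
        # tools: dict mapping tool name -> natural-language description
        self.names = list(tools)
        self.matrix = np.stack([embed(desc) for desc in tools.values()])

    def retrieve(self, query, k=3):
        # Semantic search over the external tool index; dot product of
        # unit-normalized vectors == cosine similarity.
        scores = self.matrix @ embed(query)
        order = np.argsort(scores)[::-1][:k]
        return [self.names[i] for i in order]

    def validate(self, name, query):
        # Placeholder for the paper's few-shot compatibility check:
        # generate an example query, invoke the tool, inspect the response.
        return True

    def select(self, query):
        # Only the first candidate that passes validation is kept; its
        # schema alone would be injected into the LLM prompt.
        for name in self.retrieve(query):
            if self.validate(name, query):
                return name
        return None
```

The point the paper makes is visible in the structure: the LLM prompt never sees the whole registry, only the output of `select`.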

✅ Strengths

I find the core idea of integrating RAG with MCP to address prompt bloat in LLM tool usage to be a significant strength of this paper. Managing an ever-increasing number of tools is a practical challenge, and the proposed RAG-MCP framework offers a compelling, well-motivated solution. The three-stage approach of retrieval, validation, and invocation is well structured and logically sound, and using semantic retrieval to identify relevant tools before presenting them to the LLM is an effective way to reduce prompt size and improve tool selection accuracy. The experimental results, although limited in scope, demonstrate that RAG-MCP reduces prompt tokens and improves tool selection accuracy compared to the baseline methods. The 'MCP stress test' is a valuable addition: it shows how the model's performance degrades with increasing prompt length, underscoring the importance of the retrieval mechanism. The paper is well written, explains the framework and its components clearly, and uses figures and tables effectively to illustrate the design and findings. The choice of a lightweight LLM for retrieval is also sensible, balancing efficiency and accuracy. Overall, the paper presents a novel and practical approach to a relevant problem in the deployment of LLMs with external tools, and the initial results are promising.

❌ Weaknesses

My analysis reveals several significant weaknesses in this paper, primarily concerning the scope of the experimental evaluation, the lack of comparison with existing methods, and the insufficient detail provided for certain components.

First, the experimental evaluation is notably limited in scope. The primary dataset is the web search subset of MCPBench, which restricts the generalizability of the findings. The paper does not explore the framework's performance with tools that have complex input/output schemas or require multi-step interactions, which are common in real-world applications. This narrow focus limits the conclusions that can be drawn about the framework's robustness, and it remains unclear how RAG-MCP would perform in more diverse scenarios.

Second, the paper lacks a thorough comparison with existing tool selection methods. The experimental section compares RAG-MCP only against 'Blank Conditioning' and 'Actual Match', which are not representative of the state of the art. The paper does not compare against methods that use fine-tuning, in-context learning, or other retrieval mechanisms for tool selection, such as Toolformer or ReAct, even though these are mentioned in the related work section. Nor does it compare against methods specifically designed for managing large toolsets, a critical omission given the paper's focus on scalability. These absences make it difficult to gauge the relative performance and advantages of RAG-MCP.

Third, the paper provides insufficient detail about the validation step. It describes the step as generating a few-shot example query and testing the tool's response, but does not specify how the synthetic examples are generated or how the response is judged. It also omits any analysis of the time overhead the validation step introduces, how that overhead scales with the number of tools or the complexity of the validation logic, and the step's potential failure modes, such as flawed generated queries or ambiguous tool schemas.

Fourth, the paper does not adequately address the computational overhead of the retrieval and validation steps. While it reports 'Avg Prompt Tokens' and 'Avg Completion Tokens', it provides no metrics for time spent on retrieval, validation, or invocation. This makes it difficult to assess the practical efficiency of the approach relative to simpler methods, or to weigh accuracy gains against computational cost.

Finally, while the paper demonstrates improvements in tool selection accuracy, it does not establish a clear link between those improvements and the quality of the final output or task completion. Beyond the reported success rate, there is no analysis of how better tool selection translates into better task outcomes, nor any evaluation of the quality of the generated text under RAG-MCP versus the baselines. These weaknesses significantly limit the paper's conclusions and the practical applicability of the proposed framework. My confidence in these identified issues is high, as each is supported by the absence of specific information or analyses within the paper.

💡 Suggestions

To address the identified weaknesses, I recommend several concrete improvements.

First, the authors should significantly broaden the scope of the experimental evaluation. It should include a diverse set of tools that vary not only in functionality but also in the complexity of their input/output schemas and the number of steps required for successful execution, such as tools for structured data manipulation, database interaction, or complex multi-step workflows. The evaluation should also include a quantitative analysis of performance across tool categories, highlighting any variations in accuracy or efficiency; this would give a more nuanced picture of the framework's strengths, weaknesses, and applicability in real-world scenarios.

Second, the authors should compare against a wider range of tool selection methodologies, including baselines based on semantic similarity, reinforcement learning, or hybrid approaches. A comparison with a simple keyword-matching baseline would help isolate the benefits of the RAG component, while a comparison with a more complex retrieval mechanism, such as a graph-based approach, would expose the limits of the current RAG implementation. The current baselines are too simplistic to demonstrate the effectiveness of the proposed approach, and the evaluation's reliance on a single dataset makes it difficult to assess robustness; a more diverse set of tasks and datasets is needed.

Third, the authors need to explain the validation component in far more detail: how the synthetic examples are generated (including the specific prompts or templates used), how the LLM evaluates the results, and by what criteria a tool is judged valid. How does the framework handle a tool that returns an error or unexpected output? What happens when a tool is valid but the synthetic example is not representative of the actual use case? The authors should also quantify the time overhead of validation, including how it scales with the number of tools and the complexity of the validation process, and consider cheaper alternatives, such as a lightweight syntax check or a compatibility check on tool input/output types, while discussing the trade-off between the robustness validation provides and the latency it adds.

Fourth, the authors should analyze the computational overhead of the retrieval and validation steps, with a breakdown of time spent on retrieval, validation, and invocation, compared against simpler methods such as directly prompting the LLM with all available tools, and ideally across different hardware configurations. They should discuss the trade-offs between accuracy gains and computational cost and give guidance on when the approach is most appropriate.

Finally, the authors should analyze the relationship between tool selection and overall task completion in more depth, including metrics that measure the quality of the final outputs, such as accuracy or completeness of the results; the potential for error propagation, where an incorrect tool selection leads to an incorrect final answer; and the impact of RAG-MCP on end-to-end efficiency, including the time required for tool selection and invocation. A more detailed analysis of these factors would give a more complete picture of the benefits and limitations of the proposed approach.
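The per-stage timing breakdown suggested above is cheap to instrument. A minimal sketch (hypothetical; the `time.sleep` calls stand in for the actual retrieval, validation, and LLM calls, which the paper does not report timings for):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_times = defaultdict(float)

@contextmanager
def timed(stage):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_times[stage] += time.perf_counter() - start

# Hypothetical usage around the three stages of the pipeline:
with timed("retrieval"):
    time.sleep(0.01)   # stand-in for the vector search
with timed("validation"):
    time.sleep(0.005)  # stand-in for the few-shot sanity check
with timed("invocation"):
    time.sleep(0.02)   # stand-in for the LLM call

# Milliseconds per stage, ready for a table alongside token counts.
report = {stage: round(t * 1000, 1) for stage, t in stage_times.items()}
```

Reporting such a table next to 'Avg Prompt Tokens' would make the efficiency claims directly checkable.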

❓ Questions

Based on my analysis, I have several questions that I believe are crucial for a deeper understanding of the RAG-MCP framework.

First, how does the performance of RAG-MCP scale with the number of available tools? Is there a point at which the retrieval mechanism becomes less effective, and how does the validation step's overhead change with a larger toolset?

Second, can the framework be extended to scenarios where multiple tools must be invoked sequentially or in parallel to complete a task? The current framework focuses on single-tool selection, but many real-world tasks require multiple tools.

Third, how does the choice of retriever model (Qwen-max in this case) affect overall performance? Would a different model yield better retrieval accuracy or efficiency, and what are the trade-offs in choosing a particular retriever?

Fourth, what are the potential failure modes of the validation step, and how can they be mitigated? Specifically, how does the framework handle cases where the generated validation query is itself flawed, or where tools have ambiguous or poorly documented schemas?

Fifth, how does RAG-MCP compare to other existing methods for managing large toolsets in LLMs, particularly those that use fine-tuning or in-context learning? A more detailed comparison would clarify the specific advantages and disadvantages of RAG-MCP relative to other approaches.

Sixth, what is the impact of the validation step on overall system latency, and how can it be optimized? A detailed analysis of its time overhead, including how it scales with the number of tools and the complexity of the validation logic, is crucial.

Finally, while the paper shows improvements in tool selection accuracy, how do these translate into the quality of final outputs or task completion rates? A more detailed analysis of the relationship between tool selection and overall task completion would be valuable.

📊 Scores

Soundness: 2.5
Presentation: 2.25
Contribution: 1.75
Rating: 3.75

AI Review from ZGCA

📋 Summary

The paper proposes RAG-MCP, a retrieval-augmented framework for LLM tool selection within the Model Context Protocol (MCP) ecosystem. Instead of prompting the LLM with all available MCP tool schemas (which causes prompt bloat and selection errors), RAG-MCP maintains an external vector index of MCP metadata and retrieves the top-k relevant tools for a given user query. Only the selected tool's schema is injected into the LLM prompt, reducing context size and simplifying decisions (Section 3.2). The authors also design an 'MCP stress test' varying the number of candidate tools N up to 11,100 (Section 4.1) and evaluate on a web-search subset of MCPBench (Luo et al., 2025) with baselines: Blank Conditioning, simple keyword pre-filter ('Actual Match'), and RAG-MCP (Sections 4.2.2–4.2.4). They report that RAG-MCP reduces prompt tokens (e.g., 1084 vs 2133.84) and improves tool selection accuracy (43.13% vs 13.62% for Blank), claiming improved scalability and extensibility.

✅ Strengths

  • Addresses a real and growing practical problem—prompt bloat and decision overhead when many MCP tools are available (Sections 1.1, 3.1).
  • Clear architectural idea: decouple tool discovery from generation via retrieval over an MCP index; only inject the selected schema into the prompt (Sections 3.2–3.3).
  • Promising empirical signal: on MCPBench web-search tasks, reported accuracy improvement (43.13% vs 13.62%) and substantial prompt-token reduction (Table 1).
  • Extensibility argument is plausible: indexing new tools without retraining, and on-demand activation of MCP servers (Sections 1.2, 3.2, 3.4).
  • The stress-test framing connects to needle-in-a-haystack intuitions and highlights scaling challenges (Sections 3.1, 4.1, 5.1).

❌ Weaknesses

  • Reproducibility concerns: The main evaluation uses 'MCPBench (Luo et al., 2025)' as a held-out testbed (Section 4.2.1), but its accessibility and specification are unclear. Without a public dataset, code, or detailed protocol, the results are hard to verify.
  • Statistical rigor is limited: Table 1 reports point estimates without confidence intervals, standard deviations, or significance tests (Section 4.2.4). No ablations on top-k, different retrievers, index types, or validation strategies.
  • Evaluation scope is narrow: primarily web-search MCPs; limited analysis of generalization to other tool types or multi-tool workflows (noted only as future work in Section 6).
  • Baselines are modest: no comparison against prior retrieval-based tool/API systems (e.g., Gorilla, which retrieves API documentation) or stronger context-compression and retrieval formulations.
  • Stress-test methodology is unclear: Section 4.1 describes presenting N MCP schemas in the prompt (consistent with Blank conditioning), yet Section 4.1.2 claims results about RAG-MCP, which would retrieve rather than include all N. It is not explicit whether and how RAG-MCP is evaluated under the same scaling protocol.
  • Inconsistent evaluation details: Section 4.2.1 mentions DeepSeek-v3 as evaluator; Section 4.2.3 mentions a 'Llama-based verifier (Llama as Judge)'. The judging protocol needs clarification and consistency.
  • Some claims are asserted but not empirically substantiated: e.g., resource efficiency and on-demand server activation are discussed (Section 3.2) but not measured (latency/throughput).
  • Novelty is incremental: applying RAG to tool schema retrieval is a natural extension of existing retrieval-based approaches; related work like Gorilla uses retrieval over API docs for tool use. The paper would benefit from a tighter positioning versus such work.

❓ Questions

  • MCPBench details: Is the MCPBench web-search subset public? Please provide dataset composition, task definitions, ground-truth MCP annotation procedure, and access instructions. If not public, can you release a reproducible subset?
  • Judging protocol: Section 4.2.1 states DeepSeek-v3 is used as evaluator, while Section 4.2.3 cites 'Llama as Judge'. Which is used in Table 1? How are ties/disagreements resolved? Please share prompts, scoring rubrics, and calibration results for the judge model.
  • Stress test design: In Section 4.1 you present N MCP schemas in the prompt, but in Section 4.1.2 you draw conclusions about RAG-MCP. Did the stress test evaluate RAG-MCP or only Blank conditioning? If it included RAG-MCP, how was N operationalized (e.g., size of the retrieval corpus vs number of schemas in prompt), and how did you ensure parity across methods?
  • Retrieval specifics: What embedding model(s), index type (e.g., FAISS/HNSW), similarity metric, and top-k were used? Did you tune k? Do results change with k>1 (injecting top-3 or top-5 schemas)?
  • Validation step (Section 3.2): How often is the sanity-check used? What is the success/failure criterion? How do you handle false positives/negatives? Please quantify overhead and benefits.
  • Baselines: Can you add comparisons with retrieval-based API/doc approaches (e.g., Gorilla-like retrieval of tool docs), BM25 lexical retrieval, or hybrid retrieval? Also consider context compression baselines that rewrite/cluster tool schemas.
  • Statistical reporting: Please add confidence intervals or standard deviations across the 20 trials, and report per-task accuracy to assess variance. How many random seeds were used?
  • Efficiency metrics: Beyond prompt-token counts, can you report end-to-end latency, retrieval time, and server activation overheads to substantiate the resource-efficiency claims?
  • Generality: Have you evaluated beyond web search (e.g., code execution MCPs, data connectors)? What happens with multi-tool chains (planner+retriever)?
  • Scale claims: You mention 4,400+ MCP servers (Section 2.3) but stress test varies N up to 11,100 (Section 4.1). How were additional distractors constructed? Are they realistic?
  • Safety: What guardrails prevent unsafe or unintended tool invocations after retrieval? Any authentication/permission checks integrated into the retrieval/validation stage?
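The BM25 lexical baseline requested above is cheap to prototype, which strengthens the case for including it. A minimal, self-contained scorer (an illustrative sketch of standard Okapi BM25, not anything from the paper):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc (tool description) against the query with Okapi BM25.
    Tokenization is naive whitespace splitting, for illustration only."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency of each term.
    df = Counter()
    for d in tokenized:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Running this over the MCP descriptions and comparing top-1 accuracy against the dense retriever would directly answer whether the semantic index earns its complexity.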

⚠️ Limitations

  • Dependence on retriever quality: Performance hinges on semantic retrieval precision; degradation at very large registry sizes is acknowledged (Sections 5.1–5.2).
  • Single-tool assumption: The current pipeline injects only one selected MCP schema; multi-tool workflows are left to future work (Section 6).
  • Evaluation breadth: Results primarily focus on web search and do not establish generality to diverse MCP categories or multi-turn complex tasks.
  • Reproducibility: Reliance on MCPBench without clear public access hinders independent verification; missing code/index details and hyperparameters.
  • Potential negative impacts: Incorrect retrieval or validation could invoke wrong or unsafe tools; no explicit discussion of permissioning, auditing, or user-consent mechanisms.
  • System trade-offs: Retrieval adds system complexity and latency; not quantified. At extreme scale (N>~1000), retrieval precision and throughput degrade (Section 5.1).

🖼️ Image Evaluation

Cross‑Modal Consistency: 32/50

Textual Logical Soundness: 22/30

Visual Aesthetics & Clarity: 13/20

Overall Score: 67/100

Detailed Evaluation (≤500 words):

1. Cross‑Modal Consistency

Visual ground truth

• Figure 1: two panels. (a) MCP: stacked blocks “Prompt/LLM/Query/(MCP) System/MCP Result/LLM Response”. (b) RAG‑MCP: adds “RAG‑MCP Tool MCP” before invocation; similar blocks after.

• Figure 2: three‑stage pipeline with icons: Query Encoding (Qwen Retriever) → Vector Search & Validation (Top‑k MCPs) → LLM Invocation (f()).

• Figure 3: heatmap; x‑axis “MCP Number”, y‑axis “Key MCP Position”; yellow=success, purple=failure; many axis ticks illegible.

• Major 1: Fig. 3 does not “plot selection accuracy vs N”; it’s a success heatmap by position, not an accuracy curve, conflicting with Sec. 4.1.2’s description and the claim of a “non‑monotonic trend” with N. Evidence: “Figure 3 plots selection accuracy and task success as N increases.” (Sec 4.1.2)

• Major 2: Evaluator inconsistency between sections. Evidence: “we employ Deepseek‑v3 as our evaluator” (Sec 4.2.1) vs “Judgment… Llama‑based verifier (‘Llama as Judge’)” (Sec 4.2.3).

• Minor 1: Naming flip between “RAG‑MCP” and “MCP‑RAG” across text and table (Table 1 header shows “MCP‑RAG”; method elsewhere “RAG‑MCP”).

• Minor 2: Fig. 2 caption says “Qwen‑max” while figure text shows “Qwen Retriever”.

• Minor 3: “over 50%” token reduction claimed in the Abstract/Conclusion vs Table 1’s 1084 vs 2133.84, which is ≈ 49.2%.
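The Minor 3 discrepancy is a one-line arithmetic check on the two token counts the review quotes from Table 1:

```python
prompt_tokens_rag = 1084.00    # Table 1, RAG-MCP (as quoted in this review)
prompt_tokens_blank = 2133.84  # Table 1, Blank Conditioning

# Relative reduction in average prompt tokens.
reduction = 1 - prompt_tokens_rag / prompt_tokens_blank
pct = round(reduction * 100, 1)  # 49.2 -- just under the claimed "over 50%"
```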

2. Text Logic

• Major 1: Stress‑test setup (Sec 4.1.1) presents all N schemas in‑prompt (no retrieval), yet Sec 4.1.2 attributes mitigation to MCP/RAG in that same test, blurring whether retrieval was used in the stress test. Evidence: “present the model with N MCP schemas… ask it to select…,” (Sec 4.1.1) vs “MCP‑RAG greatly mitigates prompt bloat.” (Sec 4.1.2)

• Minor 1: Claims of “Resource Efficiency… instantiate only the selected MCP” and “Multi‑Turn Robustness” lack quantitative evidence in Experiments.

• Minor 2: Baseline naming (“Blank” vs “Blank Conditioning”) and dataset description (MCPBench subset) are terse, limiting reproducibility details.

3. Figure Quality

• Major 1: Fig. 3 illegible at print size: axis ticks/numbers for N up to 11100 are too small; colorbar text thin; values hard to read. Evidence: Figure 3.

• Minor 1: Fig. 3 lacks a clear legend entry for “success rate per N”; consider adding aggregated accuracy curve or marginal plots.

• Minor 2: Fig. 1/2 are readable but would benefit from call‑outs linking boxes to section names (e.g., “Retrieval (Sec 3.2)”).

Key strengths:

  • Clear motivation for retrieval‑based tool selection; concise architecture diagram (Fig. 2).
  • Empirical table shows sizable accuracy gains and prompt‑token savings.

Key weaknesses:

  • Cross‑section inconsistencies (evaluator, method naming) and stress‑test ambiguity.
  • Fig. 3 both mismatched to text and partially illegible, blocking verification of scaling claims.
  • Some claims (efficiency, multi‑turn robustness) lack direct measurements.

Recommendations:

  • Replace Fig. 3 with: (a) accuracy vs N line plot; (b) token/latency vs N; keep heatmap as supplementary with readable ticks.
  • Resolve evaluator and naming inconsistencies; state clearly whether retrieval is used in the stress test.
  • Quantify resource efficiency and multi‑turn performance.

📊 Scores

Originality: 2
Quality: 2
Clarity: 2
Significance: 3
Soundness: 2
Presentation: 2
Contribution: 2
Rating: 4

AI Review from SafeReviewer

📋 Summary

This paper introduces RAG-MCP, a framework designed to enhance the efficiency of large language models (LLMs) when utilizing external tools, specifically within the Model Context Protocol (MCP). The core idea is to employ a retrieval-augmented generation approach to address the issue of prompt bloat, which occurs when the LLM is presented with descriptions of all available tools. Instead of providing the LLM with all tool descriptions, RAG-MCP uses a retriever model to select the most relevant tools based on the user's query, thereby reducing the prompt size and simplifying the tool selection process. The method involves three key steps: retrieval, where a lightweight LLM encodes the user's task and performs a semantic search over an index of MCP metadata; validation, where a few-shot example query is generated and tested for compatibility; and invocation, where the selected tool's schema is injected into the LLM's prompt. The authors evaluate RAG-MCP on a web search benchmark, comparing it against two baselines: 'Blank Conditioning,' where the LLM is provided with all tool descriptions, and 'Actual Match,' which uses keyword matching to filter the tool set. The results demonstrate that RAG-MCP significantly reduces prompt token usage and improves tool selection accuracy compared to the baselines. The paper highlights the potential of RAG-MCP to enable more scalable and accurate tool integration for LLMs, particularly in scenarios with a large number of available tools. However, the paper also acknowledges limitations, such as the performance degradation of the retrieval mechanism when the tool registry scales to thousands of MCPs. Overall, the paper presents a practical approach to addressing prompt bloat in tool-using LLMs, but it also reveals areas where further research and development are needed to enhance its robustness and generalizability.

✅ Strengths

I found the paper's core idea of applying retrieval-augmented generation to the problem of tool selection in LLMs to be quite intuitive and well-motivated. The issue of prompt bloat, especially when dealing with a large number of tools, is a significant challenge, and the proposed RAG-MCP framework offers a practical solution. The paper clearly articulates the problem and the proposed approach, making it easy to follow. The experimental results, while limited in scope, do demonstrate the effectiveness of RAG-MCP in reducing prompt token usage and improving tool selection accuracy compared to the baselines. The authors' decision to use an external vector index for tool metadata allows for the addition of new tools without retraining the LLM, which is a significant advantage in dynamic environments. The paper also acknowledges the limitations of the approach, such as the performance degradation of the retrieval mechanism at scale, which I appreciate. The inclusion of a validation step, where a few-shot example query is generated and tested for compatibility, is a good addition that enhances the robustness of the framework. The paper's focus on the Model Context Protocol (MCP) is also a strength, as it addresses a specific and relevant standard for tool integration in LLMs. The paper's clear articulation of the problem, the proposed solution, and the experimental results makes it a valuable contribution to the field.

❌ Weaknesses

After a thorough examination of the paper, I've identified several weaknesses that warrant careful consideration.

First, the paper's technical contribution is somewhat limited. The core idea of using retrieval to select relevant tools is a straightforward application of standard RAG techniques: as Reviewer 1 pointed out, the method essentially embeds the tool descriptions and performs a maximum inner product search against the user query embedding. The paper does not explore more sophisticated retrieval methods, such as those incorporating tool functionality or user context, which could yield more robust and accurate tool selection. The validation step, while a good addition, is likewise a relatively simple mechanism that does not fully address the complexities of tool validation, and the paper relies on a single retriever model (Qwen-max) without exploring alternatives or analyzing its impact on overall performance.

Second, the experimental evaluation is limited in several respects. Comparisons are made only against two simple baselines, 'Blank Conditioning' and 'Actual Match'; the absence of more sophisticated tool selection methods, such as fine-tuned models or other retrieval-based approaches, makes it difficult to assess the true effectiveness of RAG-MCP. As Reviewer 2 noted, 'Blank Conditioning', which provides the LLM with all tool descriptions, is not a realistic or practical approach, and its poor performance is expected. There are no ablation studies on the components of the framework, such as the retriever model or the validation mechanism, and no analysis of the retrieval performance itself, such as precision and recall. The evaluation is also restricted to a single benchmark (MCPBench) and a single base LLM (Qwen-max), which raises concerns about the generalizability of the findings.

Third, as Reviewer 3 pointed out, the paper does not address error propagation arising from its reliance on the retriever for correct tool selection: if the retriever fails, the entire process is likely to fail. The paper also lacks a detailed analysis of the computational cost of the method, especially the overhead introduced by the retrieval step.

Fourth, the discussion of limitations is brief. While the paper acknowledges that retrieval performance degrades at scale, it neither explains the specific reasons for this degradation nor proposes concrete solutions, and it does not discuss potential bias in the retrieval process or how bias might affect the fairness and reliability of tool selection.

Finally, the writing could be improved in several places. The description of the validation step is terse, the metrics used in the evaluation are not clearly explained, and the analysis of the experimental results could be more detailed. The typo in the y-axis label of Figure 3, as pointed out by Reviewer 2, is a minor but noticeable issue.
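The missing precision/recall analysis noted above requires only per-query ground-truth tool annotations. A minimal sketch of the metrics (hypothetical helper, not from the paper):

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Retrieval quality for one query.

    retrieved: ranked list of tool names returned by the retriever.
    relevant:  set of ground-truth tools for the query.
    """
    top = retrieved[:k]
    hits = sum(1 for name in top if name in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Averaging these over the benchmark queries, for several values of k, would show whether failures come from the retriever or from the downstream selection.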

💡 Suggestions

Based on the identified weaknesses, I recommend several concrete improvements for this paper. First, the authors should significantly expand their experimental evaluation. This includes comparing RAG-MCP against a wider range of baselines, including more sophisticated tool selection methods like fine-tuned models or other retrieval-based approaches. It would also be beneficial to include ablation studies to analyze the impact of different components of the RAG-MCP framework, such as the retriever model, the validation mechanism, and the size of the retrieved tool set. The authors should also provide a more detailed analysis of the retrieval performance itself, including metrics like precision and recall. Second, the authors should explore more sophisticated retrieval techniques beyond simple semantic similarity. This could involve incorporating tool functionality, user context, or task-specific information into the retrieval process. For example, they could explore methods that learn tool embeddings based on their usage patterns or that incorporate user feedback to refine tool selection over time. The authors should also consider the computational cost of the retrieval process and explore methods to optimize it for real-time applications. Third, the authors should investigate the robustness of RAG-MCP to noisy or ambiguous queries. This could involve testing the framework with a wider range of queries and analyzing its performance under different conditions. The authors should also explore methods to improve the validation step, such as using more sophisticated validation techniques or incorporating user feedback to refine the validation process. Fourth, the authors should address the potential for error propagation due to the reliance on the retriever for correct tool selection. This could involve exploring methods to mitigate the impact of retrieval errors, such as using a fallback mechanism or incorporating uncertainty into the tool selection process. 
The authors should also provide a more detailed analysis of the computational cost of the proposed method, especially the overhead introduced by the retrieval step. Fifth, the authors should expand the scope of their evaluation beyond the MCP framework. While MCP is a relevant context, the core idea of tool selection via retrieval could be applied to other tool-using LLM frameworks. Discussing the generalizability of RAG-MCP and how it could be adapted to different tool interfaces would significantly broaden the paper's impact. For example, the paper could explore how the retrieval mechanism could be modified to handle tools with different input and output specifications, or how the validation step could be adapted to different types of tool functionality. Sixth, the authors should provide a more detailed analysis of the limitations of the approach and potential avenues for future research. This includes discussing the potential for bias in the retrieval process and how this might affect the fairness and reliability of the tool selection process. Finally, the authors should improve the clarity and completeness of their writing. The description of the validation step should be expanded, and the paper should provide a more detailed explanation of the metrics used in the evaluation. The analysis of the experimental results should also be more detailed, and the paper should be carefully proofread to eliminate typos and other errors.
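The fallback suggested in the fourth point could, in its simplest form, be a confidence gate on the retrieval scores: commit to a tool only when the top score is high enough and clearly separated from the runner-up. A minimal sketch, assuming a scored candidate list; the threshold and margin values are illustrative assumptions, not from the paper.

```python
def select_tool(scored, threshold=0.5, margin=0.1):
    """Pick the top-scoring tool only when its score clears `threshold`
    and beats the runner-up by at least `margin`; otherwise return None
    so the caller can fall back (answer without tools, or ask the user).
    `scored` is a list of (tool_name, retrieval_score) pairs."""
    ranked = sorted(scored, key=lambda t: t[1], reverse=True)
    if not ranked or ranked[0][1] < threshold:
        return None  # no candidate is confident enough
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None  # ambiguous: two tools are nearly tied
    return ranked[0][0]
```

A gate like this would also give the authors a natural knob for the robustness experiments suggested above: sweeping the threshold trades retrieval precision against fallback frequency.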

❓ Questions

After reviewing the paper, I have several questions that I believe would benefit from further clarification. First, I'm curious about the specific details of the validation step. The paper mentions generating a few-shot example query and testing the tool's response to it, but I'd like to understand more about how this is done in practice. What kind of examples are generated, and how is the response evaluated? How does the system handle cases where the example query is not representative of the actual task? Second, I'm interested in the choice of the retriever model. The paper uses Qwen-max, but it doesn't provide a detailed justification for this choice. What other retriever models were considered, and why was Qwen-max selected? How sensitive is the performance of RAG-MCP to the choice of the retriever model? Third, I'd like to know more about the computational cost of the proposed method. The paper mentions that RAG-MCP reduces prompt token usage, but it doesn't provide a detailed analysis of the computational overhead introduced by the retrieval step. How does the computational cost of RAG-MCP compare to other tool selection methods? Is the retrieval step fast enough for real-time applications? Fourth, I'm curious about the generalizability of RAG-MCP to other tool-using LLM frameworks. The paper focuses on the MCP framework, but how could the proposed method be adapted to different tool interfaces? What modifications would be needed to handle tools with different input and output specifications? Fifth, I'd like to understand more about the limitations of the approach. The paper mentions the performance degradation of the retrieval mechanism at scale, but what are the specific reasons for this degradation? What other limitations does the approach have, and what are the potential avenues for future research? Finally, I'm interested in the potential for bias in the retrieval process.
How might bias in the tool descriptions or the retriever model affect the fairness and reliability of the tool selection process? What steps could be taken to mitigate this bias?
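For reference, the simplest reading of the validation step I am asking about would look something like the following. The probe template and the pass criterion here are my own assumptions, since the paper leaves both unspecified (it says an LLM synthesizes the example query), and the tools are toy stand-ins for MCP server calls.

```python
def make_probe(tool_description):
    # Hypothetical probe generator; RAG-MCP synthesizes this with an LLM.
    return f"Give one example of: {tool_description}"

def validate(tool_fn, tool_description):
    """Keep a candidate tool only if it answers a synthetic probe query
    without erroring and returns a non-empty response. This pass criterion
    is a placeholder for the paper's (unspecified) response check."""
    try:
        response = tool_fn(make_probe(tool_description))
    except Exception:
        return False  # unreachable or crashing tool fails validation
    return isinstance(response, str) and bool(response.strip())

# Toy stand-ins for MCP server calls.
def echo_tool(query):
    return f"handled: {query}"

def broken_tool(query):
    raise RuntimeError("unreachable server")
```

Even in this minimal form, my question stands: a probe that type-checks says little about whether the tool's behavior actually matches the user's task.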

📊 Scores

Confidence: 3.75
Rating: 2.5
