📋 AI Review from DeepReviewer will be automatically processed
📋 AI Review from ZGCA will be automatically processed
The paper proposes MAAVRT, a decentralized zeroth-order algorithm for nonsmooth nonconvex stochastic optimization. MAAVRT combines (i) randomized smoothing via a two-point estimator (Equation (2)), (ii) an adaptive exponential moving average with an anchor-based correction for variance reduction (Equations (3)–(4)), and (iii) topology-aware consensus using a symmetric, doubly-stochastic mixing matrix W (Equation (5)). The analysis decomposes error into optimization, smoothing bias, estimator variance, and consensus disagreement, and claims sample complexity O(d δ^{-1} ε^{-3}) for achieving E[||g_μ(x)||] ≤ ε and consensus gap ≤ δ (Theorem 3.1; Equations (6)–(7)). Empirically, on IJCNN, COVTYPE, and A9A with 20 agents on a ring, MAAVRT is reported to reduce gradient norms and improve test accuracy versus DGFM (a zeroth-order baseline) and DPSGD (first-order, used as an upper bound), with fixed σ=0.1 and anchor period M=10 (Section 4). Practical considerations and limitations involve parameter tuning for σ, synchronous communication, and memory overhead (Section 5.2).
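For orientation, the two-point estimator described in the summary can be sketched as follows. This is a generic Eq.-(2)-style construction, not the paper's exact code; all names are illustrative, and the scaling shown makes the estimate unbiased for the gradient of the smoothed surrogate (not of f itself):

```python
import numpy as np

def two_point_estimator(f, x, mu, rng):
    """Generic two-point zeroth-order gradient estimate.

    Samples a uniform direction u on the unit sphere; the d/(2*mu) scaling
    makes the estimate unbiased for the gradient of the mu-smoothed
    surrogate f_mu, not for the gradient of f itself.
    """
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                       # uniform direction on the sphere
    return (d / (2.0 * mu)) * (f(x + mu * u) - f(x - mu * u)) * u

rng = np.random.default_rng(0)
f = lambda z: 0.5 * np.dot(z, z)                 # toy quadratic, grad f(x) = x
x = np.ones(5)
est = np.mean([two_point_estimator(f, x, 1e-3, rng) for _ in range(20000)],
              axis=0)
# averaging many single-sample estimates recovers the true gradient here
```

On this quadratic the estimator is exactly unbiased, so the empirical mean of many samples approaches the true gradient (a vector of ones).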
Cross‑Modal Consistency: 28/50
Textual Logical Soundness: 18/30
Visual Aesthetics & Clarity: 10/20
Overall Score: 56/100
Detailed Evaluation (≤500 words):
1. Cross‑Modal Consistency
• Major 1: Iteration budget mismatch. Text states 30,000 iterations, while all shown plots have Iteration axis 0–100, and claims like “<0.01 within 10,000 iters” cannot be verified. Evidence: “All methods … evaluated over 30,000 iterations” (Sec. 4.1 Baselines) vs. Fig. 1 panels with x‑axis 0–100.
• Major 2: Table‑learning‑rate mapping missing; the “Learning Rate” column shows “–” so rows cannot be associated with η∈{1e‑2,1e‑3,1e‑4}. Evidence: Table 1 first column entries are “-” while Sec. 4.1 says “We sweep learning rates over {10^{-2},10^{-3},10^{-4}}”.
• Minor 1: Extra “Final Performance Comparison” bar chart appears without a figure number or in‑text reference, creating ambiguity whether it is part of Fig. 1. Evidence: image titled “Final Performance Comparison” not cited in prose.
• Minor 2: Table 1 contains repeated values (e.g., 0.3197 across methods), suggesting copy/paste or aggregation ambiguity. Evidence: Table 1 last block shows identical “0.3197”.
2. Text Logic
• Major 1: Notation collision and inconsistency: σ denotes both smoothing radius and oracle noise variance; stationarity and theorem use μ, creating confusion about which parameter is tuned. Evidence: (A2) “noise variance … σ^2”; (A5) defines μ‑smoothing; Theorem 3.1 uses g_μ.
• Major 2: Misstatement about Eq. (2): “scaling factor d/σ^2 compensates for the smoothing‑induced bias” — scaling ensures unbiasedness for the smoothed gradient; it does not remove smoothing bias. Evidence: Sec. 3.2 sentence after Eq. (2).
• Major 3: Evaluation uses ∥∇f(x)∥ while the setting is zeroth‑order; the paper does not describe how this gradient norm is computed/estimated, so claims on stationarity cannot be audited. Evidence: Sec. 4.1 “Gradient norm ∥∇f(x)∥ …” with no estimator protocol.
• Minor 1: Variance‑reduction claim Var(ĝ)=O(1/t) stated without derivation or citation tailored to this construction. Evidence: After Eq. (4): “yielding variance decay O(1/t)”.
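As a point of reference for Major 2 above (a standard identity for uniform-on-the-sphere smoothing, not quoted from the paper under review): writing $f_\mu(x) = \mathbb{E}_{v \sim \mathcal{B}^d}[f(x + \mu v)]$, the two-point estimator satisfies

```latex
\nabla f_\mu(x)
  = \mathbb{E}_{u \sim \mathcal{S}^{d-1}}
    \left[ \frac{d}{2\mu}\,\bigl(f(x + \mu u) - f(x - \mu u)\bigr)\, u \right].
```

That is, the dimension-dependent scaling makes the estimate unbiased for the *smoothed* gradient $\nabla f_\mu$; the smoothing bias $\|\nabla f_\mu(x) - \nabla f(x)\| = O(\mu)$ is a separate term that must be controlled by the choice of $\mu$, which is exactly the distinction the review flags.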
3. Figure Quality
• Major 1: Final 8‑panel composite (covtype/a9a) is illegible at print size (axes/legends unreadable). Evidence: last image resolution ≈203×512 px.
• Minor 1: Panels lack (a)/(b)/(c) labels; captions don’t enumerate sub‑figures, hindering cross‑reference. Evidence: Fig. 1 shows three panels with no labels.
• Minor 2: Table 1 promises “highlighted in green,” but no highlighting appears. Evidence: Table 1 rendering.
📋 AI Review from SafeReviewer will be automatically processed
The paper introduces MAAVRT, a decentralized zeroth-order algorithm designed to address the challenges of nonsmooth and nonconvex optimization in a distributed setting. The key methodological approaches include the use of randomized smoothing to approximate gradients, adaptive variance reduction to improve the stability of gradient estimates, and a topology-aware consensus mechanism to ensure agreement among agents in the network. The theoretical analysis is structured around a modular error decomposition, which allows for a clear understanding of the contributions of each component to the overall convergence rate. The authors claim that MAAVRT achieves a sample complexity of $O(d\,\delta^{-1}\,\epsilon^{-3})$, matching the known lower bounds for this class of problems. Empirically, the paper demonstrates the effectiveness of MAAVRT on three standard machine learning datasets: IJCNN, COVTYPE, and A9A. The results show that MAAVRT outperforms a basic decentralized zeroth-order method (DGFM) and a decentralized first-order method (DPSGD) in terms of gradient norm, training loss, and test accuracy. However, the paper's significance is somewhat undermined by the limited scope of the experimental validation and the lack of comparison with more recent and relevant baselines. Despite these limitations, the paper makes a valuable contribution to the field of decentralized optimization, particularly in the context of zeroth-order methods and nonsmooth, nonconvex objectives.
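The topology-aware consensus mechanism mentioned above can be illustrated with a minimal gossip sketch. The mixing matrix below is a common textbook choice for a ring (each agent averages itself with its two neighbors), not necessarily the paper's exact $W$; all names are hypothetical:

```python
import numpy as np

def ring_mixing_matrix(n, w=1.0 / 3.0):
    """Symmetric, doubly-stochastic mixing matrix for a ring of n agents:
    each agent places weight w on each of its two neighbors and keeps the
    rest on itself."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 - 2.0 * w
        W[i, (i - 1) % n] = w
        W[i, (i + 1) % n] = w
    return W

W = ring_mixing_matrix(20)                             # 20 agents, as in Sec. 4
X = np.random.default_rng(2).standard_normal((20, 4))  # per-agent iterates
avg0 = X.mean(axis=0)                                  # network average (preserved)
for _ in range(200):
    X = W @ X                                          # one synchronous gossip round
# repeated mixing drives every agent's row toward the network average
```

Because $W$ is doubly stochastic, the network average is preserved at every round, and the consensus gap contracts at a rate governed by the second-largest eigenvalue of $W$, which is why the topology enters the paper's error decomposition.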
One of the paper's core strengths lies in its theoretical analysis, which is both novel and insightful. The modular error decomposition into optimization, smoothing, variance, and consensus components provides a clear and structured understanding of the algorithm's performance. This decomposition is particularly useful for identifying the key factors that influence convergence and for designing more efficient algorithms in the future. The authors' ability to achieve a near-optimal sample complexity that matches known lower bounds is a significant theoretical achievement, especially given the challenging nature of decentralized nonsmooth nonconvex optimization with zeroth-order queries. The paper also demonstrates the practical effectiveness of MAAVRT through empirical validation on standard benchmarks. The results show that MAAVRT outperforms DGFM and DPSGD in terms of gradient norm, training loss, and test accuracy, which is a promising indication of the algorithm's potential in real-world applications. The inclusion of a topology-aware consensus mechanism is another strong point, as it addresses the practical challenges of communication in decentralized networks. The adaptive variance reduction technique, which uses moving average buffers to reduce estimator variance online, is a technical innovation that enhances the stability and efficiency of the algorithm. Overall, the paper makes a valuable contribution to the field by integrating these techniques and providing a comprehensive theoretical and empirical analysis.
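The moving-average variance-reduction idea described above can be sketched generically. This is a stand-in illustration of an EMA with a periodic anchor reset in the spirit of Eqs. (3)-(4), not the paper's exact update; `beta`, `M`, and all names are hypothetical:

```python
import numpy as np

def ema_with_anchor(estimates, beta=0.9, M=10):
    """Exponential moving average of noisy gradient estimates, restarted
    from a fresh 'anchor' estimate every M steps."""
    avg, out = None, []
    for t, g in enumerate(estimates):
        if avg is None or t % M == 0:
            avg = g.copy()                          # anchor: restart the average
        else:
            avg = beta * avg + (1.0 - beta) * g     # EMA step
        out.append(avg.copy())
    return out

rng = np.random.default_rng(1)
true_g = np.ones(3)
raw = [true_g + 0.5 * rng.standard_normal(3) for _ in range(200)]
smoothed = ema_with_anchor(raw)
mse_raw = np.mean([(g - true_g) ** 2 for g in raw])
mse_smoothed = np.mean([(g - true_g) ** 2 for g in smoothed])
# the averaged sequence tracks the true gradient with lower error than
# the raw noisy estimates
```

The averaging trades a small lag for a substantial variance reduction between anchors, which is the mechanism behind the stability gains the review credits to the algorithm.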
Despite the paper's strengths, several limitations and weaknesses are evident. One of the most significant concerns is the limited novelty of the proposed algorithm. The core components of MAAVRT—randomized smoothing, adaptive variance reduction, and topology-aware consensus—are well-established techniques in the literature. The paper does not provide a detailed comparison with existing methods that combine similar techniques, such as those found in recent works on decentralized and zeroth-order optimization. This lack of comparison makes it difficult to assess the unique contributions of MAAVRT and its advantages over existing approaches. For instance, the paper cites works like Lin et al. (2024) and Kornowski & Shamir (2024) but does not explicitly discuss how MAAVRT's integration of these techniques differs from or improves upon these prior works. The theoretical analysis, while sound, is also somewhat limited in its scope. The paper focuses on the (C̃-stationarity) measure, which is a common choice in the literature, but does not provide a clear justification for why this measure is more meaningful than other stationarity measures, such as the Goldstein subdifferential. This omission leaves the reader questioning the practical relevance of the theoretical results. Additionally, the paper's claim of achieving near-optimal sample complexity is not fully supported by a detailed comparison with the most relevant lower bounds. The authors state that the complexity $O(d\,\delta^{-1}\,\epsilon^{-3})$ matches known lower bounds, but they do not provide a rigorous analysis of the constants and assumptions underlying these bounds.
Another critical weakness is the experimental setup. The experiments are conducted on relatively simple machine learning tasks with low-dimensional datasets (IJCNN: 22 features, COVTYPE: 54 features, A9A: 123 features). These datasets do not fully capture the challenges of high-dimensional and complex optimization problems, which are more common in real-world applications. The choice of baselines is also problematic. The paper compares MAAVRT with DGFM, a basic decentralized zeroth-order method, and DPSGD, a decentralized first-order method. However, DGFM is not a state-of-the-art algorithm, and DPSGD assumes the availability of gradients, which is not a fair comparison in the context of zeroth-order methods. The paper would benefit from including more recent and relevant baselines, such as those from Lin et al. (2024) and Kornowski & Shamir (2024), to provide a more comprehensive evaluation of MAAVRT's performance.
Furthermore, the paper lacks a detailed discussion of the practical implications of the algorithm, particularly in terms of computational and communication costs. While the authors mention the per-iteration computational cost of $O(d)$, they do not provide a thorough analysis of how the variance reduction window $M$ and the smoothing parameter $\tau$ impact these costs. This omission is significant, as the choice of these parameters can greatly affect the algorithm's efficiency and scalability. The paper also does not address the practical challenges of deploying MAAVRT in asynchronous communication settings, which are common in real-world decentralized systems. The assumption of synchronous communication is a limitation that could be mitigated by extending the analysis to asynchronous settings. Finally, the paper's presentation could be improved.
The introduction is somewhat vague and does not clearly articulate the specific limitations of existing methods that MAAVRT aims to address. The related work section is overly technical and does not provide a clear overview of the state-of-the-art. The paper would benefit from a more intuitive explanation of the algorithm's components and a clearer connection between the theoretical results and the empirical findings.
To address the identified weaknesses, I recommend several concrete and actionable improvements. First, the authors should provide a more detailed comparison with existing methods that combine similar techniques, such as those found in recent works on decentralized and zeroth-order optimization. This comparison should highlight the unique contributions of MAAVRT and explain how it differs from or improves upon these prior works. For example, the authors could discuss the specific innovations in the adaptive variance reduction technique and how it addresses the limitations of existing methods. Second, the theoretical analysis should be expanded to include a more comprehensive comparison with lower bounds. The authors should explicitly state the assumptions underlying the lower bounds and provide a detailed analysis of how MAAVRT's complexity matches these bounds. This would strengthen the claim of near-optimality and provide a clearer understanding of the algorithm's theoretical performance. Third, the experimental validation should be significantly enhanced. The authors should include experiments on more complex and high-dimensional tasks, such as non-convex bandit optimization or black-box optimization problems. These experiments would better demonstrate the algorithm's practical effectiveness and robustness. Additionally, the authors should compare MAAVRT with more recent and relevant baselines, such as those from Lin et al. (2024) and Kornowski & Shamir (2024). This would provide a more comprehensive evaluation of MAAVRT's performance and highlight its advantages over state-of-the-art methods. The authors should also conduct an ablation study to isolate the impact of each component of MAAVRT, such as the adaptive variance reduction and topology-aware consensus. This would help to understand the specific contributions of each technique and identify areas for further improvement. 
Fourth, the paper should include a detailed discussion of the practical implications of the algorithm, particularly in terms of computational and communication costs. The authors should analyze how the variance reduction window $M$ and the smoothing parameter $\tau$ affect these costs and provide guidelines for choosing these parameters in practice. This analysis should consider the trade-offs between computational efficiency, communication overhead, and convergence rate. Fifth, the authors should extend the analysis to asynchronous communication settings, which are more common in real-world decentralized systems. This extension would make the algorithm more practical and applicable to a wider range of problems. Finally, the paper's presentation should be improved. The introduction should clearly articulate the specific limitations of existing methods that MAAVRT aims to address. The related work section should be more accessible and provide a clear overview of the state-of-the-art. The authors should also include more intuitive explanations of the algorithm's components and a clearer connection between the theoretical results and the empirical findings. These improvements would make the paper more accessible and impactful.
1. Could the authors provide a more detailed comparison of MAAVRT with existing methods that combine similar techniques, such as those from Lin et al. (2024) and Kornowski & Shamir (2024)? Specifically, how does the adaptive variance reduction technique in MAAVRT differ from or improve upon the variance reduction methods used in these prior works?
2. What is the rationale behind the choice of the (C̃-stationarity) measure for theoretical analysis? Could the authors provide a clear justification for why this measure is more meaningful than other stationarity measures, such as the Goldstein subdifferential, in the context of decentralized nonsmooth nonconvex optimization?
3. How do the variance reduction window $M$ and the smoothing parameter $\tau$ impact the computational and communication costs of MAAVRT? Could the authors provide a detailed analysis of these trade-offs and guidelines for choosing these parameters in practice?
4. Could the authors extend the analysis to asynchronous communication settings? How would the convergence guarantees and sample complexity be affected in such settings, and what modifications to the algorithm would be necessary?
5. What are the specific challenges and limitations of applying MAAVRT to high-dimensional and complex optimization problems? Could the authors provide experimental results on such problems to better demonstrate the algorithm's practical effectiveness and robustness?