📋 AI Review from DeepReviewer will be automatically processed
📋 AI Review from ZGCA will be automatically processed
The paper introduces Coherent Masked Diffusion (CoMD), a discrete diffusion-style framework for next-token prediction that aims to improve coherence and efficiency over prior masked diffusion language models (MLD/MDLM). CoMD is claimed to differ from prior work in three ways: (1) it uses a fixed mask matrix independent of token identity and position (Section 3.1), (2) it adds a "coherent loss" term intended to guide predictions toward the "ground truth coherent probability" without extra samples per step (Section 3.2, Eqs. 2–8), and (3) it introduces a time-shift parameter to better capture the end of coherent text, applied both to the coherent loss gating and the deterministic sampler (Section 3.3, Eqs. 11–12). The paper asserts constant-time inference/training with respect to sequence length (multiple places, e.g., Abstract, Sections 3.1/3.4), reports speedups over MLD/MDLM, and presents results on tasks including Sudoku/MNIST (presented as "image" tasks) and text benchmarks such as logic puzzles and SlimPajama. Ablations (Table 2) aim to quantify the contributions of the fixed mask, coherent loss, and time shift.
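To make component (1) concrete, the reviewers' reading of a "fixed mask matrix independent of token identity and position" can be sketched as a boolean mask schedule precomputed once and reused across denoising steps, instead of resampled per step as in standard masked diffusion. All names, shapes, and the linear rate schedule below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fixed_mask(seq_len: int, num_steps: int, seed: int = 0) -> np.ndarray:
    """Precompute one boolean mask schedule of shape (num_steps, seq_len).

    Row t marks which positions remain masked at step t. Because the
    schedule ignores token identity, it can be built once and reused
    for every sequence (a hypothetical reading of CoMD's fixed mask).
    """
    rng = np.random.default_rng(seed)
    # Linearly decreasing masking rate: step 0 masks everything,
    # the final step masks nothing.
    rates = np.linspace(1.0, 0.0, num_steps)
    scores = rng.random(seq_len)          # one fixed random ranking
    return scores[None, :] < rates[:, None]

M = fixed_mask(seq_len=8, num_steps=4)
assert M.shape == (4, 8)
assert M[0].all()            # first step: all positions masked
assert not M[-1].any()       # last step: nothing masked
# Masks are nested: once a position is revealed, it stays revealed.
for t in range(3):
    assert np.all(M[t] >= M[t + 1])
```

Under this reading, the per-step cost of deciding what to mask is zero at inference time, which is where the paper's claimed efficiency gain would come from.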
Cross‑Modal Consistency: 18/50
Textual Logical Soundness: 10/30
Visual Aesthetics & Clarity: 8/20
Overall Score: 36/100
Detailed Evaluation (≤500 words):
1. Cross‑Modal Consistency
• Major 1: Claimed 7.3×/10.5× inference speedups are not supported by reported times. Evidence: Abstract: “achieves an inference speedup of 7.3x and 10.5x”; Table 1: MDLM 61.01 s vs CoMD 61.41 s.
• Major 2: “Constant” compute vs later O(nk²) contradiction. Evidence: Abstract/§3: “Both inference and training computation are constant…”; §3.4: “total FLOPs per second grow as O(nk²)”.
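A back-of-envelope check makes the contradiction concrete: under the paper's own §3.4 scaling, cost grows linearly in sequence length n, so it cannot be "constant" in n. The cost model and the value of k below are hypothetical (unit constant factor):

```python
def flops(n: int, k: int) -> int:
    """Toy O(n * k^2) cost model matching the paper's Section 3.4
    scaling claim; the constant factor is set to 1 for illustration."""
    return n * k * k

k = 64  # hypothetical block size
costs = [flops(n, k) for n in (512, 1024, 2048)]
# Doubling the sequence length doubles the cost.
assert costs[1] == 2 * costs[0]
assert costs[2] == 2 * costs[1]
```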
• Major 3: Task/caption mismatch (Sudoku vs Selenium). Evidence: §4.2: “The Sudoku benchmark…”; Table 1 caption: “Length 512 Selenium Test”.
• Major 4: Missing cross‑refs. Evidence: §3.1: “(Table ??)”; §3.4: “(Table ??)”.
• Minor 1: Model naming wavers (MDLM vs MLDLM), e.g., §1–2 use both for different concepts.
• Minor 2: Dataset names vary (Sudoku/Sudo/Suoko/Suko).
2. Text Logic
• Major 1: Loss/sampler definitions are ill‑posed or syntactically invalid. Evidence: Eq. (11) subtracts an indicator term; Eqs. (9–10) have mismatched indices/parentheses.
• Major 2: Undefined or nonstandard constructs block understanding. Evidence: §2 MLD: “hyper‑geometric interlacing distribution,” “k‑lines one‑hot vector” (not defined).
• Minor 1: Repetition/duplication (e.g., §4.1 “For image data…” appears twice).
• Minor 2: Claims of 75× parameter savings and 1.7–1.8× FLOP reductions lack concrete tables/figures specifying model sizes/training regimes.
3. Figure Quality
Visual ground truth
• Table 1: Columns Model, Acc.%↑, DA%↓, Time(s)↓; task caption says “Selenium Test” though text discusses Sudoku.
• Table 2: Labeled (a) accuracy on “Selenium” and (b) perplexity on SlimPajama; presented as a single HTML table with two “Model” blocks; rows “k Learned,” “ℓcoh,” “γ,” “ℓcoh-γ” are ablations.
• Major 1: DA% metric undefined/contradictory. Evidence: §4.2: “DA% (area under the OOD perplexity)”; Table 1 header “DA%↓” without units/definition; Eq. (13) is incoherent.
• Major 2: Table 2 merges two sub‑tables into one, obscuring which columns belong to (a) vs (b). Evidence: “Table 2: Ablations on Selenium and SlimPajama Test (a)… (b)…”.
• Minor 1: Abbreviations (↑/↓, DA%) and dataset names not explained in captions; readers cannot interpret tables alone.
• Minor 2: Inconsistent precision (e.g., “1.13±2.2”) suggests formatting/measurement issues.
📋 AI Review from SafeReviewer will be automatically processed
This paper introduces Coherent Masked Diffusion (CoMD), a novel framework designed to enhance the generation of coherent text using masked diffusion models. Building upon the existing Masked Language Diffusion (MLD) model, CoMD incorporates three key innovations: a fixed mask matrix, a coherent loss term, and a variable time parameter. The fixed mask matrix aims to improve computational efficiency by avoiding the need to recompute the mask at each step, while the coherent loss term is designed to guide the model towards generating more coherent text by explicitly modeling the coherent and incoherent distributions. The variable time parameter is introduced to better model the end of coherent text. The authors evaluate CoMD on a variety of tasks, including image generation (Sudoku), text generation (logic puzzles and SlimPajama), and language modeling. The empirical results demonstrate that CoMD achieves performance improvements over baseline models, particularly in terms of perplexity and accuracy on the Sudoku task. However, the paper also acknowledges that the model does not outperform autoregressive models in all tasks, and that the computational efficiency gains are primarily observed during inference, not training. The authors claim that CoMD is significantly more compute and parameter efficient than autoregressive models, but this claim is not fully supported by the experimental results. Overall, the paper presents an interesting approach to improving masked diffusion models for text generation, but it also highlights several areas that require further investigation and clarification.
I found several aspects of this paper to be commendable. The introduction of a fixed mask matrix is a notable contribution, as it addresses the computational overhead associated with dynamic masking in previous masked diffusion models. This approach has the potential to significantly improve the efficiency of these models, particularly during inference. The authors' attempt to explicitly model coherent and incoherent distributions through the introduction of a coherent loss term is also a valuable contribution. This approach demonstrates a clear understanding of the challenges associated with generating coherent text and represents a novel attempt to address these challenges within the context of masked diffusion models. The experimental results, particularly on the Sudoku task, demonstrate the effectiveness of the proposed approach. The authors show that CoMD achieves a lower perplexity and higher accuracy compared to baseline models, indicating that the proposed innovations are indeed beneficial. The paper also provides a detailed description of the proposed method, including the mathematical formulations and implementation details. This level of detail is essential for reproducibility and allows other researchers to build upon the work presented in this paper. Finally, the authors have conducted a comprehensive ablation study to evaluate the impact of each component of CoMD, which is crucial for understanding the contribution of each innovation.
While the paper presents several interesting ideas, I have identified several weaknesses that warrant discussion.

Firstly, the claim of significant computational efficiency over autoregressive models is not fully supported by the experiments. CoMD does achieve a speedup over other masked diffusion models, but the autoregressive comparison is indirect, since different models are used (e.g., TinyLLaMA vs. Llama 2). The paper also lacks a detailed analysis of training cost: the authors acknowledge that training computation grows as O(nk^2), yet provide no direct comparison of training time or FLOPs with autoregressive models, which makes the true computational benefit of CoMD hard to assess.

Secondly, the explanation of the coherent loss term and of coherent probability is insufficiently clear. The notation 'p_e(x^c_t)' is used without definition, and although the paper states that the coherent loss guides the model toward the 'ground truth coherent probability', it never explains what that ground-truth probability represents or how it is calculated. This makes the mechanism by which the loss improves coherence difficult to follow.

Thirdly, the term 'coherent' itself is never formally defined, and no explicit connection is made to the existing NLP literature on coherence, which makes it difficult to assess the validity of the paper's claims about improving coherence.
Fourthly, the experimental evaluation is limited. There is no comparison with a standard next-token-prediction autoregressive model on the same tasks, and the model is not evaluated on tasks that require long-range coherence, such as document-level text generation or story generation, which limits the generalizability of the findings.

Finally, the writing is not always clear and concise. Mathematical notation is sometimes used before it is introduced (e.g., 'p_e(x^c_t)'), 'MDLM' and 'MDLMs' are used inconsistently, and the statement that the model is trained for '~3Ok steps' is likely a typo. These issues, while individually minor, collectively affect the clarity and credibility of the paper.
Based on the identified weaknesses, I recommend several concrete improvements.

First, provide a detailed analysis of the computational cost of training CoMD, including a direct comparison of training time and FLOPs with autoregressive models, so the claimed efficiency can be assessed accurately.

Second, define the coherent probability precisely and explain in detail how the coherent loss term guides the model toward coherent text, including what the 'ground truth coherent probability' represents and how it is calculated.

Third, give a formal definition of 'coherent' and connect it explicitly to the existing NLP literature on coherence, so that the paper's coherence claims can be evaluated rigorously.

Fourth, include a comparison with a standard next-token-prediction autoregressive model on the same tasks, and evaluate the model on more challenging tasks that require long-range coherence, such as document-level text generation or story generation.

Fifth, revise the paper for clarity and conciseness: introduce all mathematical notation before use (in particular, define 'p_e(x^c_t)'), correct typos such as '3Ok steps' and the remaining grammatical errors, and resolve the inconsistent use of 'MDLM' versus 'MDLMs'.
These changes would significantly improve the quality and credibility of the paper.
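As an illustration of the coherent-loss clarification requested above, one plausible reading of a time-gated coherent loss is a cross-entropy term that is active only before a time-shift cutoff. The function name, the symbol gamma, and the indicator gating rule are all guesses at what the paper's Eq. (11) might intend, not definitions taken from the paper:

```python
import math

def coherent_loss(pred_probs, target, t, gamma):
    """Cross-entropy toward a 'coherent' target token, gated by an
    indicator 1[t < gamma] where gamma is the time-shift parameter.
    All names and the gating rule are illustrative assumptions."""
    if t >= gamma:          # after the cutoff, the term vanishes
        return 0.0
    return -math.log(pred_probs[target])

# Below the cutoff, the term penalizes low probability on the target...
assert coherent_loss([0.1, 0.9], 1, t=0.2, gamma=0.5) < 0.2
# ...and above the cutoff it contributes nothing.
assert coherent_loss([0.1, 0.9], 1, t=0.8, gamma=0.5) == 0.0
```

A definition of this form would resolve the ill-posedness flagged for Eq. (11) only if the authors specify it explicitly; the sketch is offered to show the level of precision the revision should reach.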
Several questions arise from my analysis of the paper.

1. What is the precise definition of the 'ground truth coherent probability' that the coherent loss term is designed to guide the model towards? How is this probability calculated, and what is its relationship to the actual coherent distribution of the data?
2. How does the fixed mask matrix affect the model's ability to capture long-range dependencies in text? Does it limit the model's ability to adapt to different sentence structures and lengths?
3. What is the impact of the variable time parameter on the model's performance? How does the choice of the time-shift parameter affect the model's ability to generate coherent text?
4. How does CoMD compare to state-of-the-art autoregressive models on more challenging tasks that require long-range coherence, such as document-level text generation or story generation?
5. What is the computational cost of training CoMD compared to autoregressive models, and what are the trade-offs between CoMD's computational efficiency and its performance on various tasks?
6. How does the model handle out-of-vocabulary tokens, and what is the impact of vocabulary size on performance?

These questions target core methodological choices and assumptions, and they seek clarification of critical uncertainties in the paper.