2510.0051 COMD: COHERENT MASKED DIFFUSION v1

🎯 ICAIS2025 Accepted Paper

🎓 Meta Review & Human Decision

Decision:

Accept

Meta Review:

AI Review from DeepReviewer


📋 Summary

This paper introduces Coherent Masked Diffusion (CoMD), a novel framework designed to enhance the learning of coherent and incoherent language in masked language models (MLMs). Building upon the Masked Language Diffusion (MLD) model, CoMD incorporates three key innovations: a fixed mask matrix, a coherent loss term, and a variable time parameter. These modifications aim to improve the efficiency and effectiveness of learning coherent language while maintaining computational efficiency. The fixed mask matrix ensures that the mask is independent of the token and its position, which simplifies the denoising process. The coherent loss term is designed to optimize the probability of coherent text generation without requiring additional samples per training step. The variable time parameter guides the coherent probability towards the ground truth coherent probability, further enhancing the model's performance. Empirically, CoMD outperforms previous methods on multiple coherent benchmarks, demonstrating significant speedups and parameter efficiency compared to autoregressive models. The paper provides a thorough background on MLMs, MLD, and masked diffusion language models (MDLMs), which helps readers understand the context and significance of the proposed CoMD framework. However, the paper could benefit from a more detailed discussion of the limitations of the fixed mask matrix, particularly in capturing long-range dependencies and its sensitivity to different initializations. Additionally, the paper lacks specific details on the implementation of the denoising network and the training procedures, which are crucial for reproducibility. The paper also lacks a qualitative analysis of the generated text, which would provide valuable insights into the model's performance. Despite these limitations, CoMD represents a significant step forward in the field of natural language processing, particularly in the context of generating coherent text.

✅ Strengths

I find the paper to be well-written and easy to follow, which is a significant strength. The proposed method, CoMD, is novel and well-motivated, addressing key limitations in existing masked language models. The introduction of a fixed mask matrix, a coherent loss term, and a variable time parameter are innovative contributions that enhance the model's ability to generate coherent text. The empirical evaluations are comprehensive, demonstrating that CoMD outperforms previous methods on multiple coherent benchmarks. The results show that CoMD achieves significant speedups and parameter efficiency, which are important practical advantages. The paper also provides a thorough background on MLMs, MLD, and MDLMs, which helps readers understand the context and significance of the proposed CoMD framework. The authors' discussion of the computational efficiency of CoMD, including inference speedup and FLOPs per second, is particularly noteworthy. The paper's contributions are significant in the field of natural language processing, especially in the context of generating coherent text. The authors have done a commendable job in presenting their method and results in a clear and concise manner, making the paper accessible to a broad audience.

❌ Weaknesses

Despite the paper's strengths, there are several limitations that need to be addressed. First, the paper could benefit from a more detailed discussion of the limitations of the fixed mask matrix. The fixed mask matrix, while computationally efficient, might struggle to capture long-range dependencies inherent in natural language, where the relevance of tokens can span large segments of text. This could lead to a failure to capture nuanced contextual relationships, especially in longer sequences. The paper does not explore the sensitivity of the model to different mask matrix initializations, which could impact the stability and performance of the model. For instance, the fixed mask matrix is described as independent of the token and its position, but the paper does not provide a formal definition of the mask matrix, including its dimensions, sparsity, and how it is generated. This lack of detail makes it difficult to assess the novelty and effectiveness of the proposed method. The authors should investigate adaptive masking strategies that can dynamically adjust the mask based on the input sequence, potentially capturing long-range dependencies more effectively. They should also explore the sensitivity of the model to different mask matrix initializations and provide guidelines for selecting appropriate ones; this would enhance the robustness and reliability of the model. My confidence in this issue is high, as the paper does not provide the details needed to fully understand the implications of the fixed mask matrix.

Second, the paper lacks specific details on the implementation of the denoising network and the training procedures. The architecture of the denoising network, including the number of layers, the type of activation functions, and the dimensionality of the hidden layers, is not fully described. The training procedure, including the optimization algorithm, the learning rate schedule, and the batch size used, is also not detailed. These details are crucial for reproducibility and for understanding the practical aspects of the proposed method. The reliance on external references for core architectural details hinders the paper's self-contained reproducibility. My confidence in this issue is high, as the paper explicitly defers some of these details to external references, and the denoising network architecture is not clearly specified.

Third, the paper lacks a qualitative analysis of the generated text. While the paper provides quantitative metrics like perplexity and accuracy, it does not offer examples of text generated by CoMD compared to existing methods. This omission makes it difficult to assess the coherence and fluency of the generated text. A qualitative analysis would provide valuable insights into the model's performance and help highlight the incoherencies and inconsistencies that CoMD aims to address. My confidence in this issue is high, as the paper focuses solely on quantitative results without any qualitative examples.

Fourth, the paper's evaluation is limited to relatively small-scale models. The experiments primarily use 4-layer transformers and 8-layer convolutional networks, which are small compared to state-of-the-art large language models with billions of parameters. This makes it difficult to assess the scalability and effectiveness of CoMD in more complex scenarios. The authors should explore the performance of CoMD on larger models to provide a more comprehensive picture of its potential for real-world applications. My confidence in this issue is high, as the experimental setup consistently uses smaller models and the paper does not evaluate CoMD at larger scale.

Finally, the paper could benefit from a more detailed analysis of the computational cost of CoMD, including a breakdown of the time and memory requirements for both training and inference. While the paper provides some comparisons of inference time and FLOPs per second, a more thorough analysis would clarify the practical advantages and limitations of the proposed method. My confidence in this issue is medium, as the paper does provide some computational cost analysis, but it could be more detailed.

💡 Suggestions

To address the limitations identified, I recommend several concrete, actionable improvements. First, the authors should explore adaptive masking strategies that dynamically adjust the mask based on the input sequence. For example, a learned mask matrix could be parameterized by a neural network conditioned on the input sequence, allowing the model to learn which tokens are most relevant for predicting the next token and potentially capturing long-range dependencies more effectively. The authors should also investigate sparse attention mechanisms in conjunction with the mask matrix to reduce computational complexity while retaining flexible contextual modeling; this would involve exploring different sparsity patterns and their impact on performance and efficiency. A sensitivity analysis over mask matrix initializations, with guidelines for selecting appropriate ones, would further enhance the robustness and reliability of the model.

Second, the authors should provide a detailed description of the neural network used for the denoising process, including the number of layers, the activation functions, the dimensionality of the hidden layers, and the specifics of the embedding layer. The training procedure should likewise be detailed: the optimization algorithm, the learning rate schedule, the batch size, the number of training epochs, and the hardware and software used for training and evaluation. Releasing the code and trained models would greatly facilitate the reproducibility and adoption of the proposed method.

Third, the authors should include a qualitative analysis of the generated text, providing examples of text generated by CoMD alongside text from existing methods. The analysis should highlight the incoherencies and inconsistencies that CoMD aims to address, such as repetitive or contradictory text; this would help motivate the need for the proposed method and demonstrate its advantages over existing approaches.

Fourth, the authors should evaluate CoMD on larger-scale models, such as those with 1B or more parameters. This involves not only training larger models but also analyzing the computational cost and memory requirements, as well as the potential need for distributed training strategies. The impact of architectural choices when scaling up, such as the depth and width of the transformer networks used in the diffusion process, should also be investigated; such an analysis would give a more comprehensive understanding of the method's limitations and potential for real-world applications.

Finally, the authors should conduct a more thorough analysis of the computational complexity of CoMD, including a breakdown of the time and memory requirements for both training and inference and how these vary with sequence length and model size. Comparing CoMD's complexity with that of other diffusion-based language models and autoregressive models would clarify its practical advantages and limitations, and the authors should explore implementation optimizations, such as more efficient matrix operations or model compression, to reduce its cost.
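The learned-mask suggestion above can be sketched concretely. The following is a minimal, hypothetical illustration, not the authors' method: it assumes per-token embeddings are available and scores each token's masking probability with a single learned linear unit; all names (`learned_mask_probs`, `sample_mask`, `w`, `b`) are ours.

```python
import math
import random

def learned_mask_probs(embeddings, w, b):
    """Per-token mask probabilities from a learned linear scorer.

    embeddings: list of token embedding vectors (list[list[float]])
    w, b: parameters of a single linear unit, trained jointly with the model.
    """
    probs = []
    for h in embeddings:
        logit = sum(wi * hi for wi, hi in zip(w, h)) + b
        probs.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return probs

def sample_mask(probs, rng=random):
    """Bernoulli-sample a binary mask (1 = masked) from per-token probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]
```

In a real model, `w` and `b` would be trained jointly with the denoiser, and the Bernoulli sampling would typically be relaxed (e.g., with a Gumbel-Softmax) to keep the mask differentiable.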

❓ Questions

1. Can the authors provide more details on the computational complexity of CoMD compared to other diffusion-based language models? Specifically, how do the fixed mask matrix and the coherent loss term impact the computational cost during training and inference?
2. How does the performance of CoMD scale with the size of the model and the dataset? It would be helpful to see results on larger models, such as those with 1B or more parameters, to assess the scalability and effectiveness of the proposed method in more complex scenarios.
3. Could the authors elaborate on the initialization of the fixed mask matrix? How sensitive is the model to different initializations, and what guidelines can be provided for selecting appropriate initializations?
4. How does the variable time parameter in CoMD affect the diffusion process, and what is the rationale behind its specific implementation?
5. Can the authors provide examples of text generated by CoMD and compare them to text generated by existing methods? This would help to understand the qualitative improvements in coherence and fluency.
6. How does CoMD handle noisy or incomplete input data, and what techniques could improve its robustness to such perturbations?
7. What are the potential trade-offs between computational efficiency and the quality of generated text in CoMD, and how can these trade-offs be managed in practical applications?
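Question 1 can be made concrete with back-of-envelope arithmetic. The sketch below is purely illustrative and ours, not the authors' accounting: it takes the paper's stated O(n·k²) per-step cost at face value (sequence length `n`, hidden width `k`) and folds everything else into a constant `c`.

```python
def diffusion_flops(n, k, steps, c=1.0):
    """Illustrative total FLOPs for a masked-diffusion pass that runs a
    fixed number of denoising steps, each costing c * n * k^2."""
    return c * n * k * k * steps

def autoregressive_flops(n, k, c=1.0):
    """Illustrative total FLOPs for an autoregressive baseline paying
    c * k^2 per generated token over n tokens (attention terms omitted)."""
    return c * n * k * k
```

Under this toy model the diffusion pass costs `steps` times the single autoregressive pass, so a "constant with respect to length" claim could only refer to the number of denoising steps, not total compute; that is one way to frame the ambiguity the question asks the authors to resolve.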

📊 Scores

Soundness: 2.75
Presentation: 2.25
Contribution: 2.75
Rating: 5.25

AI Review from ZGCA


📋 Summary

The paper introduces Coherent Masked Diffusion (CoMD), a discrete diffusion-style framework for next-token prediction that aims to improve coherence and efficiency over prior masked diffusion language models (MLD/MDLM). CoMD is claimed to differ from prior work in three ways: (1) it uses a fixed mask matrix independent of token identity and position (Section 3.1), (2) it adds a "coherent loss" term intended to guide predictions toward the "ground truth coherent probability" without extra samples per step (Section 3.2, Eqs. 2–8), and (3) it introduces a time-shift parameter to better capture the end of coherent text, applied both to the coherent loss gating and the deterministic sampler (Section 3.3, Eqs. 11–12). The paper asserts constant-time inference/training with respect to sequence length (multiple places, e.g., Abstract, Sections 3.1/3.4), reports speedups over MLD/MDLM, and presents results on tasks including Sudoku/MNIST (presented as "image" tasks) and text benchmarks such as logic puzzles and SlimPajama. Ablations (Table 2) aim to quantify the contributions of the fixed mask, coherent loss, and time shift.

✅ Strengths

  • Targets a timely problem: improving coherence and efficiency of masked diffusion-based language models.
  • Proposes a clear set of three algorithmic modifications (fixed mask, coherent loss, time shift), which are easy to ablate conceptually (Sections 3.1–3.3).
  • Includes ablation intent (Table 2) to attribute performance to specific components.
  • Aims for practical efficiency (e.g., inference speedups, FLOPs reductions) and claims constant-computation characteristics relative to sequence length.
  • Positions contributions relative to both MLD and MDLM (Section 4.1), attempting to compare across multiple tasks (Sudoku/MNIST, logic puzzles, SlimPajama).

❌ Weaknesses

  • Severe contradictions and ambiguities undermine soundness and credibility: (a) parameter-count claims conflict—"75x fewer parameters than MLD/MDLM" (Abstract/Section 1) vs. "MLD and CoMD use an identical number of parameters" (Section 4.2), making all efficiency comparisons suspect; (b) training/inference complexity claims are internally inconsistent—both "constant with respect to length" (Abstract; Sections 3.1, 3.4) and O(n k^2) due to RoPE (Section 3.4) are stated; (c) the OOD perplexity/DA% metric (Eq. 13) is not mathematically coherent as written (PPL of an indicator set).
  • The "coherent loss" relies on a "ground truth coherent probability" (Section 3.2, Eqs. 3–5) that is never operationally defined: how are coherent tokens labeled or how is x_t^* obtained? The gating via Eq. 6 uses [MASK]-ratio heuristics rather than annotation of coherence, further obscuring the supervision signal.
  • Methodological clarity is insufficient: many equations include undefined symbols (e.g., k-lines, hyper-geometric interlacing distribution; Eqs. 1, 9–12), or mix incompatible distributions (claiming x_mask and z are independent Gaussians then writing x_mask|z ~ N(uz, Σ)). The sampler equations (Eqs. 9–10, 12) are unclear and appear ad hoc.
  • Experimental reporting is inconsistent and, in parts, implausible: multiple dataset name confusions ("Selenium" vs. Sudoku; "Sodomu/Suoko/Suko"), and incongruent modality choices (e.g., training TinyLlama for image tasks; using Llama 2 7B or even 27B on a single RTX A5000 GPU for SlimPajama is not credible as described).
  • Ablation and results inconsistencies: the text claims ℓ_coh improves performance most (Section 4.4), but the provided ablation table shows rows labeled “ℓcoh” with better perplexities than CoMD in some columns, contradicting the narrative. This raises concerns about table correctness and interpretation.
  • Insufficient reproducibility details: no random seeds, incomplete hardware/memory configs, and missing training/sampling schedules and hyperparameters for all baselines; unclear whether comparisons are controlled for parameter counts given the internal contradictions.
  • Writing/organization problems substantially hinder understanding: redundant section titles ("BACKGROUND AND BACKGROUND"), repeated/circular statements, typos, and mismatched notation throughout (e.g., inconsistent references to tasks, models, and metrics).

❓ Questions

  • Coherent supervision: How exactly is the "ground truth coherent probability" x_t^* obtained for each token (Section 3.2)? Is coherence annotated, heuristically derived, or defined as non-[MASK]? If heuristic, please provide the precise rule and its justification.
  • Loss definition: In Eq. 4–5, y_bce(x) appears to take x as a binary target with model probabilities p_theta(x) and p_theta(1−x). What is the modeled variable here (coherent vs. incoherent)? Please give the exact parameterization of p_theta and clarify how the one-hot x_t^* is used within BCE. Also clarify how Eq. 3’s indicator I[t in [s, t + r]] is determined.
  • Time shift: What is the formal motivation for the time shift γ, beyond the statement about the end of text? Why not use an EoS token? Please provide ablations isolating γ_π and γ_coh with confidence intervals and explain how γ interacts with the sampler equations (Eqs. 10, 12).
  • Parameter counts: Reconcile the contradiction between claims of "75x fewer parameters" vs. "identical number of parameters" with MLD/MDLM. Provide a table listing parameter counts for every compared model (base, noise/coherent components), and ensure all speedups/FLOPs are normalized to equal parameter counts.
  • Compute complexity: You claim inference and training are constant w.r.t length, but also state training FLOPs grow as O(n k^2) due to RoPE (Section 3.4). Please resolve this inconsistency and provide precise per-step and per-token complexity for all methods compared.
  • OOD perplexity (Eq. 13): The formulation DA%(δ) = E[PPL(I(...))] is unclear and appears incorrect. Please provide a mathematically precise definition of the metric, including the role of δ, the expectation domain, and how masking ratios map to perplexity curves.
  • Datasets and tasks: Please correct the dataset names and task modalities (e.g., Sudoku/Selenium/Suoko/Suko, MNIST as image vs. text). Explain how Sudoku is converted to the proposed masked diffusion setup and how [MASK] selection ties to solver evaluation.
  • Baselines: How are MLD/MDLM implemented here (especially with RoPE)? What exact samplers (score vs. deterministic) and guidance settings are used per method? Are these tuned equally for all methods on validation data?
  • Ablation table (Table 2): Rows labeled "ℓcoh" appear to outperform the full CoMD in some cases, contradicting the text that ℓ_coh helps most. Please verify the table contents and reconcile with the narrative. If labeling indicates removal of components, rename rows unambiguously (e.g., "w/o ℓcoh").
  • Reproducibility: Please include random seeds, exact training schedules, optimizer configs, data splits, and code for the sampler. If possible, release code or a minimal reproducible script for at least one benchmark (e.g., SlimPajama).
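The Eq. 4–5 question above admits one self-consistent reading that the authors could confirm or correct: treat the model's output as a per-token Bernoulli probability of "coherent" and apply standard binary cross-entropy, so that p_theta(x) and p_theta(1−x) collapse to the usual two BCE terms. A minimal sketch of that reading (our construction; all names are ours):

```python
import math

def bce_coherent_loss(p_coh, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted per-token coherence
    probabilities p_coh and binary targets (1 = coherent, 0 = incoherent).

    Under this reading of Eqs. 4-5, p_theta(x) for target 1 and
    p_theta(1 - x) for target 0 are the two standard BCE terms.
    """
    total = 0.0
    for p, x in zip(p_coh, targets):
        total += -(x * math.log(p + eps) + (1 - x) * math.log(1 - p + eps))
    return total / len(p_coh)
```

Whether the one-hot x_t^* enters as these binary targets, and how the Eq. 3 indicator gates the sum, is exactly what the question asks the authors to specify.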

⚠️ Limitations

  • The paper’s notion of coherence is not operationally defined and may collapse to a heuristic related to masking ratios (Eq. 6), which risks entangling the loss with the corruption process rather than linguistic or semantic coherence.
  • If the coherent model acts as a filter over a subset of tokens, it could bias generation toward frequent or superficial patterns, potentially degrading diversity and reinforcing dataset biases.
  • Claims about constant compute with respect to sequence length conflict with the stated O(n k^2) cost due to RoPE; if the latter holds, scaling to long sequences may still be costly.
  • Evaluation focuses on synthetic or narrow tasks (e.g., logic puzzles as described) with unclear metrics and lacks qualitative analyses of generated text; this limits conclusions about real-world utility or robustness.
  • Potential negative societal impacts are not discussed. If the coherent loss implicitly rewards conformity to majority patterns, it might amplify social biases present in training data. Consider fairness audits or toxicity/diversity analyses.

🖼️ Image Evaluation

Cross‑Modal Consistency: 18/50

Textual Logical Soundness: 10/30

Visual Aesthetics & Clarity: 8/20

Overall Score: 36/100

Detailed Evaluation (≤500 words):

1. Cross‑Modal Consistency

• Major 1: Claimed 7.3×/10.5× inference speedups are not supported by reported times. Evidence: Abstract: “achieves an inference speedup of 7.3x and 10.5x”; Table 1: MDLM 61.01 s vs CoMD 61.41 s.

• Major 2: “Constant” compute vs later O(nk²) contradiction. Evidence: Abstract/§3: “Both inference and training computation are constant…”; §3.4: “total FLOPs per second grow as O(nk²)”.

• Major 3: Task/caption mismatch (Sudoku vs Selenium). Evidence: §4.2: “The Sudoku benchmark…”; Table 1 caption: “Length 512 Selenium Test”.

• Major 4: Missing cross‑refs. Evidence: §3.1: “(Table ??)”; §3.4: “(Table ??)”.

• Minor 1: Model naming wavers (MDLM vs MLDLM), e.g., §1–2 use both for different concepts.

• Minor 2: Dataset names vary (Sudoku/Sudo/Suoko/Suko).

2. Text Logic

• Major 1: Loss/sampler definitions are ill‑posed or syntactically invalid. Evidence: Eq. (11) subtracts an indicator term; Eqs. (9–10) have mismatched indices/parentheses.

• Major 2: Undefined or nonstandard constructs block understanding. Evidence: §2 MLD: “hyper‑geometric interlacing distribution,” “k‑lines one‑hot vector” (not defined).

• Minor 1: Repetition/duplication (e.g., §4.1 “For image data…” appears twice).

• Minor 2: Claims of 75× parameter savings and 1.7–1.8× FLOP reductions lack concrete tables/figures specifying model sizes/training regimes.

3. Figure Quality

Visual ground truth

• Table 1: Columns Model, Acc.%↑, DA%↓, Time(s)↓; task caption says “Selenium Test” though text discusses Sudoku.

• Table 2: Labeled (a) accuracy on “Selenium” and (b) perplexity on SlimPajama; presented as a single HTML table with two “Model” blocks; rows “k Learned,” “ℓcoh,” “γ,” “ℓcoh-γ” are ablations.

• Major 1: DA% metric undefined/contradictory. Evidence: §4.2: “DA% (area under the OOD perplexity)”; Table 1 header “DA%↓” without units/definition; Eq. (13) is incoherent.

• Major 2: Table 2 merges two sub‑tables into one, obscuring which columns belong to (a) vs (b). Evidence: “Table 2: Ablations on Selenium and SlimPajama Test (a)… (b)…”.

• Minor 1: Abbreviations (↑/↓, DA%) and dataset names not explained in captions; readers cannot interpret tables alone.

• Minor 2: Inconsistent precision (e.g., “1.13±2.2”) suggests formatting/measurement issues.
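The DA% ambiguity flagged above (Major 1) has at least one self-consistent reading: evaluate perplexity at several out-of-distribution masking ratios and report the area under the resulting curve, so that lower is better as the ↓ arrow in the Table 1 header suggests. A hedged sketch of that reading (our construction, not the paper's Eq. 13):

```python
def area_under_ppl_curve(mask_ratios, ppls):
    """Trapezoidal area under a perplexity-vs-masking-ratio curve.

    One self-consistent reading of a 'DA%'-style metric: evaluate
    perplexity at several masking ratios delta and integrate over delta.
    A lower area indicates more robustness to out-of-distribution masking.
    """
    if len(mask_ratios) != len(ppls) or len(mask_ratios) < 2:
        raise ValueError("need matching lists with at least two points")
    pts = sorted(zip(mask_ratios, ppls))
    area = 0.0
    for (d0, p0), (d1, p1) in zip(pts, pts[1:]):
        area += 0.5 * (p0 + p1) * (d1 - d0)  # trapezoid between grid points
    return area
```

Whether this matches the authors' intent for Eq. 13 is precisely what the review asks them to clarify.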

Key strengths:

  • Ambitious attempt to incorporate a coherence‑aware loss and time‑shift into masked diffusion LMs.
  • Ablation intent (fixed mask, loss, time‑shift) could be insightful if clarified and cleanly reported.

Key weaknesses:

  • Pervasive figure–text mismatches (missing tables, mislabeled tasks, unsupported speedup claims).
  • Core math is under‑specified/incorrect, blocking reproducibility (loss/sampler equations, undefined distributions).
  • Tables are confusing and insufficiently captioned; key metrics (DA%) are undefined or incoherent.

📊 Scores

Originality: 3
Quality: 1
Clarity: 1
Significance: 2
Soundness: 1
Presentation: 1
Contribution: 2
Rating: 2

AI Review from SafeReviewer


📋 Summary

This paper introduces Coherent Masked Diffusion (CoMD), a novel framework designed to enhance the generation of coherent text using masked diffusion models. Building upon the existing Masked Language Diffusion (MLD) model, CoMD incorporates three key innovations: a fixed mask matrix, a coherent loss term, and a variable time parameter. The fixed mask matrix aims to improve computational efficiency by avoiding the need to recompute the mask at each step, while the coherent loss term is designed to guide the model towards generating more coherent text by explicitly modeling the coherent and incoherent distributions. The variable time parameter is introduced to better model the end of coherent text. The authors evaluate CoMD on a variety of tasks, including image generation (Sudoku), text generation (logic puzzles and SlimPajama), and language modeling. The empirical results demonstrate that CoMD achieves performance improvements over baseline models, particularly in terms of perplexity and accuracy on the Sudoku task. However, the paper also acknowledges that the model does not outperform autoregressive models in all tasks, and that the computational efficiency gains are primarily observed during inference, not training. The authors claim that CoMD is significantly more compute and parameter efficient than autoregressive models, but this claim is not fully supported by the experimental results. Overall, the paper presents an interesting approach to improving masked diffusion models for text generation, but it also highlights several areas that require further investigation and clarification.

✅ Strengths

I found several aspects of this paper to be commendable. The introduction of a fixed mask matrix is a notable contribution, as it addresses the computational overhead associated with dynamic masking in previous masked diffusion models. This approach has the potential to significantly improve the efficiency of these models, particularly during inference. The authors' attempt to explicitly model coherent and incoherent distributions through the introduction of a coherent loss term is also a valuable contribution. This approach demonstrates a clear understanding of the challenges associated with generating coherent text and represents a novel attempt to address these challenges within the context of masked diffusion models. The experimental results, particularly on the Sudoku task, demonstrate the effectiveness of the proposed approach. The authors show that CoMD achieves a lower perplexity and higher accuracy compared to baseline models, indicating that the proposed innovations are indeed beneficial. The paper also provides a detailed description of the proposed method, including the mathematical formulations and implementation details. This level of detail is essential for reproducibility and allows other researchers to build upon the work presented in this paper. Finally, the authors have conducted a comprehensive ablation study to evaluate the impact of each component of CoMD, which is crucial for understanding the contribution of each innovation.

❌ Weaknesses

While the paper presents several interesting ideas, I have identified several weaknesses that warrant further discussion. Firstly, the paper's claim of significant computational efficiency compared to autoregressive models is not fully supported by the experimental results. While the authors demonstrate that CoMD achieves a speedup over other masked diffusion models, the comparison with autoregressive models is not direct, as different models are used (e.g., TinyLLaMA vs. Llama 2). Furthermore, the paper does not provide a detailed analysis of the computational cost of training, which is a significant factor when considering the overall efficiency of a model. The authors acknowledge that training computation grows as O(nk^2), but they do not provide a direct comparison of training time or FLOPs with autoregressive models. This makes it difficult to assess the true computational benefits of CoMD.

Secondly, the paper's explanation of the coherent loss term and the concept of coherent probability is not sufficiently clear. The authors introduce the notation 'p_e(x^c_t)' without a clear definition, and the explanation of how the coherent loss term guides the model towards coherent text is somewhat vague. The paper states that the coherent loss term guides the model towards the 'ground truth coherent probability', but it does not provide a detailed explanation of what this ground truth probability represents or how it is calculated. This lack of clarity makes it difficult to fully understand the mechanism by which the coherent loss term improves the generation of coherent text.

Thirdly, the paper's use of the term 'coherent' is not always precise. While the paper provides some context for the term, it does not offer a formal definition, and the connection to existing literature on coherence in NLP is not explicitly made. This lack of a clear definition makes it difficult to assess the validity of the paper's claims about improving coherence.

Fourthly, the paper's experimental evaluation is limited in several respects. The paper does not include a comparison with a standard next-token-prediction autoregressive model on the same tasks, which makes it difficult to assess the relative performance of CoMD. The paper also does not evaluate the model on more challenging tasks that require long-range coherence, such as document-level text generation or story generation. This limits the generalizability of the paper's findings.

Finally, the paper's writing style is not always clear and concise. The mathematical notation is not always introduced before it is used, and the paper contains several typos and grammatical errors. This makes the paper difficult to read and understand, and it detracts from the overall quality of the work. For example, the paper uses the notation 'p_e(x^c_t)' before defining it, and the paper refers to 'MDLM' as 'MDLMs' inconsistently. The paper also states that the model is trained for '~3Ok steps', which is likely a typo. These issues, while seemingly minor, collectively impact the clarity and credibility of the paper.

💡 Suggestions

Based on the identified weaknesses, I recommend several concrete improvements. First, the authors should provide a more detailed analysis of the computational cost of training CoMD, including a direct comparison of training time and FLOPs with autoregressive models. This would provide a more accurate assessment of the computational efficiency of the proposed approach.

Second, the authors should provide a more precise definition of the coherent probability and a more detailed explanation of how the coherent loss term guides the model towards coherent text. This should include a clear explanation of what the 'ground truth coherent probability' represents and how it is calculated.

Third, the authors should provide a formal definition of the term 'coherent' and explicitly connect it to existing literature on coherence in NLP. This would clarify the paper's claims about improving coherence and allow for a more rigorous evaluation of the proposed approach.

Fourth, the authors should include a comparison with a standard next-token-prediction autoregressive model on the same tasks. This would provide a more direct comparison of the performance of CoMD and autoregressive models. The authors should also evaluate the model on more challenging tasks that require long-range coherence, such as document-level text generation or story generation. This would provide a more comprehensive assessment of the model's capabilities.

Fifth, the authors should revise the paper to improve the clarity and conciseness of the writing. This should include introducing all mathematical notations before they are used, correcting all typos and grammatical errors, and ensuring that the paper is easy to read and understand. For example, the authors should define 'p_e(x^c_t)' before using it, and they should correct the typo '3Ok steps'.

Finally, the authors should address the inconsistencies in the use of 'MDLM' and 'MDLMs'. These changes would significantly improve the quality and credibility of the paper.

❓ Questions

I have several questions that arise from my analysis of the paper. First, what is the precise definition of the 'ground truth coherent probability' that the coherent loss term is designed to guide the model towards? How is this probability calculated, and what is its relationship to the actual coherent distribution of the data?

Second, how does the fixed mask matrix affect the model's ability to capture long-range dependencies in text? Does the fixed mask matrix limit the model's ability to adapt to different sentence structures and lengths?

Third, what is the impact of the variable time parameter on the model's performance? How does the choice of the time shift parameter affect the model's ability to generate coherent text?

Fourth, how does the performance of CoMD compare to state-of-the-art autoregressive models on more challenging tasks that require long-range coherence, such as document-level text generation or story generation?

Fifth, what is the computational cost of training CoMD, and how does it compare to the computational cost of training autoregressive models? What are the trade-offs between the computational efficiency of CoMD and its performance on various tasks?

Finally, how does the model handle out-of-vocabulary tokens, and what is the impact of the vocabulary size on the model's performance? These questions target core methodological choices and assumptions, and they seek clarification of critical uncertainties in the paper.

📊 Scores

Soundness: 2.0
Presentation: 1.5
Contribution: 2.0
Confidence: 3.75
Rating: 3.0
