2510.0007 HEAL: Learning-Free Source Free Unsupervised Domain Adaptation for Cross-Modality Medical Image Segmentation v1

🎯 ICAIS2025 Accepted Paper

🎓 Meta Review & Human Decision

Decision: Accept

AI Review from DeepReviewer


📋 Summary

This paper introduces HEAL, a novel framework for Source-Free Unsupervised Domain Adaptation (SFUDA) in cross-modality medical image segmentation. The core contribution of HEAL lies in its ability to adapt a pre-trained model to a new, unlabeled target domain without requiring any target-specific training or parameter updates. This is achieved through a combination of hierarchical denoising, edge-guided selection, and size-aware fusion. The hierarchical denoising process refines initial pseudo-labels using entropy and Normal-Inverse Gaussian (NIG) uncertainty, aiming to mitigate error accumulation. Edge-guided selection employs a diffusion model to generate multiple samples conditioned on the refined pseudo-labels, selecting the most reliable sample based on structural consistency. Finally, size-aware fusion dynamically integrates the selected sample with the refined pseudo-labels based on the size of the segmentation targets. The authors evaluate HEAL on two medical image segmentation tasks: brain tumor segmentation (T1->T1ce and T2->FLAIR) and polyp segmentation (Kvasir-SEG to CVC-ClinicDB and vice versa). The experimental results demonstrate that HEAL outperforms existing SFUDA methods, highlighting its potential for practical applications where access to target data is restricted. The paper emphasizes the 'learning-free' characteristic of HEAL, which enhances computational efficiency and simplifies deployment, while also preserving the integrity of the pre-trained source model. This approach is particularly relevant in medical imaging, where data privacy and the availability of labeled data are significant challenges. The authors provide ablation studies to demonstrate the contribution of each component of the framework. Overall, the paper presents a well-structured and clearly articulated approach to SFUDA, with promising empirical results in the context of medical image segmentation. 
However, as I will discuss in the weaknesses section, there are several areas that require further investigation and clarification to fully assess the robustness and generalizability of the proposed method.
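To make the summary's description of hierarchical denoising concrete, here is a minimal numpy sketch of its first, entropy-based stage; the normalized-entropy formulation, the threshold value, and the use of -1 as an "ignore" label are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def entropy_mask(probs: np.ndarray, t1: float = 0.2) -> np.ndarray:
    """Voxel-wise entropy filter: keep predictions whose normalized
    entropy falls below t1. `probs` has shape (C, ...) and sums to 1
    over the class axis."""
    eps = 1e-8
    c = probs.shape[0]
    ent = -np.sum(probs * np.log(probs + eps), axis=0) / np.log(c)  # in [0, 1]
    return ent < t1  # True where the prediction is confident

def denoise_pseudo_labels(probs: np.ndarray, t1: float = 0.2) -> np.ndarray:
    """Return hard pseudo-labels with low-confidence voxels masked out
    (marked with the illustrative 'ignore' label -1)."""
    labels = probs.argmax(axis=0).astype(np.int64)
    labels[~entropy_mask(probs, t1)] = -1
    return labels

# Toy example: 3 classes over a 4x4 image
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
refined = denoise_pseudo_labels(probs, t1=0.5)
```

In the actual method this filtered map would then be passed to the second, NIG-based stage rather than used directly.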

✅ Strengths

I find several aspects of this paper to be commendable. The core idea of a 'learning-free' SFUDA framework, where no target-specific training is required, is a significant strength. This approach not only enhances computational efficiency but also simplifies deployment, making it particularly attractive for real-world medical applications where computational resources may be limited. The integration of hierarchical denoising, edge-guided selection, and size-aware fusion is another positive aspect. While these components are not entirely novel in isolation, their specific combination and application within the SFUDA context are well-executed and contribute to the overall effectiveness of the method. The hierarchical denoising process, which leverages both entropy and NIG uncertainty, is a clever way to refine pseudo-labels and mitigate error propagation. The edge-guided selection, using a diffusion model to generate multiple samples and selecting the most consistent one, is a robust approach to address the uncertainty inherent in pseudo-labels. The size-aware fusion, which dynamically integrates the selected sample with the refined pseudo-labels based on target size, is a practical and effective way to improve segmentation accuracy. The experimental results are also a strong point. The authors demonstrate that HEAL outperforms existing SFUDA methods on two different medical image segmentation tasks, using multiple datasets and domain adaptation directions. The ablation studies provide valuable insights into the contribution of each component of the framework, further supporting the effectiveness of the proposed approach. The paper is also well-written and organized, making it easy to follow the proposed methodology and understand the contributions. The introduction effectively sets the stage for the problem of domain shift in medical image segmentation and the need for SFUDA approaches. 
The figures and tables are clear and informative, further enhancing the readability of the paper. Overall, the combination of a novel 'learning-free' approach, effective integration of existing techniques, strong empirical results, and clear presentation makes this a valuable contribution to the field of medical image segmentation.

❌ Weaknesses

Despite the strengths of this paper, I have identified several weaknesses that warrant further discussion. First, while the authors present HEAL as a novel framework, the core components—hierarchical denoising, edge-guided selection, and size-aware fusion—are based on existing techniques. The novelty, therefore, lies primarily in the specific integration of these components within the SFUDA context. While this integration is effective, the paper lacks a detailed comparison with existing methods that employ similar techniques. This makes it difficult to fully assess the unique contribution of HEAL and its advantages over other possible combinations of these techniques. For example, the paper could have compared its hierarchical denoising approach with other uncertainty-based pseudo-label refinement techniques or its edge-guided selection with other methods that leverage structural information for sample selection. My confidence in this assessment is high, as the paper itself describes the components as integrations of existing techniques, and a more detailed comparison with similar existing methods is missing.

Second, the paper lacks a thorough discussion of the computational efficiency of HEAL. While the authors claim that HEAL is computationally efficient due to the absence of target-specific training, they do not provide a detailed analysis of the inference time or resource requirements. This is a critical aspect, especially for medical image segmentation tasks where real-time or near-real-time performance is often required. The paper should include a breakdown of the computational cost associated with each step of the HEAL framework, such as the diffusion model sampling, edge-guided selection, and size-aware fusion. Without this information, it is difficult to assess the practical applicability of HEAL, particularly in resource-constrained environments. My confidence in this assessment is high, as the paper mentions the hardware used for experiments but provides no quantitative analysis of inference time or resource requirements for the different components of HEAL.

Third, the paper only evaluates HEAL on two medical image segmentation tasks: brain tumor segmentation and polyp segmentation. While these are important tasks, they may not be representative of all medical image analysis problems. It would be beneficial to evaluate HEAL on other tasks, such as classification, detection, or registration, to demonstrate its generalizability and robustness. The limited scope of evaluation tasks raises questions about the applicability of HEAL to other medical image analysis problems. My confidence in this assessment is high, as the paper explicitly mentions evaluating HEAL only on two segmentation tasks.

Fourth, the paper does not compare HEAL with other non-SFUDA methods that use source or target data during adaptation. While the paper compares HEAL with 'No Adaptation' and 'Supervised' baselines, it lacks a comparison with other explicit non-SFUDA *adaptation* methods. This makes it difficult to understand the trade-offs and advantages of HEAL's source-free approach compared to methods that leverage source or target data during adaptation. My confidence in this assessment is high, as the paper compares with 'No Adaptation' and 'Supervised' baselines but lacks comparison with other explicit non-SFUDA *adaptation* methods.

Fifth, the paper does not discuss the ethical implications of using SFUDA methods, which may raise concerns about data privacy, security, or bias. While the paper highlights the 'learning-free' aspect as a way to preserve the source model and reduce risks associated with target data, it does not explicitly discuss broader ethical implications. This lack of discussion is a valid concern, especially given the increasing focus on responsible AI in healthcare. My confidence in this assessment is high, as the paper does not contain a dedicated section or discussion on the ethical implications of using SFUDA methods.

Sixth, the paper proposes using a diffusion model to generate samples conditioned on pseudo-labels, followed by edge detection on both the generated samples and pseudo-labels to select samples with high consistency. However, since pseudo-labels are generated from the model trained on the source domain, there is a significant risk that both the pseudo-labels and the generated samples will contain many erroneous regions. This raises concerns about the reliability of using edge consistency between the generated samples and pseudo-labels as a criterion for sample selection. My confidence in this assessment is high, as the method description confirms the use of pseudo-labels for diffusion model conditioning and edge-based selection, raising concerns about the impact of pseudo-label errors.

Seventh, the paper proposes using HD-refined pseudo-labels to calculate NIG distribution parameters, which are then used to generate a binary mask for refining the pseudo-labels. However, since the pseudo-labels initially contain many erroneous regions, there is a risk that the NIG distribution may not be accurately calculated, which could in turn affect the accuracy of the subsequent binary mask generation and further compromise the reliability of the refined pseudo-labels. My confidence in this assessment is high, as the method description shows that NIG parameters are calculated based on the entropy-refined pseudo-labels, which might still contain errors.

Finally, I identified a potential typo in the second formula of Section 2.1.1, where the term 'T_h' is used instead of 'T_1'. This is a minor issue, but it could lead to incorrect calculations. My confidence in this assessment is high, as the term 'T_h' in the second formula of Section 2.1.1 appears to be a typo and likely should be 'T_1'.
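For context on the NIG-related concern above, the variance in question appears to come from the Normal-Inverse-Gamma family used in evidential learning (a later review notes the "Normal-Inverse Gaussian" naming may be a misnomer). A standard parameterization, given here as a reference rather than as the paper's exact formulation, is:

```latex
% Normal-Inverse-Gamma prior (as in evidential deep learning):
%   \mu \mid \sigma^2 \sim \mathcal{N}(\gamma, \sigma^2/\nu), \qquad
%   \sigma^2 \sim \Gamma^{-1}(\alpha, \beta)
\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1}
\quad \text{(aleatoric)}, \qquad
\operatorname{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}
\quad \text{(epistemic)}, \qquad \alpha > 1 .
```

Both variances are undefined unless α > 1, which is exactly why parameters estimated from still-noisy, entropy-refined pseudo-labels can destabilize the subsequent binary-mask generation.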

💡 Suggestions

To address the identified weaknesses, I recommend several concrete improvements. First, the authors should provide a more rigorous analysis of the novelty of their approach. While the combination of existing techniques is effective, the paper needs to clearly articulate what makes this specific integration unique and superior to other possible combinations. A detailed comparison with existing methods that use similar components, highlighting the specific differences and advantages of HEAL, would be beneficial. For example, the authors could compare their hierarchical denoising approach with other uncertainty-based pseudo-label refinement techniques, or their edge-guided selection with other methods that leverage structural information for sample selection. This would help to better position the contribution of the work and justify its novelty.

Second, the paper should include a comprehensive analysis of the computational efficiency of the proposed method. This should include a detailed breakdown of the inference time for each component of the HEAL framework, such as the diffusion model sampling, edge-guided selection, and size-aware fusion. The authors should also provide a comparison of the computational cost of HEAL with existing SFUDA methods, including both training and inference times. This analysis should be performed on a standard hardware setup and should include the memory requirements of the method. This would provide a more complete picture of the practical applicability of HEAL and allow readers to assess its suitability for real-world medical image segmentation tasks. The authors should also discuss the potential for optimizing the implementation of HEAL to further improve its computational efficiency.

Third, the authors should evaluate HEAL on a wider range of medical image analysis tasks. While segmentation is a crucial task, other tasks such as classification, detection, and registration are also important in medical image analysis. Evaluating HEAL on these tasks would provide a more comprehensive understanding of its capabilities and limitations. For example, the authors could evaluate HEAL on a classification task using a dataset of medical images with different pathologies, or on a detection task using a dataset of medical images with multiple lesions. Furthermore, the authors should also consider evaluating HEAL on datasets with different imaging modalities, such as CT, MRI, and ultrasound, to assess its robustness to different types of data. This would provide a more complete picture of the applicability of HEAL in different clinical scenarios. The authors should also discuss the potential challenges and limitations of applying HEAL to these other tasks and provide suggestions for future research.

Fourth, the paper should include a comparison with other non-SFUDA methods that use source or target data during adaptation. This would provide a better understanding of the trade-offs and advantages of HEAL's source-free approach.

Fifth, the paper should include a more thorough discussion of the ethical implications of using SFUDA methods in medical image analysis. While SFUDA methods offer the advantage of not requiring access to source data, they may still raise concerns about data privacy, security, and bias. The authors should discuss these issues in detail and provide guidelines for responsible use of SFUDA methods. For example, the authors should discuss the potential for bias in the target data and how this may affect the performance of HEAL. They should also discuss the potential for adversarial attacks on the target data and how these attacks may compromise the security of the method. Furthermore, the authors should discuss the potential for misuse of SFUDA methods and how this can be prevented. This discussion should be based on a thorough understanding of the ethical principles and guidelines for using AI in healthcare. The authors should also provide recommendations for future research on the ethical implications of SFUDA methods.

Sixth, the authors should explore alternative strategies for generating more reliable conditioning masks for the diffusion model. One approach could involve incorporating uncertainty estimates from the initial pseudo-labels to guide the diffusion process. For instance, instead of directly conditioning on the hard pseudo-labels, the diffusion model could be conditioned on a probability map derived from the softmax output of the segmentation model, which would provide a measure of the model's confidence in each pixel classification. This would allow the diffusion model to focus on generating samples that are consistent with the high-confidence regions of the pseudo-labels, while also exploring the uncertainty regions to potentially correct errors. Furthermore, the authors could explore using an iterative refinement process where the diffusion model generates samples, which are then re-segmented to produce new pseudo-labels, and this process is repeated. This iterative approach could gradually improve the quality of the pseudo-labels and the consistency between the generated samples and the true target domain distribution.

Seventh, the authors should explore alternative selection criteria that are less sensitive to noisy pseudo-labels. For example, they could explore using a combination of edge consistency and other metrics, such as the similarity of the intensity distributions between the generated samples and the target images. Additionally, the authors could investigate using a learned similarity metric, where a separate network is trained to predict the quality of the generated samples based on their similarity to the target domain. This would allow for a more robust and adaptive selection process that is less reliant on the accuracy of the initial pseudo-labels. The authors should also provide a more detailed analysis of the impact of the edge consistency threshold on the performance of the method, as the optimal threshold may vary depending on the specific dataset and the quality of the initial pseudo-labels.

Eighth, the authors should explore alternative methods for refining the pseudo-labels that are less sensitive to the initial quality of the pseudo-labels. For example, they could consider using a moving average of the pseudo-labels over multiple iterations, which would help to smooth out the noise and reduce the impact of individual erroneous predictions. They could also explore using a conditional random field (CRF) to refine the pseudo-labels, which would enforce spatial consistency and improve the quality of the segmentation boundaries. Furthermore, the authors should provide a more detailed analysis of the impact of the NIG distribution parameters on the performance of the method, as the optimal parameters may vary depending on the specific dataset and the quality of the initial pseudo-labels. A sensitivity analysis of these parameters would be crucial to understand the robustness of the proposed method.

Finally, the authors should correct the typo in the second formula of Section 2.1.1, replacing 'T_h' with 'T_1'.
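The moving-average suggestion above can be sketched as a simple exponential moving average over successive pseudo-label probability maps; `ema_update`, the decay value, and the toy data are illustrative, not from the paper.

```python
import numpy as np

def ema_update(avg_probs, new_probs, decay=0.9):
    """Exponential moving average of pseudo-label probability maps.
    Smooths out per-iteration noise before hard thresholding.
    `avg_probs=None` starts a fresh average."""
    if avg_probs is None:
        return new_probs.copy()
    return decay * avg_probs + (1.0 - decay) * new_probs

# Toy usage: three noisy probability maps for one foreground class
rng = np.random.default_rng(1)
avg = None
for _ in range(3):
    noisy = np.clip(0.7 + 0.1 * rng.normal(size=(4, 4)), 0.0, 1.0)
    avg = ema_update(avg, noisy)
smoothed_labels = (avg > 0.5).astype(np.uint8)  # hard labels from the smoothed map
```

A CRF pass could then be applied on top of the smoothed map; that step is omitted here since it needs a dedicated library.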

❓ Questions

Based on my analysis, I have several questions that I believe are important for further understanding the proposed method. First, how sensitive is the performance of HEAL to the choice of hyperparameters, such as the entropy threshold T1 and the NIG distribution variance threshold T2? The paper mentions setting these thresholds to 0.2 but does not justify this choice or explore the sensitivity of the method to different values. A more thorough analysis of how these thresholds affect the performance of HEAL is needed, including a discussion of the trade-offs involved in selecting different values.

Second, can the authors provide more details on the computational cost of the diffusion model component, especially in terms of inference time and resource requirements? The paper lacks a detailed analysis of the computational cost or inference time of HEAL. It claims computational efficiency due to the 'learning-free' nature but doesn't quantify it. A breakdown of the computational cost associated with each step of the HEAL framework, such as the diffusion model sampling, edge-guided selection, and size-aware fusion, is needed.

Third, how does HEAL compare with other non-SFUDA methods in terms of performance, computational cost, and ethical implications? The paper compares HEAL with 'No Adaptation' and 'Supervised' baselines, but it lacks a comparison with other explicit non-SFUDA *adaptation* methods. A comparison with other methods that use source or target data during adaptation would provide a better understanding of the trade-offs and advantages of HEAL's source-free approach.

Fourth, how does HEAL perform on other medical image analysis tasks, such as classification, detection, or registration? The paper only evaluates HEAL on two medical image segmentation tasks. Evaluating HEAL on other tasks would provide a more comprehensive understanding of its capabilities and limitations.

Fifth, how does HEAL handle the cases where the target domain has multiple modalities or heterogeneous data sources? The paper does not discuss the potential challenges and limitations of applying HEAL to more complex scenarios.

Sixth, how does HEAL address the ethical issues of data privacy, security, and bias in medical image analysis? The paper does not explicitly discuss the ethical implications of using SFUDA methods. A more thorough discussion of these issues is needed.

Finally, what is the impact of the number of generated samples 'n' in the edge-guided selection process on the performance of HEAL? The paper mentions generating 'n' samples but does not discuss the impact of this parameter on the quality of the selected samples and the overall performance of the method. A discussion of the trade-offs between computational cost and performance when varying 'n' would be beneficial.
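The threshold-sensitivity analysis requested in the first question could look like the following sketch, which sweeps an entropy threshold on synthetic two-class predictions and reports the retained-voxel fraction alongside a Dice score. Everything here (the synthetic data, the threshold grid, the `sweep_entropy_threshold` helper) is illustrative, not the paper's protocol.

```python
import numpy as np

def dice(pred, gt):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def sweep_entropy_threshold(probs, gt, thresholds):
    """For each candidate entropy threshold, report how many voxels
    survive filtering and the Dice of the surviving foreground voxels."""
    eps = 1e-8
    c = probs.shape[0]
    ent = -np.sum(probs * np.log(probs + eps), axis=0) / np.log(c)
    fg = probs.argmax(axis=0) == 1  # predicted foreground
    results = []
    for t in thresholds:
        keep = ent < t  # confident voxels only
        results.append((t, keep.mean(), dice(fg & keep, gt)))
    return results

# Toy sweep on synthetic two-class predictions
rng = np.random.default_rng(2)
logits = rng.normal(size=(2, 8, 8))
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
gt = logits[1] > 0  # treat the noiseless sign as ground truth
table = sweep_entropy_threshold(probs, gt, [0.1, 0.2, 0.5, 0.9])
```

On real data the same loop would expose the precision/coverage trade-off the question asks about: tighter thresholds keep fewer but cleaner voxels.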

📊 Scores

Soundness: 2.25
Presentation: 2.5
Contribution: 2.5
Rating: 5.5

AI Review from ZGCA


📋 Summary

The paper proposes HEAL, a learning-free, inference-only source-free unsupervised domain adaptation (SFUDA) framework for cross-modality medical image segmentation. A source-trained segmentation model and a source-trained diffusion model are used without any parameter updates on the target domain. The pipeline comprises: (1) Hierarchical Denoising (HD), which refines pseudo-labels from the source model on the target images using a two-stage uncertainty filter (voxel-wise entropy then Normal-Inverse Gaussian variance), (2) Edge-Guided Selection (EGS), which generates multiple source-like samples via a diffusion model conditioned on refined pseudo-labels and selects the one with maximal edge consistency (Canny alignment) to the condition, and (3) Size-Aware Fusion (SAF), which fuses small structures from the HD-refined pseudo-labels with large structures from the segmentation of the selected source-like sample. Experiments on BraTS 2021 (T1→T1ce, T2→FLAIR) and two polyp datasets (Kvasir-SEG↔CVC-ClinicDB) report strong improvements over several SFUDA baselines. Ablations analyze HD, EGS, and SAF; qualitative visualizations and t-SNE plots aim to explain component contributions.
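The SAF step described above can be sketched as a per-class switch on structure size, assuming mutually exclusive class labels and a single voxel-count threshold (`size_thresh`); both are illustrative simplifications, and the weaknesses below note that BraTS composite regions violate the exclusivity assumption.

```python
import numpy as np

def size_aware_fusion(pseudo, source_like, classes, size_thresh):
    """Per class: if the structure is small (few voxels in the refined
    pseudo-label), keep the pseudo-label; otherwise trust the
    segmentation of the selected source-like sample."""
    fused = np.zeros_like(pseudo)
    for c in classes:
        p_mask = pseudo == c
        s_mask = source_like == c
        chosen = p_mask if p_mask.sum() < size_thresh else s_mask
        fused[chosen] = c
    return fused

# Toy example: class 1 is "large" (taken from the source-like
# segmentation), class 2 is "small" (kept from the pseudo-label).
pseudo = np.array([[1, 1, 0], [0, 2, 0], [0, 0, 0]])
src = np.array([[1, 0, 0], [0, 0, 2], [2, 0, 0]])
fused = size_aware_fusion(pseudo, src, classes=[1, 2], size_thresh=2)
# fused == [[1, 0, 0], [0, 2, 0], [0, 0, 0]]
```

Note that with overlapping (composite) classes the later class in the loop overwrites the earlier one, which is precisely the ambiguity raised for WT/TC/ET below.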

✅ Strengths

  • Conceptually novel learning-free, inference-only SFUDA pipeline; no target-domain parameter updates (Section 1, learning-free characteristic).
  • Clever integration of a source-trained diffusion model to re-render target structures as source-like images, sidestepping direct cross-modality recognition at inference (Sections 2.2, Conclusion).
  • Modular design (HD, EGS, SAF) that targets complementary failure modes: denoise pseudo-labels, select structurally consistent generated samples, and fuse by size (Sections 2.1–2.3).
  • Compelling reported gains on BraTS across WT/TC/ET vs SFUDA baselines (Table 1) and competitive results on polyps (Table 2).
  • Useful ablations and qualitative analysis: HD vs Entropy-only (Table 3), HD+EGS vs full HEAL (Figure 2), uncertainty maps and qualitative improvements (Figure 3), feature-space visualization (Figure 4).

❌ Weaknesses

  • Inconsistent statistical reporting and lack of significance testing for the central brain tumor results (Table 1 reports only means, unlike Table 2). This undermines claims of robustness and SOTA gains.
  • Ambiguous evaluation protocol for SFUDA: the paper states no target split is used since target is not used for training, yet several baselines (e.g., UPL) do train on target unlabeled data (Section 3.1.1). It is unclear whether baselines were adapted and evaluated on the same target data or whether a fair held-out target test set was used for all methods.
  • Method clarity gaps: NIG denoising specifies functional dependence for α, β, ω on Y_T and entropy, but γ is not explicitly defined, and the positivity/α>1 conditions are not guaranteed by the chosen constants (Section 2.1.2). The hyperparameters κ, ζ1, ζ2, η1, η2 are not reported in Implementation Details, hampering reproducibility.
  • EGS relies on edge alignment between generated images and Y_T*: sensitivity to noisy pseudo-labels and edge thresholds (Canny typically uses two thresholds) is not studied; only a single 0.1 threshold is reported (Section 3.1.2). No ablation on number of generated samples n or diffusion steps is provided.
  • SAF may be ill-specified for BraTS composite regions (WT, TC, ET are derived composites in standard evaluation). It is unclear how per-class fusion is defined when the target regions are not mutually exclusive in the metrics pipeline (Section 2.3).
  • Claims of computational efficiency and privacy are asserted but not empirically quantified. Diffusion sampling (n=6, 250 steps) for 3D MRI can be costly; no runtime or memory comparison vs self-training SFUDA baselines is provided.
  • Reproducibility omissions: no random seeds, incomplete batch-size details across stages, no hyperparameter search protocol, limited details for polyp generation (2D diffusion vs cited 3D Med-DDPM).
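To make the EGS sensitivity concern concrete, here is a minimal stand-in for the selection step: a binarized gradient magnitude replaces Canny (whose two hysteresis thresholds the review asks about), and a Dice overlap plays the role of the paper's S-score. All names and thresholds here are hypothetical.

```python
import numpy as np

def edge_map(img, thresh=0.5):
    """Crude stand-in for Canny: binarized gradient magnitude,
    thresholded relative to the image's maximum gradient."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    return mag > thresh * (mag.max() + 1e-8)

def edge_consistency(a, b):
    """Dice overlap between two binary edge maps."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-8)

def select_sample(generated, condition_edges):
    """Pick the generated sample whose edges best match the
    pseudo-label (condition) edges."""
    scores = [edge_consistency(edge_map(g), condition_edges) for g in generated]
    return int(np.argmax(scores)), scores

# Toy example: the sample containing the conditioned-on square wins
# over a blank sample.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
cond = edge_map(img)
best, scores = select_sample([np.zeros((8, 8)), img], cond)
```

The brittleness noted above is visible even in this sketch: if `cond` is computed from a wrong pseudo-label, the highest-scoring sample is the one that best reproduces the error.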

❓ Questions

  • Evaluation protocol: Did the compared SFUDA baselines adapt on the same target set that was also used for evaluation, or was a held-out target test set used for all methods? Please clarify the target-domain splitting protocol for HEAL, No Adaptation, Supervised, and each baseline to ensure fairness.
  • Statistical rigor: For Table 1 (BraTS), please report results over multiple runs with mean±std and provide statistical significance tests versus the strongest baseline(s).
  • NIG denoising details: Please provide explicit formulas for γ and the exact numeric values for κ, ζ1, ζ2, η1, η2 and ε. How do you ensure α>1 everywhere so that Var(NIG) is well-defined? Include a sensitivity analysis to these hyperparameters and to the thresholds τ1, τ2.
  • EGS robustness: Canny commonly uses two thresholds; what exact parameters were used for both the generated images and Y_T*? Please ablate the number of generated samples n and the Canny thresholds, and report how performance varies.
  • Diffusion conditioning and modality specifics: In Section 2.2 you condition on Y_T^*, while the Conclusion states conditioning on the target image. Which is correct? For polyp segmentation (2D), what diffusion architecture was used (since Med-DDPM is 3D)? Please detail training data, conditioning signal, and sampling schedule for both brain and polyp domains.
  • SAF and BraTS compositional regions: WT/TC/ET in BraTS are composite masks for evaluation. How is SAF implemented to avoid logical inconsistencies when fusing classes that are not mutually exclusive under standard metric derivation?
  • Runtime and compute: Please report wall-clock times, GPU memory, and total FLOPs for HEAL’s target inference (including diffusion sampling) relative to training-based SFUDA methods. Include a run-time vs accuracy trade-off (varying n or diffusion steps).
  • Privacy: Can you discuss and, if possible, empirically evaluate memorization risks in the diffusion model when generating source-like images? How does this affect the privacy claim compared to training-based SFUDA on target data?

⚠️ Limitations

  • Performance depends on the initial pseudo-label quality; under large domain shift, entropy/NIG denoising may mask out too much signal or reinforce erroneous structure (acknowledged in the paper’s limitations).
  • EGS can be brittle if Y_T* edges are weak or noisy; selection may prefer structurally inconsistent generations when the condition itself is unreliable.
  • The approach requires a well-trained, domain-specific diffusion model on the source; training and maintaining such a model for each source domain is non-trivial and computationally expensive.
  • Potential privacy considerations: diffusion models can memorize training data; using them to synthesize source-like images might carry leakage risks if not properly evaluated.
  • Scalability and efficiency: diffusion sampling for 3D volumes is expensive; the paper claims efficiency due to no target training but does not provide runtime evidence or trade-off studies.
  • Methodological ambiguity for composite-region metrics (BraTS) when applying size-aware fusion across non-mutually-exclusive targets.

🖼️ Image Evaluation

Cross-Modal Consistency: 36/50

Textual Logical Soundness: 22/30

Visual Aesthetics & Clarity: 15/20

Overall Score: 73/100

Detailed Evaluation (≤500 words):

1. Cross-Modal Consistency

• Minor 1: Visual ground truth

– Figure 1/(a–e): (a) Diffusion pre‑training pipeline; (b) nnUNet pre‑training; (c) HD pipeline with entropy→NIG→Var(NIG) mask; (d) EGS with Canny edges and S-score; (e) SAF fusing MS(IB) with YT*. Overall: end‑to‑end workflow from source pre‑training to target inference.

– Figure 2/(a–d): Bar charts (Dice) and line plots (ASD) for T1→T1ce and T2→FLAIR; configs Baseline/HD/HD+EGS/HEAL.

– Figure 3: Qualitative HD ablation per case: target image, No‑Adapt, entropy map, NIG uncertainty, HD‑refined.

– Figure 4/(a–b): t‑SNE showing Source, Source‑like, Target clusters.

• Major 1: Direction inconsistency. Sec. 3.3 mentions “FLAIR→T2” while the paper uses “T2→FLAIR.” Evidence: “Figure 2 (c)… in the FLAIR → T2 direction.”

• Major 2: Reported ASD values ambiguous vs figures/tables (e.g., “further diminishes ASD to 2.8 mm and 2.6 mm, respectively”). It’s unclear which classes/directions these refer to, and Table 1 lists mean ASD 2.0 and 2.6. Evidence: Sec 3.3 sentence with “2.8 mm and 2.6 mm.”

• Minor 2: Equation (1) text says P(v|c) but formula uses P(c|v).

• Minor 3: Figure 3 column labels (e.g., “T1”, “T1ce”) are not explained in caption; may confuse with direction.

2. Text Logic

• Major 1: “NIG” is referred to as Normal‑Inverse Gaussian, but equations and usage match the Normal‑Inverse‑Gamma prior. This affects the HD derivation and Var(NIG). Evidence: Sec 2.1.2 phrase “Normal-Inverse Gaussian (NIG)” with Eqs. (3–5).

• Minor 1: Missing/awkward punctuation in multiple numeric comparisons (e.g., “16.4% 25.7%”).

• Minor 2: “learning‑free” claim is consistent, but the diffusion model is used at target time; clarify no target‑time fine‑tuning anywhere.

3. Figure Quality

• Major 1: Several critical labels are tiny (Figure 1 icons/text “Reverse Diffusion Process,” “Model Frozen”; Figure 3 colorbars/labels). Risk of illegibility at print size. Evidence: Fig. 1 panels (c–e) dense pipelines with small annotations.

• Minor 1: Figure 2 bars/lines lack numeric labels; hard to verify stated deltas quickly.

• Minor 2: Figure 3 needs clearer column headers and a legend explaining heatmaps/uncertainty units.

Key strengths:

  • Clear end‑to‑end workflow and genuinely “learning‑free” target adaptation.
  • Strong quantitative gains; Table 1 supports SOTA among SFUDA for brain tumor; balanced discussion for polyp (best/second‑best).
  • Ablations and t‑SNE provide complementary evidence.

Key weaknesses:

  • Core probabilistic prior misnamed (Gaussian vs Gamma) and variance expression unclear.
  • Direction/metric wording inconsistencies and ambiguous ASD statements.
  • Overcrowded, small‑font annotations reduce figure self‑sufficiency (especially Fig. 3 and parts of Fig. 1).

Actionable suggestions:

  • Correct the NIG terminology/derivation, define parameters and the exact variance used.
  • Standardize direction names (always T2→FLAIR) and specify whether numbers are class‑wise or mean.
  • Add numeric labels to Fig. 2; enlarge critical text in Fig. 1/3; clarify Fig. 3 columns and add legends.

📊 Scores

Originality: 3
Quality: 2
Clarity: 2
Significance: 3
Soundness: 2
Presentation: 2
Contribution: 3
Rating: 4

AI Review from SafeReviewer


📋 Summary

This paper introduces HEAL, a novel source-free unsupervised domain adaptation (SFUDA) framework designed for cross-modality medical image segmentation. The core contribution of HEAL lies in its ability to adapt a pre-trained segmentation model to a target domain without requiring any target-specific training or parameter updates. This is achieved through a combination of hierarchical denoising, edge-guided selection, and size-aware fusion techniques. The method leverages a diffusion model to generate source-like samples from the target domain, which are then used to refine the initial pseudo-labels obtained from the pre-trained model. The hierarchical denoising process employs both entropy and Normal-Inverse Gaussian (NIG) uncertainty measures to refine these pseudo-labels. Edge-guided selection is used to choose the most reliable generated sample based on structural consistency. Finally, size-aware fusion dynamically combines the refined pseudo-labels and the selected generated sample to produce the final segmentation. The authors evaluate HEAL on three public datasets, including two brain tumor segmentation datasets and one polyp dataset, demonstrating its effectiveness in cross-modality medical image segmentation. The experimental results show that HEAL achieves competitive performance compared to existing SFUDA methods. The key idea is that the method operates solely through inference, without any learning or fine-tuning on the target domain, which enhances computational efficiency and preserves the integrity of the pre-trained source model. This approach aims to address the challenges of domain shift and the lack of labeled data in medical imaging, offering a practical solution for adapting models to new, unseen target domains. The paper's emphasis on a 'learning-free' adaptation process, where the pre-trained model's parameters remain fixed, is a central theme. 
However, this aspect also raises questions about the true novelty and the scope of the method's applicability, particularly in scenarios with significant domain shifts.
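As I read the summary, the entropy-based stage of hierarchical denoising amounts to masking out pixels where the pre-trained model's prediction is uncertain. The sketch below illustrates that idea only; the threshold `tau`, the `ignore_index` convention, and the omission of the NIG term are my assumptions, not details from the paper.

```python
import numpy as np

def entropy_mask(probs, tau=0.5):
    """Per-pixel normalized entropy of softmax probabilities.

    probs: (C, H, W) array of class probabilities summing to 1 over C.
    Returns a boolean (H, W) mask that is True where the normalized
    entropy is below tau, i.e. where the model is confident.
    """
    eps = 1e-8
    ent = -(probs * np.log(probs + eps)).sum(axis=0)  # (H, W)
    ent /= np.log(probs.shape[0])                     # normalize to [0, 1]
    return ent < tau

def refine_pseudo_labels(probs, tau=0.5, ignore_index=255):
    """Hard pseudo-labels with low-confidence pixels marked as ignore."""
    labels = probs.argmax(axis=0)
    labels[~entropy_mask(probs, tau)] = ignore_index
    return labels
```

In practice the kept pixels would then feed the later denoising and fusion stages; HEAL additionally applies NIG uncertainty, which this sketch does not model.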

✅ Strengths

I find several aspects of this paper to be commendable. The core idea of a source-free unsupervised domain adaptation method that operates without any target-specific training is both practically relevant and technically interesting. The 'learning-free' characteristic, where the pre-trained model's parameters remain fixed, is a significant advantage in terms of computational efficiency and ease of deployment. This approach avoids the need for fine-tuning or self-training on the target domain, which can be computationally expensive and potentially introduce privacy concerns.

The proposed method, HEAL, integrates several innovative components, including hierarchical denoising, edge-guided selection, and size-aware fusion. The hierarchical denoising process, which combines entropy and NIG uncertainty measures, is a novel approach to refining pseudo-labels. The edge-guided selection mechanism, which uses structural consistency to choose the most reliable generated sample, is also a valuable contribution. The size-aware fusion technique, which dynamically combines the refined pseudo-labels and the selected generated sample, further enhances the segmentation performance.

The experimental results presented in the paper are compelling. The authors demonstrate the effectiveness of HEAL on three public datasets, including two brain tumor segmentation datasets and one polyp dataset. The results show that HEAL achieves competitive performance compared to existing SFUDA methods, indicating the practical utility of the proposed approach. The paper is generally well-organized and easy to follow, making it accessible to a broad audience. The authors provide a clear description of the proposed method and the experimental setup. The inclusion of ablation studies further strengthens the paper by demonstrating the contribution of each component of HEAL.
Overall, the paper presents a novel and effective approach to source-free unsupervised domain adaptation for medical image segmentation, with a focus on computational efficiency and ease of deployment. The combination of hierarchical denoising, edge-guided selection, and size-aware fusion represents a significant technical contribution, and the experimental results demonstrate the practical utility of the proposed method.
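To make the edge-guided selection idea concrete: one plausible reading is that each generated candidate's edge map is compared against the edges of the refined pseudo-label, and the best-matching candidate wins. The finite-difference edge extraction and Dice-overlap score below are my own illustrative choices, not the paper's exact mechanism.

```python
import numpy as np

def edge_map(mask):
    """Binary edge map of a segmentation mask via finite differences."""
    m = mask.astype(bool)
    edges = np.zeros_like(m)
    edges[:-1, :] |= m[:-1, :] ^ m[1:, :]   # vertical transitions
    edges[:, :-1] |= m[:, :-1] ^ m[:, 1:]   # horizontal transitions
    return edges

def dice(a, b, eps=1e-8):
    """Dice overlap between two binary maps."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def select_by_edge_consistency(candidates, pseudo_label):
    """Return the candidate whose edges best agree with the pseudo-label."""
    ref = edge_map(pseudo_label)
    scores = [dice(edge_map(c), ref) for c in candidates]
    return candidates[int(np.argmax(scores))]
```

Any structural-consistency score (e.g. boundary F-measure) could replace Dice here; the point is only that selection needs no gradient updates, which is what keeps the pipeline learning-free.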

❌ Weaknesses

Despite the strengths, I have identified several weaknesses that warrant careful consideration.

First, the paper's claim of being 'learning-free' is somewhat misleading. While the method does not update the *segmentation model's* parameters on the target domain, it relies on a pre-trained *diffusion model* to generate source-like samples. This diffusion model, while not explicitly trained on the target data, is a crucial component of the adaptation process, and the paper lacks details on how it is trained, specifically whether it is trained solely on source data or whether target data is used in any way. This ambiguity undermines the 'learning-free' claim and raises concerns about potential information leakage from the target domain. The paper states, "In HEAL, the model is exclusively pre-trained on the source domain, and no further training, fine-tuning, or parameter updates are performed during domain adaptation to the target domain." (Introduction). However, the method description states, "First, during the pre-training stage, we train a segmentation model Ms and a diffusion model using source domain data {Xs, Ys}, where Xs are the source domain data and Ys are the label." (Method). This discrepancy calls for clarification of the diffusion model's training process. This is a significant limitation, and I have high confidence in this assessment.

Second, the experimental evaluation is limited in scope. While the authors evaluate HEAL on three datasets, two of them (Kvasir-SEG and CVC-ClinicDB) are endoscopic datasets. Given that the paper targets medical image segmentation broadly, more diverse imaging modalities should be included; the absence of a commonly used CT-based dataset such as the LUNA16 challenge is a notable omission. The paper states, "We validate our method on the BraTS 2021 dataset Menze et al. (2014), Kvasir-SEG Jha et al. (2020), and CVC-ClinicDB Bernal et al. (2015)." (Experiments and Results). The evaluation is thus limited to MRI and endoscopic data, which restricts the generalizability of the findings. This is a significant limitation, and I have high confidence in this assessment.

Third, the paper lacks sufficient detail on the computational cost of the proposed method. While the authors claim that HEAL is computationally efficient due to its 'learning-free' nature, they provide no quantitative analysis, such as inference time or memory usage. The paper states, "By eliminating the need for such training, HEAL not only enhances computational efficiency and simplifies deployment..." (Introduction), but offers no specific metrics to support this claim, making it difficult to assess the method's practical efficiency. This is a significant limitation, and I have high confidence in this assessment.

Fourth, the ablation study is not comprehensive enough. While the authors demonstrate the contribution of each component of HEAL, they do not explore the impact of varying the diffusion model's hyperparameters. The paper states, "We used Med-DDPM Dorjsembe et al. (2024) with a noise schedule of 250 time steps t..." (Implementation Details), yet includes no experiments varying the number of time steps or other diffusion parameters, which could affect performance. This is a significant limitation, and I have high confidence in this assessment.

Finally, the paper does not adequately address the limitations of the proposed method. The authors acknowledge that the effectiveness of HEAL is linked to the generalization capability of the pre-trained segmentation model and that low-quality initial pseudo-labels can propagate errors, stating, "The effectiveness of HEAL is inherently linked to the generalization capability of the pre-trained segmentation model." (Limitations). However, they do not discuss potential failure cases or scenarios where the method might not perform well. This is a significant limitation, and I have high confidence in this assessment.
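The size-aware fusion step that the summary and the strengths discussion both mention can be illustrated with a minimal sketch. The paper's actual weighting rule is not reproduced in this review, so the foreground-fraction heuristic, the `small_thresh` cutoff, and the weights below are purely my assumptions.

```python
import numpy as np

def size_aware_fusion(pseudo, generated, small_thresh=0.01):
    """Fuse two soft masks with a weight keyed to target size.

    Sketch only: trusts the refined pseudo-label more for small targets
    (where generated samples tend to lose fine structure) and averages
    the two sources otherwise. pseudo / generated: (H, W) soft masks
    with values in [0, 1]. Returns a binary uint8 mask.
    """
    frac = (pseudo > 0.5).mean()              # foreground fraction
    w = 0.8 if frac < small_thresh else 0.5   # assumed weighting rule
    fused = w * pseudo + (1.0 - w) * generated
    return (fused > 0.5).astype(np.uint8)
```

Whatever the true rule is, making it explicit in the paper (and ablating `small_thresh`-style parameters) would address part of the ablation concern raised above.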

💡 Suggestions

Based on the identified weaknesses, I propose several concrete suggestions for improving the paper.

First, the authors should clarify the training process of the diffusion model: explicitly state whether it is trained solely on source data or whether target data is used in any way, and if target data is used, explain how this is done without violating the source-free constraint. This clarification is crucial for validating the 'learning-free' claim and addressing concerns about potential information leakage.

Second, the authors should expand the experimental evaluation to include more diverse medical imaging modalities, such as a commonly used CT-based dataset like the LUNA16 challenge, to demonstrate the generalizability of the proposed method. This would strengthen the paper's claims and make it more relevant to the broader medical imaging community.

Third, the authors should provide a detailed analysis of the computational cost of the proposed method, reporting metrics such as inference time and memory usage for different datasets and segmentation tasks. This would allow readers to assess the practical efficiency of the method and compare it with other SFUDA approaches.

Fourth, the authors should conduct a more comprehensive ablation study that varies the diffusion model's hyperparameters, exploring the impact of different numbers of time steps and other relevant settings on the performance of HEAL. This would clarify the method's sensitivity to these parameters and help identify optimal values.

Fifth, the authors should provide a more detailed discussion of the method's limitations, including potential failure cases and scenarios where it might not perform well, to give a more balanced view of its capabilities.

Finally, the authors should consider adding a section on ethical considerations, particularly regarding the use of medical data and the potential for bias in the model's predictions. This would demonstrate a commitment to responsible research practices. By addressing these suggestions, the authors can significantly strengthen the paper and make it more impactful.
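On the computational-cost suggestion, even a rough wall-clock and peak-memory report would substantiate the efficiency claim. A hedged sketch of such a measurement harness is below; `infer_fn` and `inputs` stand in for the adapted model's forward pass and a batch of target-domain images, and note that `tracemalloc` only tracks Python-heap allocations, not GPU memory.

```python
import time
import tracemalloc

def profile_inference(infer_fn, inputs, warmup=1):
    """Rough wall-clock and peak-memory profile of an inference callable.

    Runs a few warmup calls first so caches / lazy initialization do not
    pollute the timing, then times one pass over all inputs.
    """
    for _ in range(warmup):
        infer_fn(inputs[0])                 # warm up before timing
    tracemalloc.start()
    t0 = time.perf_counter()
    for x in inputs:
        infer_fn(x)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # Python-heap peak only
    tracemalloc.stop()
    return {"sec_per_image": elapsed / len(inputs),
            "peak_mem_bytes": peak}
```

Reporting `sec_per_image` per dataset, alongside the per-target-image diffusion sampling count, would let readers compare HEAL's inference-only cost against the fine-tuning cost of competing SFUDA methods.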

❓ Questions

I have several questions that I believe would help clarify some key aspects of the paper.

First, could the authors provide more details on the training process of the diffusion model? Specifically, is the diffusion model trained solely on source data, or is there any implicit use of target data? If target data is used, how is this done without violating the source-free constraint?

Second, what is the computational cost of the proposed method in terms of inference time and memory usage? How does this compare to other SFUDA methods?

Third, how sensitive is the performance of HEAL to the hyperparameters of the diffusion model, such as the number of time steps? What is the optimal setting for these parameters?

Fourth, what are the potential failure cases for HEAL? Are there specific types of images or segmentation tasks where the method might not perform well?

Fifth, how does the performance of HEAL compare to other state-of-the-art SFUDA methods on a wider range of medical imaging datasets, including CT scans?

Finally, what are the ethical considerations associated with the use of the proposed method, particularly regarding the use of medical data and the potential for bias in the model's predictions?

Addressing these questions would provide a more complete understanding of the proposed method and its limitations.

📊 Scores

Soundness: 2.5
Presentation: 2.75
Contribution: 2.5
Rating: 4.5
