2510.0018 Adaptive Evidential Meta-Learning with Hyper-Conditioned Priors for Calibrated ECG Personalisation v3

🎯 ICAIS2025 Accepted Paper

🎓 Meta Review & Human Decision

Decision:

Accept

Meta Review:

AI Review from DeepReviewer


📋 Summary

This paper introduces Adaptive Evidential Meta-Learning (AEML), a framework that enhances ECG model personalization by dynamically adjusting evidential priors to patient-specific statistics. The core idea is to attach a lightweight evidential head to a frozen ECG foundation model and condition the head's priors through a hypernetwork, enabling rapid adaptation to individual patients while keeping uncertainty estimates well calibrated, a critical requirement for clinical use. Training follows a two-stage meta-curriculum that begins with high-quality clinical data and progresses to noisy real-world data, improving robustness to domain shift. Conditioning the priors on robust statistics, specifically the median and median absolute deviation (MAD), mitigates the impact of outliers and noise in the few patient-specific samples. Comprehensive experiments across multiple ECG datasets show that AEML achieves higher accuracy, lower calibration error, and improved robustness than existing methods, and ablation studies confirm that the adaptive prior mechanism, the two-stage curriculum, and the robust statistics each contribute to performance. The framework's lightweight design and low computational cost make it suitable for real-time clinical deployment. The authors also analyse robustness under different noise conditions and varying signal-to-noise ratios (SNR), demonstrating tolerance to real-world ECG data quality variations. The combination of evidential deep learning, meta-learning, and hypernetworks tailored to ECG personalization is a notable contribution, and the findings suggest that AEML is a promising approach for ECG-based clinical decision-making wherever patient-specific adaptation and uncertainty quantification are essential.

✅ Strengths

The primary strength of this paper is its integration of evidential deep learning with meta-learning and hypernetworks, tailored specifically to ECG personalization. The combination is novel and addresses a critical need for uncertainty-aware predictions in clinical settings, where the reliability of model outputs is paramount. A lightweight evidential head, conditioned by a hypernetwork on robust patient-specific statistics (median and MAD), provides adaptive uncertainty quantification and well-calibrated estimates, both essential for building trust in AI-driven clinical decision support. The two-stage meta-curriculum, which trains first on high-quality clinical data and then on noisy real-world data, improves robustness to domain shift, a common challenge in clinical deployment, and helps the model generalize to diverse ECG data. The experimental validation is comprehensive, spanning multiple ECG datasets and metrics, including accuracy, Expected Calibration Error (ECE), and out-of-distribution (OOD) detection, and consistently shows higher accuracy, lower calibration error, and improved robustness than existing methods. The authors also analyse robustness under different noise conditions and varying signal-to-noise ratios (SNR), demonstrating tolerance to real-world ECG data quality variations. The frozen foundation model with a lightweight adaptation module keeps the method efficient enough for real-time deployment, a crucial property for time-critical healthcare applications. Ablation studies confirm the contribution of each component: the adaptive prior mechanism, the two-stage curriculum, and the robust statistics. Finally, the focus on practical applicability, particularly in resource-limited healthcare environments, adds to the paper's value.

❌ Weaknesses

While the paper presents a compelling approach, several weaknesses warrant careful consideration. First, the reliance on a 5-shot setting, with 5 samples per class from a single patient per task and a fallback to global class statistics when fewer than 3 samples are available, raises practicality concerns, yet the paper includes no experiments varying the number of shots, particularly below 3 samples per class. Because the hypernetwork depends heavily on the quality and representativeness of these few samples, the potential for performance degradation with scarcer data is left unaddressed, making it difficult to assess reliability in real-world clinical settings where patient data is often limited. Second, the analysis of robustness to noisy ECG data is not sufficiently detailed. The two-stage meta-curriculum targets noisy data, but the paper does not break down accuracy and ECE by specific noise type (e.g., baseline wander, muscle artifacts, electrode motion) or by noise level, and the effect of noise on the uncertainty estimates is discussed only qualitatively, without quantitative analysis. Third, the theoretical justification for conditioning priors on robust statistics (median, MAD) is limited: the ablation study shows they outperform mean/variance, but there is no formal analysis of why they are reliable in few-shot scenarios, nor a discussion of their limitations, such as sensitivity to the distribution of the data. This lack of theoretical backing weakens the claims about the reliability of the proposed approach. Fourth, the two-stage meta-curriculum assumes access to both high-quality and noisy data during training, which may not always be feasible in practice; the paper neither examines the impact of the ratio between the two data types nor explores alternative training strategies. Finally, the hypernetwork is underspecified: its architecture (number of layers, activation functions) and its training process beyond the general objective are not described, making it difficult to understand how the priors are dynamically adjusted from patient-specific data. In addition, the KL regularization weight is fixed at λ_KL = 0.1 without a sensitivity analysis, a justification for this value, or a discussion of whether it must be tuned per dataset. These weaknesses do not invalidate the paper's contributions, but they highlight areas where further investigation is needed to establish the practical applicability and robustness of the proposed approach.
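To make the robust-statistics concern concrete, here is a minimal sketch of the class-wise median/MAD conditioning the paper describes, including the global-statistics fallback for classes with fewer than 3 support samples. The function name and fallback interface are hypothetical; the paper does not specify its implementation.

```python
import numpy as np

def robust_class_stats(features, labels, n_classes, fallback=None):
    """Per-class median and MAD of support-set features.

    features: (n_samples, d) array of frozen-backbone embeddings.
    fallback: optional (n_classes, 2, d) array of global class
    statistics, used when a class has fewer than 3 support samples
    (the fallback behaviour the paper mentions but does not detail).
    """
    d = features.shape[1]
    stats = np.zeros((n_classes, 2, d))
    for c in range(n_classes):
        x = features[labels == c]
        if len(x) < 3 and fallback is not None:
            stats[c] = fallback[c]  # global-statistics fallback
            continue
        med = np.median(x, axis=0)
        mad = np.median(np.abs(x - med), axis=0)  # robust spread
        stats[c] = np.stack([med, mad])
    return stats
```

A shot-count sensitivity study of the kind requested above would simply sweep the number of rows per class fed into this computation and track downstream accuracy and ECE.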

💡 Suggestions

To address the identified weaknesses, several concrete improvements can be made. Firstly, to mitigate the limitations regarding the few-shot setting, the authors should conduct a more thorough investigation of the model's performance under varying numbers of patient-specific samples, including scenarios with fewer than 5 samples. This analysis should include a sensitivity study to determine how the model's accuracy and uncertainty estimates are affected by the scarcity of data. Specifically, experiments should be conducted with 1-shot, 3-shot, and potentially even zero-shot scenarios to establish a clear understanding of the model's behavior in data-scarce settings. Furthermore, the authors should explore techniques to mitigate the impact of limited data, such as incorporating data augmentation strategies or using semi-supervised learning methods to leverage unlabeled data. It would also be beneficial to compare the performance of their method against other few-shot learning techniques to better understand its relative strengths and weaknesses in data-scarce scenarios. This would provide a more comprehensive understanding of the model's practical applicability in clinical settings where patient data is often limited. Secondly, to improve the robustness of the framework against noisy ECG data, the authors should conduct a more detailed analysis of the model's performance under different types of noise and varying noise levels. This analysis should include a quantitative evaluation of how different noise types (e.g., baseline wander, muscle artifacts, electrode motion) affect the model's accuracy and uncertainty estimates. The authors should also explore techniques to enhance the model's noise robustness, such as incorporating noise-aware training strategies or using denoising techniques as a preprocessing step. 
Additionally, the paper should investigate how the uncertainty estimates correlate with the noise level, which would provide a more comprehensive understanding of the model's reliability in real-world scenarios. This would help to establish the model's practical utility in clinical environments where ECG data is often noisy and unreliable. Thirdly, to strengthen the theoretical justification for using robust statistics, the authors should provide a more rigorous analysis of the properties of median and MAD in the context of few-shot learning. This analysis should include a discussion of the potential limitations of these statistics, such as their sensitivity to the distribution of the data, and how these limitations might affect the model's performance. The authors should also explore alternative robust statistics and compare their performance with the proposed approach. Furthermore, the paper should provide a theoretical analysis of how these statistics are used to condition the priors and how this conditioning affects the uncertainty estimates. This would provide a more solid theoretical foundation for the proposed method and increase its credibility. Fourthly, to address the practical limitations of the two-stage meta-curriculum, the authors should explore alternative training strategies that do not require both clean and noisy data. This could involve techniques such as domain generalization or adversarial training to make the model more robust to domain shifts without relying on noisy data during training. Additionally, the paper should investigate the impact of the ratio of clean to noisy data on the model's performance and explore methods to optimize this ratio. This would provide a more practical and flexible approach for real-world scenarios where access to both types of data may not be feasible. 
Finally, the authors should provide a more detailed explanation of the hypernetwork architecture, including the number of layers, activation functions, and the specific parameters being generated for the evidential head. It would be beneficial to include a diagram illustrating the hypernetwork's structure and its interaction with the evidential head. Furthermore, the training process of the hypernetwork should be elaborated, detailing the loss function used, the optimization algorithm, and the learning rate schedule. This level of detail is necessary to fully understand how the priors are dynamically adjusted based on patient-specific data. Additionally, the authors should conduct a sensitivity analysis of the KL regularization weight (λ_KL) to determine its impact on the model's performance and to identify the optimal value for different datasets. This would provide a more comprehensive understanding of the model's behavior and its sensitivity to hyperparameter settings.
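For the noise-robustness experiments suggested above, a controlled protocol needs noise injected at an exact target SNR. A minimal sketch, assuming a simple additive noise model (the paper's actual corruption pipeline is not specified):

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the resulting signal-to-noise ratio equals
    `snr_db`, then add it to `signal` (1-D arrays of equal length).

    The same routine works for any noise template: Gaussian noise,
    a recorded baseline-wander trace, or a muscle-artifact segment.
    """
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Target noise power from SNR definition: P_s / P_n = 10^(snr/10)
    target_p_noise = p_signal / (10 ** (snr_db / 10))
    return signal + noise * np.sqrt(target_p_noise / p_noise)
```

Sweeping `snr_db` per noise type and recording accuracy and ECE at each level would produce exactly the breakdown this review asks for.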

❓ Questions

Several key questions arise from my analysis of this paper. Firstly, how does the model perform when fewer than 3 patient-specific samples are available? The paper focuses on 5-shot learning and mentions using global statistics for cases with fewer than 3 samples, but it does not provide any experimental results or analysis for such scenarios. Is there a threshold below which the model's uncertainty estimates become unreliable, and how does the model's performance degrade as the number of samples decreases? Secondly, how does the model handle different types of noise in ECG signals, and is there a threshold of noise level beyond which the model's performance degrades significantly? The paper mentions evaluating robustness under different noise conditions and SNR levels, but it lacks a detailed quantitative analysis of the impact of each specific noise type and level on performance metrics. What are the specific performance metrics (accuracy, ECE) for each noise type at different levels, and how do these noise levels affect the model's uncertainty estimates? Thirdly, can the authors provide a theoretical analysis or empirical evidence to support the reliability of robust statistics (median, MAD) for conditioning priors in few-shot scenarios? The paper uses median and MAD as robust statistics but does not provide a formal theoretical justification for their use in this specific context. What are the potential limitations of using median and MAD, such as their sensitivity to the distribution of the data, and how do these limitations affect the model's performance? Fourthly, how does the ratio of clean to noisy data in the two-stage meta-curriculum affect the model's performance? The paper does not specify the ratio of tasks from Stage 1 to Stage 2 used during training, nor does it provide any analysis on how this ratio affects performance. Is there an optimal ratio for achieving the best balance between accuracy and robustness, and how can this ratio be determined? 
Finally, how does the KL regularization weight (λ_KL) impact the model's performance, and is λ_KL = 0.1 optimal across all datasets? The paper states that λ_KL = 0.1 was used but does not provide a sensitivity analysis or justification for this specific value. Does this value need to be tuned for each dataset, and what is the impact of different values on the model's performance and uncertainty estimates?
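The λ_KL question above presumes a loss of the form "data term plus weighted KL to the hyper-generated prior". Since the paper's exact formulation is unclear (the ZGCA review below notes the KL term may be double-counted), here is a sketch of the single-KL variant that a sensitivity sweep would target, using the standard type-II-MLE evidential data term; function names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(alpha, beta):
    """KL( Dir(alpha) || Dir(beta) ) for 1-D concentration vectors."""
    a0, b0 = alpha.sum(), beta.sum()
    return (gammaln(a0) - gammaln(alpha).sum()
            - gammaln(b0) + gammaln(beta).sum()
            + np.dot(alpha - beta, digamma(alpha) - digamma(a0)))

def evidential_loss(alpha, y_onehot, alpha0, lam_kl=0.1):
    """Type-II MLE data term + lam_kl * KL to the generated prior.

    A single KL term; lam_kl = 0.1 mirrors the paper's stated weight,
    which the review asks to be sensitivity-tested per dataset.
    """
    s = alpha.sum()
    nll = np.dot(y_onehot, np.log(s) - np.log(alpha))
    return nll + lam_kl * dirichlet_kl(alpha, alpha0)
```

A sensitivity analysis would evaluate this loss (and downstream ECE) over a grid such as λ_KL ∈ {0.01, 0.05, 0.1, 0.5, 1.0} on each dataset.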

📊 Scores

Soundness: 2.75
Presentation: 2.75
Contribution: 2.75
Rating: 5.25

AI Review from ZGCA


📋 Summary

The paper proposes Adaptive Evidential Meta-Learning (AEML) for ECG personalization. A frozen ECG foundation model provides features; a lightweight evidential head outputs Dirichlet parameters for classification and uncertainty; and a hypernetwork conditions class-conditional priors from a few patient-specific samples using robust feature-space statistics (median and MAD) (Sec. 4.5–4.7; Eqs. 11–13). Training uses a two-stage meta-curriculum: Stage 1 on clean clinical tasks and Stage 2 on noisy, real-world variants (Sec. 4.6; Eq. 10), optimizing an evidential loss with KL regularization toward the hypernetwork-generated priors (Sec. 4.4; Eq. 6). Experiments across synthetic, clinical (MIT-BIH, CPSC2018), and wearable datasets compare against fine-tuning, LoRA, MAML, Proto/MatchNet, ECG-specific meta-learning, post-hoc calibration (temperature scaling, isotonic regression), and uncertainty baselines (MC Dropout, Ensembles) (Sec. 6.2). The method reports lower ECE, higher accuracy, improved OOD detection (using K / sum alpha as score), and better computational efficiency due to the frozen backbone (Sec. 6.4; Table 1). Limitations acknowledge the theoretical fragility of few-shot prior conditioning (Sec. 8).
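To fix intuition for the mechanics summarized above, here is a minimal sketch of how an evidential head's outputs yield both a prediction and the K / sum(alpha) vacuity score used for OOD detection. The softplus evidence mapping is a common convention, not confirmed by the paper.

```python
import numpy as np

def evidential_predict(logits):
    """Map evidential-head outputs to Dirichlet parameters,
    predictive probabilities, and an OOD (vacuity) score.

    alpha = evidence + 1, with softplus evidence (an assumption;
    the paper's exact mapping is not specified in the review).
    """
    evidence = np.logaddexp(0.0, logits)  # softplus, numerically stable
    alpha = evidence + 1.0
    s = alpha.sum()
    probs = alpha / s                     # predictive mean E[p]
    vacuity = len(alpha) / s              # K / sum(alpha): OOD score
    return probs, vacuity
```

High vacuity (little total evidence) flags out-of-distribution inputs; confident in-distribution inputs accumulate evidence and drive the score toward zero.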

✅ Strengths

  • Addresses a clinically important gap: uncertainty calibration in personalized ECG models (Intro; Sec. 1), with a design focused on real-time applicability (Sec. 4.9–4.10).
  • Novel integration of evidential deep learning with hypernetwork-conditioned, class-conditional priors derived from robust few-shot feature statistics (Sec. 4.5–4.7; Eqs. 11–13).
  • Two-stage meta-curriculum explicitly targets robustness to domain shift and noise (Sec. 4.6), with reported cross-domain improvements on wearable ECGs (Sec. 6.4; Fig. 3).
  • Broad and relevant baselines, patient-level splits, 5 seeds, and statistical testing with Bonferroni correction (Sec. 5; Sec. 6.2).
  • Demonstrated computational efficiency vs. full fine-tuning and LoRA (Table 1; Sec. 6.4) with a frozen backbone and lightweight adaptation modules.
  • Qualitative clinical insight: high uncertainty correlates with irregular R–R intervals, morphological variability, and degraded signal quality (Sec. 6.5).
  • Transparent limitations, including the risk that few-shot statistics can miscalibrate priors when samples are unrepresentative (Sec. 8).

❌ Weaknesses

  • Core theoretical fragility: conditioning priors from few-shot robust statistics is acknowledged as not theoretically justified and potentially miscalibrating (Sec. 8). This is critical for clinical reliability.
  • Potential loss inconsistency: L_evidential in Eq. (6) already includes a KL(Dir(alpha) || Dir(alpha0)) term, but Eq. (14) appears to add another KL term, possibly double-counting the regularizer. This needs clarification.
  • Hypernetwork/prior details are underspecified: input dimensionality and encoding of class-wise medians/MADs, architecture/parameter count of the hypernetwork, normalization to ensure valid and well-scaled Dirichlet priors, and how priors vary across classes are not sufficiently detailed (Sec. 4.5–4.7).
  • Evaluation reporting lacks quantitative depth in calibration and OOD: ECE uses equal-width binning (Sec. 5), which is known to be brittle; no adaptive/threshold-free calibration metrics (e.g., ACE, ECE variants) or sensitivity analyses are reported. OOD detection is described (K/sum alpha) but no AUROC/AUPR numbers or confidence intervals are provided (Sec. 5–6.4).
  • Dataset preprocessing/harmonization details are sparse: mapping of original labels to the five arrhythmia classes, lead configurations, sampling rates, and specifics of the wearable dataset(s) are not fully described (Sec. 5–6).
  • Inconsistencies and minor errors: the backbone is described as a CNN in Sec. 4.1 but feature extraction mentions recurrent layers (Sec. 4.2). Equation (10) has typographical issues and the noise curriculum schedule is described but not fully specified (Sec. 4.6).
  • Model selection based on lowest validation ECE (Sec. 5) may bias calibration comparisons; sensitivity to alternative selection criteria (e.g., accuracy, calibration-accuracy trade-off) is not discussed.
  • Reproducibility constraints: pretraining data for the foundation model (1.2M samples, 50k patients) may not be accessible; code/resources are not mentioned. Also, specific random seeds for main runs are not enumerated (Sec. 5).
  • Clinical practicality: the assumption of at least 3–5 labeled samples per class per patient (Sec. 4.7) may be unrealistic in practice; the fallback to “global class statistics” is not clearly defined (source, domain, and how it is computed without leakage).
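The prior-normalization concern raised above (valid, well-scaled Dirichlet priors) can be made concrete with a minimal sketch of one plausible output parameterization: a softplus to guarantee alpha0 > 0 plus a cap on prior strength. The single linear layer and all names here are hypothetical stand-ins for the unspecified hypernetwork.

```python
import numpy as np

def prior_from_stats(stats_vec, w, b, max_concentration=20.0):
    """Map a flattened [median; MAD] statistics vector to a valid
    Dirichlet prior alpha0 (one linear layer, for illustration only).

    softplus keeps alpha0 strictly positive; the cap bounds prior
    strength so a few unrepresentative support samples cannot
    dominate the resulting posterior.
    """
    raw = w @ stats_vec + b
    alpha0 = np.logaddexp(0.0, raw) + 1e-3  # softplus, strictly > 0
    return np.minimum(alpha0, max_concentration)
```

Reporting exactly this kind of detail (output mapping, positivity constraint, scale control) would answer the weakness above.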

❓ Questions

  • Loss definition: Eq. (6) includes a KL(Dir(alpha) || Dir(alpha0)) term, and Eq. (14) appears to add another KL term on top of L_evidential. Is this double-counting intentional? If so, how are the two KL terms weighted and what is the rationale?
  • Hypernetwork details: Please specify the exact architecture (layers, activations, parameter count), input encoding (how class-wise medians/MADs are concatenated/pooled; feature dimensionality), and the output parameterization of alpha0 (per-class vector? scalar per class?).
  • Prior constraints: How exactly do you ensure alpha0 > 0 and appropriate scaling? What normalization is applied (e.g., softplus + temperature scaling, or sum constraints)? Any sensitivity analysis on prior magnitude/scale?
  • Few-shot statistics: How sensitive is the method to mislabeled or unrepresentative few-shot examples? Can you report calibration and accuracy as a function of support set corruption or class imbalance beyond the at-least-3-samples assumption?
  • Fallback priors: For classes with fewer than 3 samples, you use global class statistics. From which data/domain are these computed (training only?), and how do you prevent leakage from target/test patients? Please detail the computation and storage of global statistics.
  • Backbone description: Sec. 4.1 describes a CNN with residual blocks and FC layers; Sec. 4.2 mentions recurrent layers. Which is correct in your implementation? If recurrent layers are used, please provide architecture and hyperparameters.
  • Calibration metrics: Beyond equal-width ECE (15 bins), can you report adaptive ECE/ACE and class-wise ECE, as well as calibration curves with confidence intervals? Do your conclusions hold under alternative binning schemes?
  • OOD detection: Please report quantitative OOD metrics (AUROC/AUPR, FPR@TPR95) across in-/out-of-distribution pairs. How do those compare to MSP and other evidential uncertainty measures (e.g., entropy, mutual information)?
  • Curriculum details: How is the noise schedule parameterized over training (per-epoch or per-task)? Does Stage 2 interleave clean/noisy tasks or exclusively noisy? Any evidence of catastrophic forgetting from Stage 1?
  • Model selection: You select models by lowest validation ECE. How sensitive are the results to selecting by accuracy or by a composite metric (e.g., ECE + accuracy)?
  • Data specifics: Please provide details on lead configuration, sampling rate, and label mapping for MIT-BIH and CPSC2018 to the five chosen classes. For wearable datasets, which devices and preprocessing steps are used?
  • Reproducibility: Will you release code, the pretrained foundation model, and scripts for computing robust statistics and priors? If the 1.2M-sample pretraining corpus is proprietary, can you release a model trained on public data?
  • Runtime/device constraints: Inference time is reported on an RTX 3080. What are latency and memory footprints on clinical edge devices (e.g., CPU-only or mobile GPUs)?

⚠️ Limitations

  • Theoretical: Conditioning priors via few-shot robust statistics lacks guarantees and can miscalibrate under unrepresentative support sets (Sec. 8).
  • Practical data needs: Requiring 3–5 labeled samples per class per patient (Sec. 4.7) may be unrealistic; obtaining such labels has clinical workflow costs.
  • Evaluation metrics: Equal-width ECE is known to be unstable; additional calibration metrics and sensitivity analyses would strengthen the claims.
  • OOD/generalization: OOD performance is discussed but not reported quantitatively; deployment to unseen devices/settings may yield different distributions of artifacts than simulated (Sec. 4.6).
  • Reproducibility: Reliance on a large pretrained foundation model may limit adoption unless weights or a public alternative are released.
  • Potential societal impacts: Miscalibration in high-stakes settings could lead to overconfidence or excessive deferrals; demographic biases in datasets (e.g., age, sex, device/clinical site) could harm underrepresented groups if not analyzed.
  • Privacy: Patient-specific statistics require careful handling to avoid leakage; deployment must consider data governance for per-patient adaptation.
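The equal-width-ECE instability noted in the limitations is cheap to check: compute ECE under both equal-width and equal-mass (quantile) binning and compare. A minimal sketch of both variants:

```python
import numpy as np

def ece(conf, correct, n_bins=15, adaptive=False):
    """Expected Calibration Error.

    Equal-width bins by default; adaptive=True uses equal-mass
    (quantile) bins, a standard robustness check against the
    brittleness of fixed-width binning.
    """
    if adaptive:
        edges = np.quantile(conf, np.linspace(0, 1, n_bins + 1))
    else:
        edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.searchsorted(edges, conf, side="right") - 1,
                  0, n_bins - 1)
    err, total = 0.0, len(conf)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # |mean confidence - empirical accuracy|, weighted by bin mass
            err += mask.sum() / total * abs(conf[mask].mean()
                                            - correct[mask].mean())
    return err
```

If the two variants disagree substantially on the same predictions, the paper's equal-width numbers should not be taken at face value.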

🖼️ Image Evaluation

Cross‑Modal Consistency: 16/50

Textual Logical Soundness: 12/30

Visual Aesthetics & Clarity: 12/20

Overall Score: 40/100

Detailed Evaluation (≤500 words):

1. Cross‑Modal Consistency

• Major 1: Reported accuracies conflict across text and figures. Evidence: Table 1 “Ours … Accuracy 90.1 ± 0.8” vs Fig. 1(a) and per‑dataset accuracy plots showing ≤0.60 (often ~0.3–0.4).

• Major 2: Core ECE comparison across methods is referenced but missing. Evidence: “Figure 6 presents ECE comparison across methods…” (Sec 6.4); no Fig. 6 provided.

• Major 3: Ablation contradicts claim. Evidence: Fig. 2(b) bars show Class‑Conditional ECE higher than Baseline for most datasets, while caption states it “significantly reduce[s] calibration error.”

• Major 4: Metric definition mismatch. Evidence: Sec 5: “conf = E[max_k p_k] = α_max/Σα_j” — E[max] ≠ max(E[p_k]); this affects all ECE numbers.

• Major 5: Loss term duplicated. Evidence: Eq. (6) already includes KL(Dir(α)||Dir(α0)); Eq. (14) adds “+ λ_KL · KL(Dir(α_i)||Dir(α0,i))” again.

• Minor 1: Architecture inconsistency. Evidence: Sec 4.1 “12‑layer convolutional neural network” vs Sec 4.2 “series of convolutional and recurrent layers.”

• Minor 2: Extra, unlabeled/supplementary plots (e.g., “Final ECE … (Baseline)” showing near‑zero ECE for most datasets) are not referenced and conflict with other figures.
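Major 4 above can be verified numerically: for Dirichlet-distributed probabilities, the expectation of the per-sample maximum strictly exceeds the maximum of the expectations (α_max/Σα), because max is convex (Jensen's inequality), so the two "confidence" definitions yield different ECE values. A short Monte Carlo check:

```python
import numpy as np

# Demonstrate E[max_k p_k] != max_k E[p_k] for a Dirichlet.
rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 1.0])
samples = rng.dirichlet(alpha, size=200_000)

e_max = samples.max(axis=1).mean()   # E[max_k p_k], Monte Carlo
max_e = alpha.max() / alpha.sum()    # max_k E[p_k] = alpha_max / sum(alpha)

print(round(max_e, 3))  # 0.5
print(e_max > max_e)    # True: the paper's formula underestimates confidence
```

Since the paper's stated formula computes max_e while calling it E[max p], every reported ECE number inherits this bias.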

2. Text Logic

• Major 1: Central claim of superior calibration lacks consistent, verifiable evidence. Evidence: Sec 6.4 claims “significantly lower calibration error (p<0.01)” but cross‑method figure is missing and ablation Fig. 2(b) suggests the opposite.

• Minor 1: Inference claims “single forward pass” while adaptation requires few‑shot statistics per patient; clarify per‑patient precomputation. Evidence: Sec 4.10 and Sec 4.9.

• Minor 2: Two‑stage curriculum described, but scheduling specifics (epochs/ratio/transition rule) absent. Evidence: Sec 4.6 provides ranges but no concrete schedule.

3. Figure Quality

• Major 1: Figure‑text identity/confusion in ablations; colors/legends suggest Baseline < Class‑Conditional, opposite to caption. Evidence: Fig. 2(b) bars and caption text in Fig. 2.

• Minor 1: Many small plots are duplicated/unindexed; numbering breaks flow (e.g., additional accuracy/loss panels without figure numbers).

• Minor 2: Some legends overlap content; small fonts on axis ticks in multi‑panel accuracy/loss plots hinder quick reading at print size.

Key strengths:

  • Clear problem motivation (ECG personalization with calibrated uncertainty).
  • Sensible architecture idea: hyper‑conditioned evidential priors from robust statistics.
  • Practical training/inference efficiency focus; Table 1 provides FLOPs and latency.

Key weaknesses:

  • Severe cross‑modal inconsistencies (metrics, figures, and claims).
  • Incorrect confidence definition for ECE; duplicated KL loss.
  • Missing core evidence (Fig. 6), contradictory ablation outcomes, and confusing supplemental plots.
  • Architectural description inconsistency (conv vs conv+RNN).

Recommendations:

  • Fix ECE confidence definition and recompute all calibration results.
  • Provide the missing cross‑method figures and harmonize numbers with Table 1.
  • Correct ablation captions or plots; ensure Class‑Conditional truly improves ECE.
  • Clarify architecture (remove RNN mention or specify exact layers) and loss formulation (single KL term).
  • Consolidate and renumber figures; add per‑figure legends/labels to pass a “figure‑alone” test.

📊 Scores

Originality: 3
Quality: 3
Clarity: 2
Significance: 3
Soundness: 3
Presentation: 2
Contribution: 3
Rating: 5

AI Review from SafeReviewer


📋 Summary

This paper introduces Adaptive Evidential Meta-Learning (AEML), a framework designed to enhance ECG model personalization by incorporating uncertainty quantification. The authors propose a method that leverages a pre-trained ECG foundation model, keeping its parameters frozen, and attaches a lightweight evidential head. This head outputs the parameters of a Dirichlet distribution, which is used to model both aleatoric and epistemic uncertainty. A key innovation is the use of a hypernetwork that generates the parameters of the evidential head's prior distribution, conditioned on robust, class-conditional statistics computed from a few patient-specific ECG samples. This allows the model to adapt to individual patient characteristics while maintaining computational efficiency.

The training process is structured as a two-stage meta-curriculum: the model first learns from high-quality clinical data and then adapts to noisy real-world data. The authors evaluate their approach on several datasets, including clinical, synthetic, and wearable ECG data, demonstrating improvements in Expected Calibration Error (ECE), accuracy, and out-of-distribution (OOD) detection over several baselines, including full fine-tuning, LoRA, and conventional meta-learning approaches.

The paper's main contribution is the integration of evidential learning with a hypernetwork conditioned on patient-specific statistics within a meta-learning framework tailored to ECG personalization. The results suggest the method provides well-calibrated uncertainty estimates while maintaining high accuracy, which is crucial for clinical applications, and a detailed efficiency analysis shows a good balance between performance and computational cost. The focus on uncertainty calibration addresses a critical gap in existing methods, and the two-stage meta-curriculum addresses the domain shift common in real-world clinical settings. Overall, the paper presents a well-motivated and empirically validated approach to ECG model personalization, with a focus on uncertainty awareness and computational efficiency.
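To make the mechanisms named in this summary concrete, here is a minimal illustrative sketch (my own, not the authors' code) of how a Dirichlet evidential head yields an epistemic-uncertainty signal, and how robust class-conditional median/MAD statistics might be computed from a few support samples. The `evidence + 1` concentration convention and the feature shapes are assumptions borrowed from common evidential deep learning practice, not details taken from the paper:

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    # Evidential head outputs non-negative evidence per class; a common EDL
    # convention sets the Dirichlet concentration alpha = evidence + 1.
    alpha = np.asarray(evidence, dtype=float) + 1.0
    strength = alpha.sum()
    probs = alpha / strength              # expected class probabilities
    k = alpha.size
    epistemic = k / strength              # "vacuity": high when evidence is scarce
    return probs, epistemic

def robust_class_stats(features, labels, num_classes):
    # Per-class median and MAD over a few support embeddings: the kind of
    # outlier-resistant conditioning statistic the paper describes feeding
    # to the hypernetwork (exact features and shapes here are assumptions).
    stats = []
    for c in range(num_classes):
        x = features[labels == c]
        med = np.median(x, axis=0)
        mad = np.median(np.abs(x - med), axis=0)
        stats.append(np.concatenate([med, mad]))
    return np.stack(stats)

probs, epistemic = dirichlet_uncertainty([9.0, 1.0, 0.0])
```

In this convention, scarce total evidence pushes `epistemic` toward 1, which is exactly the kind of scalar signal a threshold-based OOD detector can act on.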

✅ Strengths

I find several aspects of this paper to be particularly strong. The core idea of combining evidential learning with a hypernetwork conditioned on patient-specific statistics is a novel and promising approach to ECG model personalization. This integration allows for uncertainty-aware predictions while maintaining computational efficiency, a critical consideration for clinical applications. Generating priors for the evidential head from robust, class-conditional statistics is a clever way to adapt the model to individual patient characteristics using only a few samples, addressing a key limitation of methods that require large amounts of patient-specific data for fine-tuning.

The two-stage meta-curriculum training strategy is another strength. By first training on high-quality clinical data and then adapting to noisy real-world data, the model becomes more robust to domain shift, a common challenge in real-world ECG analysis. The empirical results are compelling: the authors demonstrate significant improvements in Expected Calibration Error (ECE), accuracy, and out-of-distribution (OOD) detection over several baselines, and these results hold across clinical, synthetic, and wearable ECG datasets, which strengthens the generalizability of the findings.

The computational efficiency analysis is a further positive: the method achieves a good balance between performance and cost, making it suitable for real-time clinical deployment, and the ablation studies demonstrate the contribution of each component of the framework. Finally, by focusing on uncertainty calibration, which is often overlooked in ECG model personalization, the paper addresses a critical gap; well-calibrated uncertainty estimates are crucial for building trustworthy machine learning systems in healthcare.

❌ Weaknesses

Despite these strengths, several weaknesses warrant attention. First, the evaluation is confined to a 5-shot adaptation scenario. The paper does not address settings with fewer than 5 samples, which are common in real-world clinical practice, nor does it analyze how the uncertainty estimates behave as the number of support samples shrinks and whether the model's confidence remains reliable in that regime. Relatedly, the impact of highly variable or noisy initial samples on adaptation is unexplored: the support samples may not be representative of a patient's overall ECG profile, and since the method conditions its priors on those samples, their quality could limit robustness in clinical settings.

Second, the OOD detection process is underspecified. The paper states that the detection threshold is determined on the validation set but gives no details of how it is set, and it does not clearly explain how the OOD score is computed or how it relates to the uncertainty estimates. The presentation of Figure 2 is also unclear: its role as an ablation study of the method's components is not obvious from the figure itself, and the specific comparisons being made deserve a fuller explanation in the text. The baselines are likewise under-described; without their architectures and training procedures, it is difficult to assess the novelty and contribution of the proposed method relative to them.

Third, the motivation and analysis could be deeper. The paper does not comprehensively lay out the specific challenges in ECG model personalization that motivated the method. It mentions a noise model but never analyzes how different noise types affect accuracy and calibration, even though robustness to noise is essential for practical deployment. The hypernetwork's size is reported, but its capacity is never varied, so its effect on adaptation across tasks and datasets is unknown. The computational efficiency analysis omits a direct comparison with other meta-learning approaches. Finally, results are not broken down by arrhythmia type or by patient population, so it is impossible to tell whether performance varies across abnormality classes or patient groups.

💡 Suggestions

To address the identified weaknesses, I recommend several concrete improvements. First, the authors should investigate how the uncertainty estimates behave in low-data regimes, evaluating the model with 1, 2, and 3 samples per class and reporting not only Expected Calibration Error (ECE) but also calibration curves and reliability diagrams for a more granular view of the model's confidence. They should also probe sensitivity to the specific samples chosen for adaptation, for example by measuring how uncertainty estimates vary across different sample combinations from the same patient. To address the representativeness concern, they should run experiments with deliberately challenging support sets, such as manually selected samples with atypical ECG morphology or samples with synthetic noise injected, and check whether uncertainty rises appropriately for such inputs and whether it correlates with actual prediction error. Finally, they should explore mitigations for unrepresentative initial samples, such as statistics more robust than the median for prior computation, or a mechanism that detects and down-weights unreliable support samples during adaptation.
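The ECE and reliability-diagram analysis suggested above can be made concrete with the standard equal-width-bin estimator, sketched here; the 10-bin equal-width scheme is a common convention, not a detail taken from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Standard equal-width binning:
    # ECE = sum_b (|B_b| / N) * |accuracy(B_b) - mean_confidence(B_b)|
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

Reporting this per support-set size (1, 2, 3, 5 shots) alongside the binned accuracy/confidence pairs would give exactly the calibration curves the suggestion calls for.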
Second, the authors should describe the OOD detection process in full: which datasets are used for OOD evaluation and why, how the detection threshold is determined on the validation set, and how well that threshold generalizes to unseen data. Figure 2 would benefit from annotations or a more detailed textual walkthrough of the comparisons being made. The baseline methods should be described in more detail, including their architectures and training procedures, to allow a fair comparison with the proposed method. The paper should also give a fuller account of the specific challenges in ECG model personalization that motivated the work, which would better contextualize its contribution. The noise analysis should be extended to compare performance under different noise types and to test OOD detection under noisy conditions. The effect of hypernetwork capacity should be studied by varying its architecture and size, to identify an optimal configuration. Finally, the computational cost analysis should add a direct comparison with other meta-learning methods, covering training time, inference time, and memory usage.
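One concrete convention the authors could adopt when clarifying the thresholding step, offered purely as an illustration since the paper's actual procedure is what needs specifying, is to set the threshold at the uncertainty quantile that accepts a fixed fraction (e.g. 95%) of in-distribution validation samples:

```python
import numpy as np

def ood_threshold_at_tpr(id_uncertainties, target_tpr=0.95):
    # Choose the threshold so that target_tpr of in-distribution validation
    # samples fall at or below it; anything above it is flagged as OOD.
    return float(np.quantile(np.asarray(id_uncertainties, dtype=float), target_tpr))

def flag_ood(uncertainties, threshold):
    # Boolean mask: True for samples whose uncertainty exceeds the threshold.
    return np.asarray(uncertainties, dtype=float) > threshold
```

Under this convention the natural companion metric is the false-positive rate on a held-out OOD set at the fixed 95% in-distribution acceptance rate, which makes the threshold's generalization to unseen data directly measurable.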
Finally, the authors should break down both performance and uncertainty estimates by ECG abnormality type and by patient subgroup. This would reveal where the model is strong or weak across arrhythmia classes and whether its behavior, including the reliability of its uncertainty estimates, varies across patient populations.

❓ Questions

Based on my analysis, I have several questions that I believe are important for further clarification:

1. How does the model perform when only 1 or 2 samples are available for adaptation, and how do the uncertainty estimates behave in these low-data scenarios? Practical applicability depends on performing well with limited data.
2. How does the model respond to initial samples that are highly variable or noisy, and does the uncertainty increase appropriately in these cases?
3. What specific method is used to set the OOD detection threshold, and how does that threshold generalize to unseen data?
4. What is the impact of different noise types on performance, and how does OOD detection change under noisy conditions?
5. What is the impact of the hypernetwork's capacity on the results, and what is the optimal configuration?
6. How does the computational cost compare to other meta-learning approaches, and what are the trade-offs between performance and cost?
7. Does performance vary across different types of ECG abnormalities?
8. Does performance vary across different patient populations, and are there potential biases? This bears directly on the model's generalizability.

📊 Scores

Soundness: 2.5
Presentation: 2.25
Contribution: 2.5
Rating: 5.5
