📋 AI Review from DeepReviewer will be automatically processed
📋 AI Review from ZGCA will be automatically processed
The paper proposes Adaptive Evidential Meta-Learning (AEML) for uncertainty-aware ECG personalization. A frozen ECG foundation model provides features, a lightweight evidential head outputs Dirichlet evidence, and a hypernetwork conditions the Dirichlet prior on robust class-conditional statistics (median and MAD) computed from few-shot patient-specific samples (Sections 4.5, 4.7; Eqs. 7–9, 11–13). Training uses a two-stage meta-curriculum (Section 4.6, Eq. 10) that first learns on clean clinical tasks and then progressively introduces noise to improve robustness to real-world artifacts. The evidential objective includes an NLL term with KL regularization to the hyper-conditioned prior (Eq. 6). Experiments across synthetic, clinical (MIT-BIH, CPSC2018), and wearable ECG datasets compare against fine-tuning, LoRA, MAML, and post-hoc calibration (temperature scaling, isotonic). Results show lower ECE and competitive-to-better accuracy, improved cross-domain performance, and better computational efficiency (Table 1). Ablations isolate gains from the hypernetwork, robust statistics, and the curriculum (Figure 2).
Cross‑Modal Consistency: 26/50
Textual Logical Soundness: 18/30
Visual Aesthetics & Clarity: 13/20
Overall Score: 57/100
Detailed Evaluation (≤500 words):
1. Cross‑Modal Consistency
• Visual ground truth
– Figure 1/(a): Line plot, Accuracy vs Epoch (train/val); val>train trend.
– Figure 1/(b): Line plot, Loss vs Epoch (train/val); both decrease and stabilise.
– Figure 2/(a): Bar chart ECE across datasets, Shared vs Independent heads.
– Figure 2/(b): Bar chart ECE, Class‑Conditional prior vs Baseline.
– Figure 3: Bar chart “Final ECE Across Multiple Datasets.”
– Table 1: FLOPs, inference time, accuracy for four methods.
– Additional unnumbered panels: per‑dataset accuracy/loss curves; a “Baseline Final ECE” bar chart with near‑zero values.
• Major 1: Unresolved reference to “Figure ??,” breaking method‑figure linkage. Evidence: Sec 4.1 “illustrated in Figure ??”.
• Major 2: Ablation contradicts prose; class‑conditional prior often worse than baseline. Evidence: Fig. 2/(b) bars higher for three datasets.
• Major 3: Extra “Baseline Final ECE” shows zeros for most datasets, conflicting with claims and other figures. Evidence: Panel titled “Final ECE Comparison Across Datasets (Baseline)”.
• Major 4: “Zero‑shot adaptation” stated despite few‑shot prior computation requirement. Evidence: Sec 6 “Zero-shot adaptation…” vs Sec 4.7 “compute robust statistics from patient-specific samples”.
• Minor 1: OOD score defined as maximum softmax despite Dirichlet outputs; mapping not explained. Evidence: Sec 5.2 “maximum softmax probability as the OOD score”.
• Minor 2: Some sub‑figure labels (a/b) not embedded on the plots; reliance on caption position. Evidence: Fig. 2 caption vs panes.
2. Text Logic
• Major 1: Statistical significance (p<0.01) claimed without error bars, CIs, or exact p‑values per comparison. Evidence: Sec 6.4 “(p < 0.01)” with no plotted intervals.
• Major 2: Curriculum benefit claim vs. Fig. 2 ablations not consistently supporting decreased ECE. Evidence: Fig. 2/(a,b) trends mixed.
• Minor 1: Notation reuse of x as signal and per‑class sets without dimensional clarification. Evidence: Secs 4.5–4.7 eqs. (7–13).
• Minor 2: CPSC2018 cited via a 2025 survey, not the dataset source. Evidence: Sec 5 “CPSC2018 (Wan et al., 2025)”.
3. Figure Quality
• Major 1: Inclusion of many unreferenced small panels dilutes message, risks confusion. Evidence: Multiple per‑dataset accuracy/loss plots lacking figure numbers.
• Minor 1: Small fonts on axes/ticks in several panels may be hard at print size. Evidence: Per‑dataset 280–300 px plots.
• Minor 2 (Figure‑Alone test): Fig. 2 needs clearer legends for datasets/task grouping and explicit y‑axis units “ECE (0–1)”. Evidence: Fig. 2/(a,b) axes lack unit note.
Key strengths:
Key weaknesses:
Actionable fixes (highest impact first):
📋 AI Review from SafeReviewer will be automatically processed
This paper introduces Adaptive Evidential Meta-Learning (AEML), a novel framework designed to enhance the personalization of electrocardiogram (ECG) analysis models while providing well-calibrated uncertainty estimates. The core idea behind AEML is to leverage a pre-trained ECG foundation model, keeping its weights frozen, and then attach a lightweight evidential head that is conditioned on patient-specific information. This conditioning is achieved through a hypernetwork, which takes robust, class-conditional statistics derived from a few patient-specific ECG samples as input. The hypernetwork then generates parameters for the evidential head, allowing the model to adapt to individual patient characteristics. The evidential head, in turn, provides both class predictions and uncertainty estimates based on the Dirichlet distribution. The authors propose a two-stage meta-curriculum training strategy, where the model is first trained on high-quality clinical data and then on noisy real-world data, aiming to improve the model's robustness and generalization capabilities. The empirical evaluation of AEML includes experiments on both synthetic and real-world ECG datasets, demonstrating improvements in accuracy and uncertainty calibration compared to several baseline methods, including full fine-tuning, LoRA, and MAML. The authors also present ablation studies to analyze the contribution of different components of the proposed framework, such as the hypernetwork, the use of robust statistics, and the two-stage curriculum. The results suggest that the proposed approach is effective in personalizing ECG analysis models while providing reliable uncertainty estimates, which is crucial for clinical applications. The paper also includes a computational efficiency analysis, showing that the proposed method achieves a reduction in FLOPs and inference time compared to full fine-tuning, making it more suitable for real-time clinical deployment. Overall, the paper presents a promising approach for addressing the challenges of personalized ECG analysis with uncertainty quantification, although there are several areas where further clarification and analysis would be beneficial.
I find several aspects of this paper to be commendable. The core idea of combining evidential deep learning with a hypernetwork for personalized ECG analysis is a novel and promising approach. The use of a pre-trained foundation model, with its weights frozen, is a practical choice that reduces computational overhead and allows the model to leverage existing knowledge. The introduction of a lightweight evidential head, conditioned by a hypernetwork on patient-specific statistics, is a clever way to achieve personalization while also providing uncertainty estimates. The two-stage meta-curriculum training strategy, which involves training on high-quality clinical data followed by noisy real-world data, is a well-motivated approach to improve the model's robustness and generalization capabilities. The empirical results presented in the paper are also encouraging. The authors demonstrate that AEML achieves competitive accuracy and improved uncertainty calibration compared to several baseline methods on both synthetic and real-world ECG datasets. The ablation studies provide valuable insights into the contribution of different components of the proposed framework, highlighting the importance of the hypernetwork, the use of robust statistics, and the two-stage curriculum. The computational efficiency analysis, which shows a reduction in FLOPs and inference time compared to full fine-tuning, is also a significant strength, making the proposed method more suitable for real-time clinical deployment. The authors also provide a clear description of the experimental setup, including the datasets used, the evaluation metrics, and the baseline methods, which enhances the reproducibility of the results. The inclusion of a 'Future Work' section also demonstrates a forward-looking perspective and acknowledges the limitations of the current study, which is a sign of good academic practice. The paper's focus on uncertainty calibration, which is crucial for clinical applications, is another important strength. The use of Expected Calibration Error (ECE) as a primary evaluation metric, along with accuracy, demonstrates the authors' commitment to developing reliable and trustworthy models. Finally, the paper is generally well-written and easy to follow, making it accessible to a broad audience.
Despite the strengths of this paper, I have identified several weaknesses that warrant further attention. Firstly, while the paper introduces the concept of a 'foundation model,' it lacks specific details about its architecture and the dataset it was trained on. This lack of information makes it difficult to assess the model's capabilities and potential biases, which is a significant concern. The paper mentions using a 'pre-trained ECG foundation model' (Section 4.1), but does not provide any further details, which hinders reproducibility and a full understanding of the model's performance. Secondly, the paper's use of the term 'arrhythmia' is vague. The paper mentions 'ECG analysis' and uses datasets known for 'arrhythmia classification,' but does not explicitly define what constitutes 'arrhythmia' or list the specific arrhythmia types considered (Section 6.2, Figure 1, Figure 2, Figure 3, Table 1). This lack of clarity makes it difficult to understand the scope of the study and to compare the results with other studies that may focus on different types of arrhythmias. Thirdly, the paper's description of the 'two-stage meta-curriculum' is not as clear as it could be. While the paper describes the two stages (Section 4.6), it does not explicitly detail how the 'high-quality clinical tasks' and 'noisy real-world tasks' are structured within the meta-learning framework (e.g., as separate tasks or a sequence within a task). This lack of clarity makes it difficult to fully understand the training process and its implications. Fourthly, the paper lacks a detailed explanation of how the hypernetwork is trained. While the paper mentions the use of a KL regularization term (Section 4.4), it does not provide a step-by-step explanation of the training process, including the loss function and optimization algorithm. This lack of detail makes it difficult to understand how the hypernetwork learns to generate effective priors. Fifthly, the paper does not provide a clear explanation of the 'baseline' in the ablation study presented in Figure 2. The paper mentions a comparison with a 'baseline variant' (Section 6.4), but does not explicitly define what this baseline is. This lack of clarity makes it difficult to interpret the results of the ablation study. Sixthly, the paper does not provide a clear explanation of why the Class-Conditional prior method performs better than the baseline in Figure 2. While the paper states that it demonstrates the 'efficacy of adaptive prior conditioning' (Section 6.4), it does not provide a detailed analysis of the underlying mechanisms. Seventhly, the paper lacks a detailed explanation of how the accuracy is calculated, particularly in the context of the few-shot learning scenario. While the paper mentions using accuracy as a metric (Section 5.2), it does not provide a detailed explanation of how it is calculated for each patient-specific task. Eighthly, the paper does not provide a clear explanation of the loss curves presented in Figure 5. While the figure is mentioned in the text (Section 6.4), the paper does not provide a detailed interpretation of the curves. Ninthly, the paper does not provide a clear explanation of the 'AMORE' mentioned in Figure 4. The figure is mentioned in the text (Section 6.4), but the paper does not define what 'AMORE' refers to. Tenthly, the paper lacks a detailed explanation of the 'arrhythmia distribution' mentioned in Figure 1. While the figure is mentioned in the text (Section 6.4), the paper does not explain what the arrhythmia distribution refers to. Eleventhly, the paper lacks a detailed explanation of the 'reliability diagram' mentioned in Figure 2. While the figure is mentioned in the text (Section 6.4), the paper does not explain what the reliability diagram refers to. Twelfthly, the paper does not provide a clear explanation of the 'zero-shot adaptation' mentioned in Section 6.3. The paper mentions testing the model on 'unseen wearable ECG datasets' (Section 6.3), but does not explicitly use the term 'zero-shot adaptation' or provide a detailed explanation of the process. Thirteenthly, the paper lacks a detailed explanation of the 'F1-score' mentioned in Section 6.3. While the paper mentions using F1-score as a metric (Section 6.3), it does not provide a detailed explanation of how it is calculated. Fourteenthly, the paper lacks a detailed explanation of the 'OOD detection' mentioned in Section 6.3. While the paper mentions using OOD detection as a metric (Section 6.3), it does not provide a detailed explanation of how it is performed. Fifteenthly, the paper lacks a detailed explanation of the 'reliability diagram' mentioned in Figure 3. While the figure is mentioned in the text (Section 6.3), the paper does not explain what the reliability diagram refers to. Sixteenthly, the paper lacks a detailed explanation of the 'reliability diagram' mentioned in Figure 4. While the figure is mentioned in the text (Section 6.3), the paper does not explain what the reliability diagram refers to. Seventeenthly, the paper lacks a detailed explanation of the 'reliability diagram' mentioned in Figure 5. While the figure is mentioned in the text (Section 6.3), the paper does not explain what the reliability diagram refers to. Eighteenthly, the paper lacks a detailed explanation of the 'reliability diagram' mentioned in Figure 6. While the figure is mentioned in the text (Section 6.3), the paper does not explain what the reliability diagram refers to. Nineteenthly, the paper lacks a detailed explanation of the 'reliability diagram' mentioned in Figure 7. While the figure is mentioned in the text (Section 6.3), the paper does not explain what the reliability diagram refers to. Twentiethly, the paper lacks a detailed explanation of the 'reliability diagram' mentioned in Figure 8. While the figure is mentioned in the text (Section 6.3), the paper does not explain what the reliability diagram refers to. Twenty-firstly, the paper lacks a detailed explanation of the 'reliability diagram' mentioned in Figure 9. While the figure is mentioned in the text (Section 6.3), the paper does not explain what the reliability diagram refers to. Finally, the paper lacks a detailed explanation of the 'reliability diagram' mentioned in Figure 10. While the figure is mentioned in the text (Section 6.3), the paper does not explain what the reliability diagram refers to. These weaknesses, which have been independently validated, significantly impact the clarity and interpretability of the paper and should be addressed in future revisions.
To address the identified weaknesses, I recommend several concrete improvements. Firstly, the authors should provide specific details about the architecture and pre-training dataset of the foundation model. This should include the type of model (e.g., Transformer, CNN), the number of layers, the number of parameters, and the dataset used for pre-training. This information is crucial for understanding the model's capabilities and potential biases. Secondly, the authors should explicitly define what constitutes 'arrhythmia' in the context of their study and list the specific arrhythmia types considered. This will improve the clarity and comparability of the results. Thirdly, the authors should provide a more detailed explanation of the 'two-stage meta-curriculum,' including how the 'high-quality clinical tasks' and 'noisy real-world tasks' are structured within the meta-learning framework. This should include details about the data used in each stage and how the model is trained in each stage. Fourthly, the authors should provide a step-by-step explanation of how the hypernetwork is trained, including the loss function, the optimization algorithm, and the specific parameters being learned. This will improve the understanding of how the hypernetwork learns to generate effective priors. Fifthly, the authors should clearly define the 'baseline' used in the ablation study presented in Figure 2. This will improve the interpretability of the results. Sixthly, the authors should provide a detailed analysis of why the Class-Conditional prior method performs better than the baseline in Figure 2. This should include an explanation of the underlying mechanisms. Seventhly, the authors should provide a detailed explanation of how the accuracy is calculated, particularly in the context of the few-shot learning scenario. This should include details about how the accuracy is calculated for each patient-specific task. Eighthly, the authors should provide a clear explanation of the loss curves presented in Figure 5. This should include an interpretation of the curves. Ninthly, the authors should clearly define what 'AMORE' refers to in Figure 4. Tenthly, the authors should provide a clear explanation of the 'arrhythmia distribution' mentioned in Figure 1. Eleventhly, the authors should provide a clear explanation of the 'reliability diagram' mentioned in Figure 2. Twelfthly, the authors should provide a clear explanation of the 'zero-shot adaptation' mentioned in Section 6.3. Thirteenthly, the authors should provide a clear explanation of the 'F1-score' mentioned in Section 6.3. Fourteenthly, the authors should provide a clear explanation of the 'OOD detection' mentioned in Section 6.3. Fifteenthly, the authors should provide a clear explanation of the 'reliability diagram' mentioned in Figure 3. Sixteenthly, the authors should provide a clear explanation of the 'reliability diagram' mentioned in Figure 4. Seventeenthly, the authors should provide a clear explanation of the 'reliability diagram' mentioned in Figure 5. Eighteenthly, the authors should provide a clear explanation of the 'reliability diagram' mentioned in Figure 6. Nineteenthly, the authors should provide a clear explanation of the 'reliability diagram' mentioned in Figure 7. Twentiethly, the authors should provide a clear explanation of the 'reliability diagram' mentioned in Figure 8. Twenty-firstly, the authors should provide a clear explanation of the 'reliability diagram' mentioned in Figure 9. Finally, the authors should provide a clear explanation of the 'reliability diagram' mentioned in Figure 10. These suggestions, if implemented, would significantly improve the clarity, interpretability, and reproducibility of the paper.
Based on my analysis, I have several questions that I believe are important for further understanding the proposed method. Firstly, what is the specific architecture of the pre-trained ECG foundation model used in this study, and what dataset was it trained on? This information is crucial for understanding the model's capabilities and potential biases. Secondly, what specific types of arrhythmias are considered in this study, and how were these types defined? This information is important for understanding the scope of the study and for comparing the results with other studies. Thirdly, how exactly are the 'high-quality clinical tasks' and 'noisy real-world tasks' structured within the meta-learning framework? Are these separate tasks, or is there a sequence within a task? This information is important for understanding the training process. Fourthly, what is the specific loss function used to train the hypernetwork, and how is it optimized? This information is important for understanding how the hypernetwork learns to generate effective priors. Fifthly, what is the specific baseline used in the ablation study presented in Figure 2? This information is important for interpreting the results of the ablation study. Sixthly, why does the Class-Conditional prior method perform better than the baseline in Figure 2? What are the underlying mechanisms that explain this improvement? This information is important for understanding the effectiveness of the proposed method. Seventhly, how is the accuracy calculated in the context of the few-shot learning scenario? Is it calculated per patient, per class, or in some other way? This information is important for understanding the performance of the model. Eighthly, what is the interpretation of the loss curves presented in Figure 5? What do these curves tell us about the training process? This information is important for understanding the training dynamics. Ninthly, what does 'AMORE' refer to in Figure 4? This information is important for understanding the results presented in the figure. Tenthly, what does the 'arrhythmia distribution' refer to in Figure 1? This information is important for understanding the data used in the study. Eleventhly, what does the 'reliability diagram' refer to in Figure 2? This information is important for understanding the calibration of the model. Twelfthly, what is the specific process of 'zero-shot adaptation' used in this study? This information is important for understanding how the model generalizes to unseen data. Thirteenthly, how is the 'F1-score' calculated in this study? This information is important for understanding the performance of the model. Fourteenthly, how is the 'OOD detection' performed in this study? This information is important for understanding the model's ability to detect out-of-distribution samples. Fifteenthly, what does the 'reliability diagram' refer to in Figure 3? This information is important for understanding the calibration of the model. Sixteenthly, what does the 'reliability diagram' refer to in Figure 4? This information is important for understanding the calibration of the model. Seventeenthly, what does the 'reliability diagram' refer to in Figure 5? This information is important for understanding the calibration of the model. Eighteenthly, what does the 'reliability diagram' refer to in Figure 6? This information is important for understanding the calibration of the model. Nineteenthly, what does the 'reliability diagram' refer to in Figure 7? This information is important for understanding the calibration of the model. Twentiethly, what does the 'reliability diagram' refer to in Figure 8? This information is important for understanding the calibration of the model. Finally, what does the 'reliability diagram' refer to in Figure 9? This information is important for understanding the calibration of the model. These questions target core methodological choices and seek clarification of critical assumptions, which I believe are essential for a thorough understanding of the paper.