This paper explores gradient noise injection as a regularization technique for training small language models, adding Gaussian noise to the gradients during training under an exponentially decaying noise schedule. The authors argue that this balances exploration and stability: larger noise early in training encourages exploration of the loss landscape, while the decaying magnitude permits stable convergence later on. The core idea revisits and extends gradient noise injection, first explored in earlier studies, and adapts it to modern small language models.

The experiments, conducted on several text corpora, compare models trained with the proposed method against baselines trained without noise injection. The authors evaluate exponential, linear, and adaptive decay schedules, varying the initial noise level and decay rate to study their impact on training dynamics and final performance, and find that exponential decay provides the best balance between exploration and convergence stability. The results section presents validation loss curves and final validation loss values, suggesting that models trained with gradient noise injection achieve lower validation loss than the baselines, and on this basis the authors claim improvements in training dynamics, validation loss, and final performance.

The paper also states that the method introduces negligible computational overhead and integrates seamlessly into standard training pipelines. However, it offers no detailed quantitative analysis of that cost, and the reported results come without statistical significance tests or error bars. The paper's contribution lies in adapting gradient noise injection to small language models and exploring different noise schedules, but the experimental evaluation and analysis would need to be more rigorous to fully support the claims made.
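The mechanism under review is simple enough to sketch. The snippet below illustrates Gaussian gradient noise with an exponential decay schedule; the function names, parameter values (`sigma0`, `decay`), and the plain-NumPy formulation are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def noise_scale(step, sigma0=0.01, decay=1e-3):
    # Exponentially decaying noise standard deviation:
    # sigma_t = sigma0 * exp(-decay * t). Values are illustrative.
    return sigma0 * np.exp(-decay * step)

def inject_gradient_noise(grad, step, rng, sigma0=0.01, decay=1e-3):
    # Add zero-mean Gaussian noise to a gradient array before the
    # optimizer update; the noise shrinks as training progresses.
    sigma = noise_scale(step, sigma0, decay)
    return grad + rng.normal(0.0, sigma, size=grad.shape)

rng = np.random.default_rng(0)
grad = np.zeros(4)  # stand-in for a parameter gradient
noisy_grad = inject_gradient_noise(grad, step=0, rng=rng)
```

Because the perturbation is a single elementwise addition per gradient, the claim of negligible overhead is plausible in principle, though the review rightly notes the paper does not quantify it.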