This paper introduces a method for enhancing the training of small language models (SLMs) by injecting noise into the gradients during the backward pass. The core idea is to add Gaussian noise, with a variance that follows an exponentially decaying schedule, to the gradients before the optimizer updates the model parameters. The authors argue that this noise encourages exploration of the parameter space and prevents the model from converging prematurely to sharp local minima, which can lead to overfitting. The method is presented as a lightweight regularization technique requiring minimal modifications to existing training pipelines.

The authors conduct experiments on three character-level language modeling datasets: Shakespeare, enwik8, and text8. They compare models trained with the proposed gradient noise injection against baselines trained without any explicit regularization, using training and validation loss as the primary evaluation metrics. The results show that the proposed method achieves lower training and validation losses than the baseline in some cases, most notably on the Shakespeare dataset; on enwik8 and text8, the gains are less pronounced and not consistent. The paper also discusses the computational overhead of the method, claiming it is negligible, and the authors release code and trained models for reproducibility.

While the paper presents a simple and potentially useful technique, it lacks a strong theoretical foundation and a comprehensive empirical evaluation. Its main contribution is the empirical demonstration of gradient noise injection for small language models, but the absence of theoretical analysis and the limited scope of the experiments raise questions about the method's generalizability and robustness.
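For concreteness, the mechanism under review can be sketched as follows. This is a minimal illustration, not the authors' code: the exact schedule constants (`sigma0`, `decay`) and the per-parameter application are assumptions, since the review does not specify them.

```python
import numpy as np

def noisy_gradient(grad, step, sigma0=0.01, decay=1e-3, rng=None):
    """Add Gaussian noise with an exponentially decaying standard deviation.

    Assumed schedule (illustrative): sigma_t = sigma0 * exp(-decay * step).
    Called on each parameter's gradient after the backward pass and
    before the optimizer update.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = sigma0 * np.exp(-decay * step)
    return grad + rng.normal(0.0, sigma, size=grad.shape)
```

In a typical training loop, this would be applied to every parameter gradient at each step, so early training sees substantial noise (exploration) while late training sees almost none (convergence).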
The paper also lacks a detailed comparison with other regularization techniques, such as dropout or weight decay, which limits our understanding of its relative advantages and disadvantages. Overall, the paper presents an interesting approach to training small language models, but further investigation is needed to fully understand its potential and limitations.