This paper introduces PhysGym, a benchmark suite and simulation platform for evaluating the scientific reasoning capabilities of large language models (LLMs) in interactive physics environments. Its core innovation is systematic control over the amount of prior knowledge provided to the LLM agent, enabling a fine-grained analysis of how priors affect performance. The benchmark comprises 97 physics problems spanning six fundamental domains, including mechanics, electricity, and optics. Each problem is an interactive environment in which the agent proposes experiments, observes the simulated results, and iteratively refines hypotheses about the underlying physical laws; the platform exposes a structured interface for running experiments and submitting candidate governing equations (a schematic sketch of such an interaction loop is given below).

The authors evaluate several representative open-source and proprietary LLMs and show that PhysGym differentiates model capabilities across prior-knowledge levels and task complexities. Empirically, performance generally degrades as prior knowledge is withheld, underscoring how heavily current LLMs rely on priors for scientific reasoning. The models also fail to use prior knowledge consistently across contexts, and on more complex tasks priors can even hurt performance. The study further examines task difficulty, operationalized as equation length and variable count, and provides a detailed analysis of the strengths and weaknesses of the individual models.

Overall, the work addresses a real gap in the evaluation of AI systems for scientific discovery: by combining interactive experimentation with controlled prior knowledge, PhysGym enables a more fine-grained assessment of LLMs' scientific reasoning and offers a useful tool for developing more robust and reliable AI systems for scientific research.
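To make the interaction model concrete, the following is a minimal, self-contained sketch of what such an agent-environment loop could look like, using a toy pendulum problem. All names here (PhysicsEnv, PriorLevel, run_experiment, score_hypothesis), the three prior-knowledge tiers, and the scoring rule are illustrative assumptions for exposition, not the actual PhysGym API or problem set.

```python
# Hypothetical sketch of the interactive discovery loop described in the paper:
# an agent proposes experiments, observes simulated results, and submits a
# hypothesis for the governing equation. All names and the prior-knowledge
# tiers below are illustrative assumptions, not PhysGym's actual interface.

from dataclasses import dataclass
from enum import Enum
from typing import Callable


class PriorLevel(Enum):
    """Amount of prior knowledge exposed to the agent (assumed three tiers)."""
    FULL = "full"        # problem context, variable semantics, and units
    PARTIAL = "partial"  # anonymized variables, minimal context
    NONE = "none"        # raw input/output access only


@dataclass
class Observation:
    inputs: dict[str, float]   # experiment settings chosen by the agent
    outputs: dict[str, float]  # simulated measurements returned by the env


class PhysicsEnv:
    """Toy stand-in for one interactive problem (here: a pendulum period law)."""

    def __init__(self, prior: PriorLevel, budget: int = 20):
        self.prior = prior
        self.budget = budget

    def run_experiment(self, length_m: float) -> Observation:
        """Simulate one trial; each call consumes part of the experiment budget."""
        assert self.budget > 0, "experiment budget exhausted"
        self.budget -= 1
        period = 2 * 3.141592653589793 * (length_m / 9.81) ** 0.5
        return Observation(inputs={"L": length_m}, outputs={"T": period})

    def score_hypothesis(self, hypothesis: Callable[[float], float]) -> float:
        """Mean relative error of the proposed law on held-out settings."""
        test_lengths = [0.3, 0.7, 1.5, 2.4]
        errors = []
        for length in test_lengths:
            truth = 2 * 3.141592653589793 * (length / 9.81) ** 0.5
            errors.append(abs(hypothesis(length) - truth) / truth)
        return sum(errors) / len(errors)


if __name__ == "__main__":
    env = PhysicsEnv(prior=PriorLevel.NONE)
    # An agent (here a stub instead of an LLM) gathers a few data points ...
    data = [env.run_experiment(L) for L in (0.5, 1.0, 2.0)]
    # ... and submits the candidate law T = 2*pi*sqrt(L/g).
    error = env.score_hypothesis(lambda L: 6.2832 * (L / 9.81) ** 0.5)
    print(f"collected {len(data)} observations, hypothesis error = {error:.4f}")
```

In the benchmark itself, an LLM agent would take the place of the stubbed data-collection and hypothesis steps, and its candidate equation would be scored against the hidden governing law; varying the prior level would change how much context about the system the agent sees before experimenting.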