2510.0089 BasketVision: Benchmarking MLLMs' Grasp of Complex Dynamic Systems v1

🎯 ICAIS2025 Accepted Paper

🎓 Meta Review & Human Decision

Decision:

Spotlight Accept

Meta Review:

AI Review from DeepReviewer


📋 Summary

This paper introduces BasketVision, a novel benchmark designed to evaluate how well Multimodal Large Language Models (MLLMs) understand complex dynamic systems, using professional basketball as its domain. The authors argue that basketball, with its structured rules, multi-agent interactions, and dynamic spatial and temporal elements, provides an ideal testbed for assessing MLLMs' reasoning abilities. The core contribution is a large-scale dataset of 6,000 curated questions across seven dimensions of perception, reasoning, and prediction: scene comprehension, object detection, spatial localization, event analysis, context understanding, tracking and trajectory analysis, and reasoning and strategy analysis. An automated data generation pipeline, combining court recognition, perspective transformation, and player tracking, enables the creation of spatially grounded questions at scale and is itself a significant technical contribution. The benchmark uses both image and video formats, increasing its complexity and relevance to real-world scenarios. Evaluating 23 state-of-the-art MLLMs on BasketVision reveals a substantial performance gap between human experts and the best-performing model, underscoring the limitations of current MLLMs in spatial reasoning and in handling complex dynamic visual environments. The work is significant because it moves beyond static images and generic video content to assess models in a structured, rule-governed environment, yielding a more rigorous evaluation of MLLM capabilities.
The benchmark offers the research community a challenging and comprehensive testbed for future advances, though the focus on a single domain raises questions about how well the findings generalize to other complex dynamic systems. Overall, the paper makes a valuable contribution by exposing the limitations of current MLLMs in understanding complex dynamic systems and by providing a solid foundation for future research in this area.

✅ Strengths

The primary strength of this paper is BasketVision itself: a novel benchmark that evaluates MLLMs in a structured, rule-governed environment rather than on static images or generic video content. The choice of basketball as a microcosm is well justified given the sport's inherent complexity: multiple agents, spatial and temporal dynamics, and explicit rules. The automated data generation pipeline, covering court recognition, perspective transformation, and player tracking, is another notable strength. It is scalable, ensures fine-grained precision, and its ability to generate spatially grounded questions at scale is essential for building datasets large enough to evaluate models in complex dynamic systems. The benchmark's coverage is comprehensive, spanning scene comprehension, object detection, spatial localization, event analysis, context understanding, tracking and trajectory analysis, and reasoning and strategy analysis, so it assesses not only perception but also reasoning and prediction in dynamic environments. The inclusion of both image and video formats further enhances the benchmark's complexity and real-world relevance. The extensive evaluation of 23 state-of-the-art MLLMs is also a strength: it reveals a significant performance gap between human experts and the best-performing model, underscoring the need for further research and giving a clear direction for future work.
The analysis of results is insightful, identifying specific areas where MLLMs struggle, such as spatial reasoning and understanding complex interactions. While the single-domain focus raises generalizability questions, it also permits a controlled, detailed analysis of MLLM capabilities within that domain. Overall, the paper's strengths lie in its novel benchmark, its automated data generation pipeline, its comprehensive evaluation, and its insightful analysis of the results.

❌ Weaknesses

While the paper presents a valuable contribution with the BasketVision benchmark, several weaknesses warrant careful consideration. The most significant limitation is the exclusive focus on professional basketball as a microcosm for complex dynamic systems. Although the authors justify this choice by basketball's structured nature, the findings may not generalize to dynamic systems with different rules, physics, or agent behaviors: the specific spatial relationships and temporal dynamics of basketball may not transfer to domains like autonomous driving or robotic manipulation, where physical constraints and agent interactions differ substantially. The paper offers no evidence of how models evaluated on BasketVision perform on other dynamic systems; the experimental design and analysis are entirely basketball-specific, with no attempt to validate the findings in other contexts. This is a high-confidence concern that directly limits the broader applicability of the research. A second significant weakness is the dataset's reliance on professional sports broadcasts, which feature a narrow set of standardized camera angles. Model performance may vary considerably on viewpoints absent from the dataset, such as player-worn cameras or amateur footage, and this lack of perspective diversity could yield models overfit to broadcast viewpoints that fail to generalize to more varied real-world scenarios.
This is a critical limitation, since real-world dynamic systems involve a wide range of camera angles and visual conditions; the data source is explicitly stated to be professional broadcasts, and the paper does not discuss incorporating diverse camera angles (high confidence). Third, the paper acknowledges that probabilistic generative models often struggle with numerical precision, particularly on tasks requiring multi-step reasoning. The hybrid assessment strategy mitigates this, but the risk of numerical hallucination in open-ended questions remains: multiple-choice formats sidestep precision issues without fully capturing complex numerical reasoning, while open-ended questions can elicit plausible but incorrect numerical answers that skew the evaluation results. This is a medium-confidence concern, since the hybrid approach only partially addresses the problem. Finally, the paper lacks several analyses that would make the benchmark more interpretable (all high-confidence concerns):

  • No breakdown of questions by reasoning type (e.g., numerical, spatial, commonsense) within the seven dimensions, making it difficult to identify which reasoning capabilities the benchmark actually evaluates.
  • No categorization of questions by difficulty level, limiting the benchmark's ability to evaluate models across a range of complexities.
  • No analysis of the number of reasoning steps required per question, limiting assessment of multi-step reasoning abilities.
  • No analysis of question and answer lengths, which could provide insight into task complexity.
  • No characterization of the visual data's diversity, such as frames per video segment, frame resolution, and other properties, making it difficult to assess coverage of varied visual conditions.

In summary, the weaknesses stem from the limited scope of the benchmark, the lack of diversity in the data, and the absence of detailed analyses of the questions and visual data, which together limit the generalizability of the findings and the comprehensiveness of the evaluation.

💡 Suggestions

To address the identified weaknesses, I recommend several concrete improvements. First, to mitigate domain specificity, the authors should extend the benchmark beyond basketball by adapting the data generation pipeline to other environments, such as simulated robotic manipulation or autonomous driving scenarios; this would require new algorithms for object detection, tracking, and spatial reasoning in those contexts. Synthetic data generation could further diversify scenarios and reduce reliance on real-world footage, and evaluating the transferability of models from BasketVision to existing benchmarks in other domains (or new benchmarks created for them) would clarify how general the learned representations are. Second, to reduce viewpoint and data bias, the benchmark should incorporate a wider variety of camera angles and visual conditions: footage from player-worn cameras, drones, and other non-standard viewpoints, or synthetic augmentation that simulates different angles and conditions. Domain adaptation techniques, training on diverse data and then fine-tuning on the target domain, could also improve robustness, and the impact of video quality and resolution on model performance deserves explicit study. Third, to further address numerical hallucination, alternative evaluation strategies for open-ended questions are worth exploring: combining automated checks of numerical answers against ground truth with human assessment of the reasoning process, using chain-of-thought prompting to elicit more detailed and transparent reasoning, and adopting numerical metrics that are less sensitive to minor variations in answers. This would allow a more nuanced evaluation of numerical reasoning and help identify the specific error types models are prone to. Fourth, the authors should analyze questions by the type of reasoning required (e.g., numerical, spatial, rule-based, strategic, commonsense), difficulty level, number of reasoning steps, and question/answer length, quantifying the distribution across each category to give a more granular picture of the benchmark's composition.
Fifth, the diversity of the visual data should be documented, including frames per video segment, frame resolution, and other characteristics, to clarify which types of visual information the benchmark relies on and the specific challenges it poses to MLLMs. Finally, releasing the data and code would allow other researchers to build on this work and contribute to the development of more robust MLLMs. Together, these changes would substantially enhance the value and impact of the BasketVision benchmark.
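
The fourth suggestion (categorizing questions by reasoning type and length) could start from a simple heuristic tagger before investing in human annotation. A minimal sketch follows; the categories, keyword lists, and example questions are illustrative assumptions, not the paper's taxonomy:

```python
# Hypothetical sketch: tag benchmark questions by reasoning type and measure
# question length. Categories and keywords are illustrative assumptions only;
# a real analysis would validate tags with human annotators.
from collections import Counter

REASONING_KEYWORDS = {
    "numerical": ["how many", "count", "score", "number of"],
    "spatial": ["where", "position", "left", "right", "closest", "zone"],
    "temporal": ["before", "after", "when", "sequence"],
    "strategic": ["strategy", "tactic", "why", "defense", "offense"],
}

def tag_question(question: str) -> list[str]:
    """Return all reasoning types whose keywords appear in the question."""
    q = question.lower()
    tags = [rtype for rtype, kws in REASONING_KEYWORDS.items()
            if any(kw in q for kw in kws)]
    return tags or ["other"]

def profile(questions: list[str]) -> tuple[Counter, float]:
    """Distribution of reasoning tags and mean question length in words."""
    tag_counts = Counter(t for q in questions for t in tag_question(q))
    mean_len = sum(len(q.split()) for q in questions) / len(questions)
    return tag_counts, mean_len

questions = [
    "How many players are in the paint?",
    "Where is the ball handler positioned relative to the three-point line?",
    "Why did the defense switch to a zone after the timeout?",
]
counts, mean_len = profile(questions)
print(counts, round(mean_len, 1))
```

Even this crude profile would reveal whether, say, spatial questions dominate a dimension, and would make the "distribution by reasoning type" analysis cheap to report.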

❓ Questions

Based on my analysis, several questions are crucial for understanding the paper's contributions and limitations:

  • How does MLLM performance on BasketVision correlate with performance on benchmarks of other types of reasoning and understanding? Is there evidence that improvements on BasketVision translate to other domains?
  • The dataset is curated from professional basketball games. How do models fare on less structured or amateur-level footage, where rules are followed less consistently and visual quality is lower?
  • What are the limitations of the automated data generation pipeline, how might it be extended to domains beyond basketball, and does the automation introduce any biases?
  • Given the large gap between human experts and MLLMs, what are the most promising research directions to bridge it? Are there specific architectural changes or training strategies likely to yield more human-like understanding of dynamic systems?
  • How do camera angles and video quality affect model performance on the benchmark? Is there a minimum quality threshold below which performance degrades sharply?
  • What is the distribution of questions across the seven evaluation dimensions, and why are some dimensions more challenging for MLLMs than others?
  • What is the distribution of questions by type of reasoning required (e.g., numerical, spatial, commonsense)?
  • What is the distribution of questions by difficulty level, and are particular question types especially challenging for MLLMs?
  • What is the distribution of questions by number of reasoning steps, and which types require multi-step reasoning?
  • What is the distribution of question and answer lengths, and which question types demand longer, more detailed responses?

These questions aim to clarify key uncertainties and assumptions in the paper and to guide future research in this area.

📊 Scores

Soundness: 3.0
Presentation: 3.0
Contribution: 3.0
Rating: 5.5

AI Review from ZGCA


📋 Summary

The paper introduces BasketVision, a multimodal benchmark to evaluate MLLMs' understanding of complex dynamic systems using professional basketball as a structured microcosm. The benchmark comprises 6,000 curated, bilingual questions spanning seven capability dimensions (Scene Comprehension, Object Detection, Spatial Localization, Event Analysis, Context Understanding, Tracking & Trajectory Analysis, Reasoning & Strategy Analysis) with both image and video inputs (Table 1; Fig. 2). The authors propose an automated data generation pipeline integrating court recognition and homography (Alg. 1–2; Fig. 3–4), player detection/ReID (Alg. 3; Fig. 5), and projection onto a standard 2D court (Alg. 4). Data are drawn from NBA/CBA playoffs and multiple Olympic Games (Sec. 3.3), yielding 2,400 images and 1,200 video segments, and over 2,300 minutes of gameplay. The evaluation spans 23 models (proprietary/open-source; Sec. 4.1), with a human expert baseline. Results (Table 4) show a large gap between human experts (96.34%) and top model GPT-4o (63.15%), and identify Spatial Localization as the most challenging dimension (top score 53.86%). The paper analyzes task specialization and argues for architectural limitations in spatial reasoning.
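
The perspective-transformation step described above (projecting broadcast-frame positions onto the standard 2D court via a homography, per Alg. 2/4) can be sketched as follows. The homography matrix and player positions below are made-up illustrative values, not the paper's calibration:

```python
# Illustrative sketch of homography-based projection, the core of the
# pipeline's perspective-transformation step. H is a made-up example
# (a pure pixel-to-meter scaling), not the paper's estimated homography.
import numpy as np

def project_points(H: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Map Nx2 image points to court coordinates via a 3x3 homography H."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])   # to homogeneous coords
    mapped = homog @ H.T                               # apply homography
    return mapped[:, :2] / mapped[:, 2:3]              # de-homogenize

H = np.diag([0.05, 0.05, 1.0])                         # toy mapping: 0.05 m/px
feet_positions = np.array([[400.0, 600.0], [820.0, 540.0]])  # player feet (px)
court_xy = project_points(H, feet_positions)           # -> [[20, 30], [41, 27]]
print(court_xy)
```

A real pipeline would estimate H from detected court keypoints (e.g., with a RANSAC-based fit) rather than assume a fixed scale, which is exactly why the reviewers below ask for reprojection-error statistics.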

✅ Strengths

  • Timely and clear problem formulation: evaluating MLLMs on complex, rule-governed dynamic systems rather than generic or static content (Sec. 1–2).
  • Well-scoped benchmark with seven dimensions targeting perception, spatio-temporal reasoning, and high-level strategy (Table 1; Fig. 2).
  • Automated pipeline that makes spatially grounded question generation feasible at scale (Algs. 1–4; Figs. 3–5), described with sufficient algorithmic detail to be replicable.
  • Broad and systematic model evaluation across 23 systems, with a human expert baseline (Sec. 4.1; Table 4).
  • Clear, actionable insights: persistent spatial reasoning bottleneck, specialization patterns, and a large human–machine gap (Sec. 4.2; Table 4).
  • Bilingual design, potentially facilitating cross-lingual evaluation.
  • Practical evaluation design choice (MC vs SA) to mitigate numerical hallucination while preserving depth (Sec. 3.1.2).

❌ Weaknesses

  • No statistical significance reporting or uncertainty estimates (confidence intervals/variance) for model comparisons or task-level insights (Sec. 4.2); this limits the conclusiveness of some claims.
  • Open-ended scoring uses a rubric and two annotators (Sec. 4.1) but lacks inter-annotator agreement metrics (e.g., Cohen’s kappa/Krippendorff’s alpha) and examples of borderline cases.
  • Reproducibility and impact hinge on resource release, but the paper does not state whether/when code, data, and evaluation scripts will be released, nor does it address licensing for broadcast footage (Sec. 3.3).
  • Potential evaluation bias: for models without native video support, 1 FPS key-frame sampling (Sec. 4.1) may be too sparse; no sensitivity analysis is reported to show conclusions are robust to sampling.
  • No quantitative quality audit of the pipeline stages (e.g., court detection accuracy against ground truth, homography reprojection error, ReID IDF1/MOTA), leaving question quality partly unquantified (Sec. 3.4).
  • Domain specificity: generalization beyond professional basketball is discussed but not empirically probed (Sec. 4.3).
  • Limited transparency on prompt templates, decoding settings, and normalization of answer formats, which can affect fair comparison across proprietary and open-source models.
  • Bilingual aspect is not analyzed: no per-language breakdown or evidence that items are semantically equivalent across languages.
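
The first weakness above (no uncertainty estimates) could be addressed with a simple nonparametric bootstrap over per-question correctness. A minimal sketch, using a synthetic correctness vector rather than the paper's actual results:

```python
# Sketch of percentile-bootstrap confidence intervals for model accuracy,
# as the statistical-reporting weakness suggests. The 0/1 correctness
# vector below is synthetic, not data from the paper.
import random

def bootstrap_ci(correct: list[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for mean accuracy over per-question 0/1 scores."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic example: a model answering 63% of 600 questions correctly.
scores = [1] * 378 + [0] * 222
lo, hi = bootstrap_ci(scores)
print(f"accuracy 0.630, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Reporting such intervals per model and per task would show whether, e.g., the gap between the top two models exceeds sampling noise.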

❓ Questions

  • Will you release the dataset (images/videos/questions/answers), code for the pipeline (Algs. 1–4), and evaluation scripts, and under what licenses? Please clarify how broadcast footage licensing is handled.
  • Can you report inter-annotator agreement (e.g., Cohen’s kappa/Krippendorff’s alpha) for Scene Comprehension and Reasoning & Strategy Analysis, and provide examples of disagreements and resolution criteria?
  • Please add statistical reporting: confidence intervals or bootstrapped CIs per model and per task, and significance tests for key comparisons (e.g., GPT-4o vs second-best model; task-level gaps).
  • How robust are the conclusions to the frame sampling rate? Can you report a sensitivity study varying FPS (e.g., 0.5, 1, 2, 4 FPS) for several representative models?
  • What quantitative QC do you have for the CV pipeline? For instance, homography reprojection error distributions, court keypoint detection accuracy, ReID metrics (IDF1/MOTA), and effects of these errors on question generation.
  • Do you control for answer-only biases? For MC items, have you tested models on text-only variants of choices/questions to ensure visual grounding?
  • Please provide per-language performance (EN vs ZH), and confirm bilingual equivalence of items (if parallelized) or describe differences in content if not perfectly parallel.
  • How were prompts and decoding settings standardized across models (especially proprietary APIs)? Can you provide the exact templates and temperature/top-p settings?
  • Human baseline: you mention 3 experts and 13 undergraduates. Please report the undergraduates’ performance and variance, and the number of human annotations per question type.
  • Some tasks (e.g., Spatial Localization) might depend on ball visibility. Is the ball tracked/used in question generation? If not, could this confound tasks like Event Analysis? Please clarify.
  • Did you consider potential training data leakage (e.g., popular NBA plays in web-scale corpora)? Any steps to reduce overlap (e.g., time-based splits, less iconic games)?
  • Can you add one small cross-domain test (e.g., a different sport or non-broadcast viewpoints) to probe generalization of your claims about spatial reasoning bottlenecks?
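
The inter-annotator agreement requested above is cheap to compute once both annotators' labels are available. A self-contained Cohen's kappa sketch, with made-up rater labels for illustration:

```python
# Sketch of Cohen's kappa between two annotators' open-ended scores.
# The label sequences are made-up illustrative data, not the paper's.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa between two raters' equal-length label sequences."""
    assert len(a) == len(b) and a, "need two equal, non-empty sequences"
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n       # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)          # chance-corrected

rater1 = ["good", "good", "bad", "good", "bad", "good"]
rater2 = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(rater1, rater2), 3))
```

For more than two annotators or ordinal rubric scores, Krippendorff's alpha (also mentioned above) would be the natural generalization.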

⚠️ Limitations

  • Domain specificity: conclusions may not generalize to other dynamic systems (acknowledged in Sec. 4.3). Extending to other sports or agent-based domains would bolster generality.
  • Viewpoint bias from broadcast camera angles (acknowledged in Sec. 4.3); robustness to egocentric or other unconventional camera views is untested.
  • Evaluation bias due to 1 FPS sampling for video-incompatible models could disadvantage systems that rely on motion cues; a sensitivity analysis is needed.
  • Open-ended scoring subjectivity: rubric is described but lacks agreement metrics and calibration examples, which may introduce grading variance.
  • Potential licensing/IP issues tied to broadcast footage; public release must address rights, de-identification if needed, and clear usage constraints.
  • Possible misuse risks: enhanced automated analysis of player tactics could be repurposed for surveillance or gambling edge; mitigation includes usage policies, dataset license restrictions, and ethical guidelines.
  • No quantitative audits of the data generation pipeline’s error rates; downstream question fidelity may be sensitive to homography/ReID errors.
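
The last limitation above (no quantitative pipeline audits) could be addressed with a reprojection-error report for the homography stage. A small sketch; the estimated homography and point correspondences below are synthetic examples:

```python
# Sketch of a reprojection-error audit for the homography stage. H_est and
# the image/court point correspondences are synthetic, for illustration only.
import numpy as np

def reprojection_errors(H: np.ndarray, img_pts: np.ndarray,
                        court_pts: np.ndarray) -> np.ndarray:
    """Per-point Euclidean error between projected and reference court points."""
    homog = np.hstack([img_pts, np.ones((len(img_pts), 1))])
    proj = homog @ H.T
    proj = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(proj - court_pts, axis=1)

# Synthetic "ground truth": an exact 0.05 m/px mapping, and an estimate
# that is 2% off in x, to show how the audit surfaces calibration error.
img_pts = np.array([[100.0, 200.0], [500.0, 200.0], [500.0, 700.0]])
court_pts = img_pts * 0.05
H_est = np.diag([0.051, 0.05, 1.0])
errs = reprojection_errors(H_est, img_pts, court_pts)
print(errs.mean(), errs.max())
```

Reporting the distribution of such errors (plus court-keypoint accuracy and ReID IDF1/MOTA) would quantify how much pipeline noise propagates into the generated questions.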

🖼️ Image Evaluation

Cross‑Modal Consistency: 28/50

Textual Logical Soundness: 20/30

Visual Aesthetics & Clarity: 14/20

Overall Score: 62/100

Detailed Evaluation (≤500 words):

Image‑First Understanding (visual ground truth)

• Figure 1/(a)(b)(c): Three‑pane pipeline. (a) data sourcing/QA; (b) automated modules (court recognition→projection→tracking→trajectory); (c) data check→manual annotation/validation→“6000 QA pairs”. Flow with icons, arrows, green/orange/blue palettes.

• Figure 2: Single composite showing seven task families; examples include film‑strip frames, MC options, short answers; colored boxes/headings.

• Figure 3: Broadcast frame with detected lines/keypoints/bboxes overlaid; scoreboard/shot‑clock visible; turquoise court polylines.

• Figure 4: Mosaic of Olympics clips; top: original broadcasts; bottom: top‑down projections; no axis/legend.

• Figure 5: Two rows of detections (bboxes+poses) and right‑side 2D court scatter plots of projected player positions.

Synopsis: Figs 1–5 illustrate the pipeline (detect→project→track) and qualitative examples/projections used to generate and evaluate QA pairs across tasks.

1. Cross‑Modal Consistency

• Major 1: Table reference mismatch. Evidence: Sec 3.5.1 “Table 2 provides a detailed statistical breakdown.” → Table 2 actually lists model accuracies, not dataset stats.

• Major 2: Dataset size inconsistency. Evidence: Sec 3.3 “dataset comprises 6,000 high‑quality images…1,200 video segments”; Table 3 shows “Image: 2400, Video: 1200”.

• Major 3: Figure naming inconsistency. Evidence: Fig. 2 title panel uses “BasketBench” while paper’s benchmark is BasketVision.

• Major 4: Redundant/confusing tables. Evidence: Sec 4.2 “Table 4: Detailed performance…”; but Table 2 contains the same per‑model accuracy matrix.

• Minor 1: Task naming drift. Evidence: Fig. 2 uses “Video‑Context Understanding”; Table 1 uses “Context Understanding”.

• Minor 2: Fig. 4 lacks per‑pane year labels or top/bottom annotations; text mentions four Olympics.

2. Text Logic

• Major 1: None blocking the main results chain (pipeline→benchmark→evaluation).

• Minor 1: No inter‑annotator agreement reported for open‑ended scoring. Evidence: Sec 4.1 rubric described, but κ/α not given.

• Minor 2: Claims of “fine‑grained precision” lack quantitative QC metrics (e.g., homography error, ReID accuracy). Evidence: Sec 3.5.1/3.4.5 narrative only.

3. Figure Quality

• Major 1: No Major issues found.

• Minor 1: Small fonts within Figs 1–2 (box texts, options) are likely illegible at print size.

• Minor 2 (Figure‑Alone Test): Fig. 4/5 need legends: label top vs. projection; add team‑color/ID legend and coordinate axes; Fig. 3 should mark detected keypoints/lines with a legend.

Key strengths:

• Clear end‑to‑end pipeline with homography‑based grounding and ReID.

• Comprehensive, domain‑structured evaluation across seven dimensions with bilingual QA.

• Results analysis ties task specialization to architectural limits; several claims numerically supported (e.g., GPT‑4o 63.15%, SL bottleneck 53.86%).

Key weaknesses:

• Serious figure–text/table mismatches (Tables 2/3/4; Fig. 2 naming).

• Dataset scale inconsistencies reduce credibility.

• Missing quantitative QC for pipeline accuracy; limited figure legends hinder stand‑alone comprehension.

📊 Scores

Originality: 3
Quality: 3
Clarity: 3
Significance: 3
Soundness: 3
Presentation: 3
Contribution: 3
Rating: 7

AI Review from SafeReviewer


📋 Summary

This paper introduces BasketVision, a novel benchmark designed to evaluate the capabilities of Multimodal Large Language Models (MLLMs) in understanding complex dynamic systems, using professional basketball as a testbed. The authors propose a comprehensive evaluation framework spanning seven dimensions, including scene comprehension, object detection, spatial localization, event analysis, context understanding, tracking and trajectory analysis, and reasoning and strategy analysis. To facilitate this evaluation, they present an automated data generation pipeline that leverages computer vision techniques to extract relevant information from basketball game footage, generating a dataset of over 6,000 curated questions in both image and video formats. The core contribution of this work lies in the creation of a specialized benchmark that addresses a gap in existing evaluations, which often focus on static images or generic video content. The authors' methodology involves a combination of automated processes, including court recognition, player detection, and tracking, coupled with manual annotation to ensure the quality and accuracy of the generated questions and answers. The empirical findings reveal a significant performance gap between current MLLMs and human experts, highlighting the challenges these models face in spatial reasoning and strategic understanding within dynamic environments. Specifically, the results demonstrate that while MLLMs perform reasonably well on tasks such as event analysis, they struggle with tasks requiring precise spatial localization and complex reasoning about game strategies. The authors also observe a trade-off between perceptual acuity and general reasoning, indicating that models optimized for specific perceptual tasks may underperform in broader reasoning tasks. 
The overall significance of this work is the introduction of a challenging and domain-specific benchmark that can serve as a valuable tool for the community to assess and improve the capabilities of MLLMs in understanding dynamic visual environments. The BasketVision benchmark provides a more rigorous testbed than existing benchmarks, pushing the boundaries of MLLM capabilities in sports analytics and dynamic scene understanding. The authors' work also highlights the need for further research into developing models that can effectively integrate visual and textual information to reason about complex, dynamic systems. The paper's findings suggest that current MLLMs, despite their advancements, still have significant limitations in understanding the nuances of dynamic environments, particularly in tasks requiring spatial reasoning and strategic analysis. This work provides a valuable contribution to the field by identifying these limitations and providing a benchmark for future research to address them.

✅ Strengths

I find several aspects of this paper to be particularly strong. The most notable is the introduction of BasketVision as a novel benchmark specifically designed to evaluate MLLMs in the context of a complex dynamic system. This is a significant contribution, as it addresses a gap in existing benchmarks that often focus on static images or generic video content. The choice of professional basketball as a testbed is also well-justified, given the sport's structured, multi-agent nature and the need for models to understand spatial, temporal, and strategic elements. The authors' decision to evaluate models across seven distinct dimensions, including scene comprehension, object detection, spatial localization, event analysis, context understanding, tracking and trajectory analysis, and reasoning and strategy analysis, provides a comprehensive assessment of MLLM capabilities. This multi-faceted approach allows for a more nuanced understanding of the strengths and weaknesses of different models. Furthermore, the development of an automated data generation pipeline is a significant technical innovation. This pipeline, which includes court recognition, perspective transformation, and player tracking, enables the creation of a large-scale dataset with fine-grained precision. The use of both image and video data further enhances the complexity and realism of the benchmark. The paper also presents a thorough evaluation of 23 MLLMs, providing valuable insights into their performance across the different evaluation dimensions. The results clearly demonstrate the challenges that current MLLMs face in spatial reasoning and strategic understanding, highlighting the need for further research in these areas. The paper's clear and well-structured presentation also contributes to its strengths. The authors effectively communicate their methodology, findings, and implications, making the paper accessible to a broad audience. 
The inclusion of detailed experimental results and analysis further strengthens the paper's credibility. The authors' work provides a valuable resource for the community, offering a challenging benchmark that can drive future research in MLLM development and evaluation. The paper's focus on a specific, complex domain like basketball also allows for a more targeted analysis of model capabilities, which is a significant advantage over more general benchmarks. The authors have successfully created a valuable tool for assessing and improving the capabilities of MLLMs in understanding dynamic visual environments.

❌ Weaknesses

Despite the strengths of this paper, I have identified several weaknesses that warrant careful consideration. First, the paper's reliance on a template-based approach for question generation is a significant limitation. While the authors mention using "domain-specific templates" (Section 3.4.5), they do not provide details on the diversity of these templates or methods to ensure the questions are not simply variations of a few base patterns. This raises concerns about the potential for the benchmark to be gamed by models that learn to recognize and respond to these templates rather than demonstrating a genuine understanding of the underlying basketball dynamics. The lack of detail on template creation and validation further exacerbates this concern. This weakness is supported by the absence of any discussion on template design or validation in the method section, and the statement about using "domain-specific templates" without further elaboration. This limitation has a high confidence level, as it is directly observable from the paper's content. The impact of this weakness is that the benchmark may not fully assess the models' understanding of complex basketball scenarios and may lead to inflated performance metrics.

Second, the paper's evaluation methodology, while comprehensive, has some limitations. The use of multiple-choice questions, while mitigating numerical hallucination, may not fully capture the models' understanding. The paper does not provide examples of multiple-choice options, making it difficult to assess the difficulty and potential biases in the questions. Additionally, the evaluation of open-ended questions relies on human annotators, but the paper lacks details on the annotation process, such as the specific rubric used and the measures taken to ensure inter-annotator agreement. This lack of transparency raises concerns about the subjectivity and consistency of the evaluation. The paper mentions that "Two trained annotators...evaluated each generated response...In cases of disagreement, a third senior annotator made the final decision" (Section 4.1), but it does not provide details on the training of the annotators, the specific rubric used, or any quantitative measures of inter-annotator agreement. This weakness has a medium confidence level, as the paper does mention human evaluation, but lacks crucial details. The impact of this weakness is that the evaluation results may be subjective and not fully representative of the models' true capabilities.

Third, the paper's analysis of the results, while insightful, could be more in-depth. The authors identify spatial reasoning as a significant bottleneck and observe a trade-off between perceptual acuity and general reasoning, but they do not delve deeply into the underlying causes of these issues. The paper does not explore the specific types of spatial reasoning tasks that are most challenging for the models, nor does it investigate the impact of different model architectures or training strategies on spatial reasoning performance. The paper states, "Spatial Localization emerges as the most significant bottleneck, with the top score being a mere 53.86%" (Section 4.2.6) and "The performance trade-off between google/gemini-2.5-flash and its vision-optimized variant (- image-preview) perfectly illustrates the specialization-generalization dilemma" (Section 4.2.5), but it does not provide a detailed analysis of why these issues occur. This weakness has a medium confidence level, as the paper does identify these issues, but lacks a deeper investigation. The impact of this weakness is that the paper does not provide a complete understanding of the models' limitations and does not offer clear guidance for future research.

Fourth, the paper's focus on basketball, while providing a controlled environment, limits the generalizability of the findings. The authors acknowledge this limitation, stating, "Our benchmark is intentionally focused on professional basketball to create a controlled yet complex environment. While this specialization is a key strength, the findings may not directly generalize to other dynamic systems with different rules, physics, or agent behaviors" (Section 4.3). However, the paper does not explore the potential for adapting the BasketVision pipeline to other sports or structured domains. This weakness has a high confidence level, as the authors explicitly acknowledge the domain specificity. The impact of this weakness is that the benchmark may not be applicable to other dynamic systems, limiting its broader impact.

Finally, the paper lacks a detailed discussion of the limitations of the automated data generation pipeline. While the authors mention a "systematic quality control process" (Section 3.3), they do not elaborate on the types of errors that the pipeline is prone to and how these errors might affect the evaluation results. This lack of transparency raises concerns about the reliability of the generated data. This weakness has a medium confidence level, as the paper mentions quality control but lacks details. The impact of this weakness is that the reliability of the generated data and the benchmark as a whole may be compromised.
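On the missing inter-annotator agreement, Cohen's κ over the two annotators' accept/reject judgments would be a standard remedy. A minimal sketch, using entirely made-up labels rather than data from the paper:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters identical and fully deterministic
    return (observed - expected) / (1.0 - expected)

# Hypothetical accept/reject judgments from the two annotators.
a = ["acc", "acc", "rej", "acc", "rej", "acc", "rej", "rej"]
b = ["acc", "acc", "rej", "rej", "rej", "acc", "acc", "rej"]
print(round(cohens_kappa(a, b), 3))  # -> 0.5
```

Reporting κ (or Krippendorff's α for more than two raters) alongside the third-annotator adjudication rate would make the open-ended scoring auditable.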

💡 Suggestions

Based on the identified weaknesses, I propose several concrete suggestions for improving this work. First, to address the limitations of the template-based question generation, I recommend exploring methods to generate more diverse and less predictable questions. This could involve incorporating techniques from natural language generation to create questions that are not based on fixed templates. For example, the authors could explore using large language models to generate questions based on a high-level description of the scene, rather than relying on predefined templates. Additionally, the authors should provide a detailed analysis of the types of questions generated, including the distribution of question types and the complexity of the reasoning required to answer them. This would help to demonstrate that the benchmark is not simply testing pattern matching abilities but is truly evaluating the models' understanding of the basketball domain.

Second, to improve the evaluation methodology, I suggest providing more details on the multiple-choice options and the open-ended question evaluation process. The authors should provide examples of multiple-choice options and explain how they are designed to avoid biases. For the open-ended questions, the authors should provide a detailed description of the annotation process, including the specific rubric used, the training of the annotators, and the measures taken to ensure inter-annotator agreement. This would increase the transparency and reliability of the evaluation. Additionally, the authors should consider incorporating more fine-grained evaluation metrics that can capture the models' understanding of the spatial and temporal aspects of the game. For example, they could measure the models' ability to track the movement of players and the ball, or their ability to predict future events based on the current state of the game.

Third, to enhance the analysis of the results, I recommend conducting a more in-depth investigation into the underlying causes of the identified limitations. The authors should explore the specific types of spatial reasoning tasks that are most challenging for the models and investigate the impact of different model architectures or training strategies on spatial reasoning performance. This could involve conducting ablation studies to determine which components of the models are most important for spatial reasoning. Additionally, the authors should analyze the types of errors that the models make and identify any patterns or biases in these errors. This would provide a more complete understanding of the models' limitations and would help to guide future research in this area.

Fourth, to address the limited generalizability of the benchmark, I suggest exploring the potential for adapting the BasketVision pipeline to other sports or structured domains. The authors could investigate how the computer vision techniques used for court recognition and player detection could be modified to work with different types of environments and objects. This would increase the impact of the work and make the benchmark more broadly applicable.

Finally, to address the lack of detail on the automated data generation pipeline, I recommend providing a more thorough discussion of its limitations and potential failure cases. The authors should elaborate on the types of errors that the pipeline is prone to and how these errors might affect the evaluation results. This would increase the transparency and reliability of the benchmark. Additionally, the authors should consider incorporating human verification of a subset of the generated data to ensure its accuracy and consistency. By addressing these weaknesses, the authors can significantly improve the quality and impact of their work.
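To make the template concern concrete, a domain-specific question generator of the kind the paper presumably uses might look like the sketch below. Everything here is hypothetical and illustrative: the frame schema, the template strings, and the field names are assumptions, not details taken from the paper.

```python
import random

# Hypothetical per-frame state as a tracking pipeline might emit it.
frame = {
    "players": [
        {"id": 23, "team": "home", "court_xy": (6.2, 4.1)},
        {"id": 11, "team": "away", "court_xy": (8.9, 7.0)},
    ],
    "ball_handler": 23,
}

# A handful of fixed templates; a model could learn these surface patterns.
TEMPLATES = [
    "Which player is closest to the baseline?",
    "Which team does the current ball handler play for?",
    "Is player #{pid} in the left or right half of the court?",
]

def generate_question(frame, rng):
    """Fill one randomly chosen template from the tracked frame state."""
    template = rng.choice(TEMPLATES)
    player = rng.choice(frame["players"])
    return template.format(pid=player["id"])

print(generate_question(frame, random.Random(0)))
```

With only a few such patterns, a model can succeed by template recognition alone; this is why reporting template counts and surface diversity (or paraphrasing templates with an LLM) matters for the benchmark's validity.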

❓ Questions

I have several questions that arise from my analysis of this paper. First, regarding the template-based question generation, I am curious about the specific process used to create the domain-specific templates. What criteria were used to design these templates, and how was their diversity and coverage of the basketball domain ensured? Second, concerning the evaluation of open-ended questions, I would like to know more about the specific rubric used by the human annotators. What were the key criteria used to assess the quality of the responses, and how was inter-annotator agreement measured and ensured? Third, regarding the analysis of the results, I am interested in understanding the specific types of spatial reasoning tasks that were most challenging for the models. Were there any patterns or commonalities among the errors made by the models in these tasks? Fourth, concerning the automated data generation pipeline, I would like to know more about the types of errors that the pipeline is prone to. What are the limitations of the computer vision techniques used, and how might these limitations affect the quality of the generated data? Finally, regarding the generalizability of the benchmark, I am curious about the potential for adapting the BasketVision pipeline to other sports or structured domains. What modifications would be required to apply the pipeline to different types of environments and objects, and what challenges might arise in this process? These questions are aimed at clarifying key methodological choices and assumptions, and I believe that addressing them would further strengthen the paper's contributions.

📊 Scores

Soundness: 2.5
Presentation: 2.5
Contribution: 2.5
Rating: 5.0
