📋 AI Review from DeepReviewer will be automatically processed
📋 AI Review from ZGCA will be automatically processed
The paper introduces BasketVision, a multimodal benchmark that evaluates MLLMs' understanding of complex dynamic systems, using professional basketball as a structured microcosm. The benchmark comprises 6,000 curated, bilingual questions spanning seven capability dimensions (Scene Comprehension, Object Detection, Spatial Localization, Event Analysis, Context Understanding, Tracking & Trajectory Analysis, Reasoning & Strategy Analysis) with both image and video inputs (Table 1; Fig. 2). The authors propose an automated data generation pipeline integrating court recognition and homography (Alg. 1–2; Fig. 3–4), player detection/ReID (Alg. 3; Fig. 5), and projection onto a standard 2D court (Alg. 4). Data are drawn from NBA/CBA playoffs and multiple Olympic Games (Sec. 3.3), yielding 2,400 images, 1,200 video segments, and over 2,300 minutes of gameplay. The evaluation spans 23 models (proprietary/open-source; Sec. 4.1) plus a human expert baseline. Results (Table 4) show a large gap between human experts (96.34%) and the top model, GPT-4o (63.15%), and identify Spatial Localization as the most challenging dimension (top score 53.86%). The paper analyzes task specialization and argues for architectural limitations in spatial reasoning.
Cross‑Modal Consistency: 28/50
Textual Logical Soundness: 20/30
Visual Aesthetics & Clarity: 14/20
Overall Score: 62/100
Detailed Evaluation (≤500 words):
Image‑First Understanding (visual ground truth)
• Figure 1(a)–(c): Three‑pane pipeline. (a) data sourcing/QA; (b) automated modules (court recognition→projection→tracking→trajectory); (c) data check→manual annotation/validation→“6000 QA pairs”. Flow with icons, arrows, green/orange/blue palettes.
• Figure 2: Single composite showing seven task families; examples include film‑strip frames, MC options, short answers; colored boxes/headings.
• Figure 3: Broadcast frame with detected lines/keypoints/bboxes overlaid; scoreboard/shot‑clock visible; turquoise court polylines.
• Figure 4: Mosaic of Olympics clips; top: original broadcasts; bottom: top‑down projections; no axis/legend.
• Figure 5: Two rows of detections (bboxes+poses) and right‑side 2D court scatter plots of projected player positions.
Synopsis: Figs 1–5 illustrate the pipeline (detect→project→track) and qualitative examples/projections used to generate and evaluate QA pairs across tasks.
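The detect→project stage these figures illustrate can be sketched as a standard DLT homography fit followed by point projection; the landmark correspondences and 28 m × 15 m court dimensions below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate a 3x3 homography mapping src -> dst via the DLT algorithm.
    src, dst: (N, 2) arrays of corresponding points, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right null vector of A (last row of V^T).
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project(H, pts):
    """Apply homography H to (N, 2) image points, returning court coords."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

# Hypothetical correspondences: four court landmarks seen in a broadcast
# frame (pixels) matched to a 28 m x 15 m FIBA court (metres).
image_pts = np.array([[310., 620.], [1610., 640.], [1200., 380.], [720., 370.]])
court_pts = np.array([[0., 0.], [28., 0.], [28., 15.], [0., 15.]])

H = estimate_homography(image_pts, court_pts)
# Project a detected player's foot point onto the top-down court.
player_court = project(H, np.array([[960., 500.]]))
```

With four non-degenerate correspondences the fit is exact; in practice one would use more landmarks and a robust estimator (e.g. RANSAC) to tolerate detection noise.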
1. Cross‑Modal Consistency
• Major 1: Table reference mismatch. Evidence: Sec 3.5.1 “Table 2 provides a detailed statistical breakdown.” → Table 2 actually lists model accuracies, not dataset stats.
• Major 2: Dataset size inconsistency. Evidence: Sec 3.3 “dataset comprises 6,000 high‑quality images…1,200 video segments”; Table 3 shows “Image: 2400, Video: 1200”.
• Major 3: Figure naming inconsistency. Evidence: Fig. 2 title panel uses “BasketBench” while paper’s benchmark is BasketVision.
• Major 4: Redundant/confusing tables. Evidence: Sec 4.2 “Table 4: Detailed performance…”; but Table 2 contains the same per‑model accuracy matrix.
• Minor 1: Task naming drift. Evidence: Fig. 2 uses “Video‑Context Understanding”; Table 1 uses “Context Understanding”.
• Minor 2: Fig. 4 lacks per‑pane year labels or top/bottom annotations; text mentions four Olympics.
2. Text Logic
• Major 1: None blocking the main results chain (pipeline→benchmark→evaluation).
• Minor 1: No inter‑annotator agreement reported for open‑ended scoring. Evidence: Sec 4.1 rubric described, but κ/α not given.
• Minor 2: Claims of “fine‑grained precision” lack quantitative QC metrics (e.g., homography error, ReID accuracy). Evidence: Sec 3.5.1/3.4.5 narrative only.
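One concrete QC metric the authors could report for Minor 2 is reprojection error of the fitted homography on held-out, hand-labelled court landmarks; a minimal sketch with purely illustrative numbers:

```python
import numpy as np

def reprojection_rmse(projected, ground_truth):
    """RMSE (in court units, e.g. metres) between projected landmark
    positions and hand-labelled ground-truth positions."""
    diff = np.asarray(projected) - np.asarray(ground_truth)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

# Illustrative values: five held-out landmarks, pipeline output vs labels.
pred = np.array([[14.0, 7.4], [5.8, 1.6], [22.1, 13.4], [0.3, 7.5], [27.8, 0.2]])
gt   = np.array([[14.0, 7.5], [5.8, 1.5], [22.0, 13.5], [0.0, 7.5], [28.0, 0.0]])
rmse = reprojection_rmse(pred, gt)
```

Reporting this alongside ReID accuracy on a labelled subset would substantiate the "fine-grained precision" claim quantitatively.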
3. Figure Quality
• Major 1: No major issues found.
• Minor 1: Small fonts within Figs 1–2 (box texts, options) are likely illegible at print size.
• Minor 2 (Figure‑Alone Test): Fig. 4/5 need legends: label top vs. projection; add team‑color/ID legend and coordinate axes; Fig. 3 should mark detected keypoints/lines with a legend.
Key strengths:
• Clear end‑to‑end pipeline with homography‑based grounding and ReID.
• Comprehensive, domain‑structured evaluation across seven dimensions with bilingual QA.
• Results analysis ties task specialization to architectural limits; several claims numerically supported (e.g., GPT‑4o 63.15%, SL bottleneck 53.86%).
Key weaknesses:
• Serious figure–text/table mismatches (Tables 2/3/4; Fig. 2 naming).
• Dataset scale inconsistencies reduce credibility.
• Missing quantitative QC for pipeline accuracy; limited figure legends hinder stand‑alone comprehension.
📋 AI Review from SafeReviewer will be automatically processed
This paper introduces BasketVision, a novel benchmark designed to evaluate the capabilities of Multimodal Large Language Models (MLLMs) in understanding complex dynamic systems, using professional basketball as a testbed. The authors propose a comprehensive evaluation framework spanning seven dimensions, including scene comprehension, object detection, spatial localization, event analysis, context understanding, tracking and trajectory analysis, and reasoning and strategy analysis. To facilitate this evaluation, they present an automated data generation pipeline that leverages computer vision techniques to extract relevant information from basketball game footage, generating a dataset of over 6,000 curated questions in both image and video formats. The core contribution of this work lies in the creation of a specialized benchmark that addresses a gap in existing evaluations, which often focus on static images or generic video content. The authors' methodology involves a combination of automated processes, including court recognition, player detection, and tracking, coupled with manual annotation to ensure the quality and accuracy of the generated questions and answers. The empirical findings reveal a significant performance gap between current MLLMs and human experts, highlighting the challenges these models face in spatial reasoning and strategic understanding within dynamic environments. Specifically, the results demonstrate that while MLLMs perform reasonably well on tasks such as event analysis, they struggle with tasks requiring precise spatial localization and complex reasoning about game strategies. The authors also observe a trade-off between perceptual acuity and general reasoning, indicating that models optimized for specific perceptual tasks may underperform in broader reasoning tasks. 
The overall significance of this work is the introduction of a challenging and domain-specific benchmark that can serve as a valuable tool for the community to assess and improve the capabilities of MLLMs in understanding dynamic visual environments. The BasketVision benchmark provides a more rigorous testbed than existing benchmarks, pushing the boundaries of MLLM capabilities in sports analytics and dynamic scene understanding. The authors' work also highlights the need for further research into developing models that can effectively integrate visual and textual information to reason about complex, dynamic systems. The paper's findings suggest that current MLLMs, despite their advancements, still have significant limitations in understanding the nuances of dynamic environments, particularly in tasks requiring spatial reasoning and strategic analysis. This work provides a valuable contribution to the field by identifying these limitations and providing a benchmark for future research to address them.
I find several aspects of this paper to be particularly strong. The most notable is the introduction of BasketVision as a novel benchmark specifically designed to evaluate MLLMs in the context of a complex dynamic system. This is a significant contribution, as it addresses a gap in existing benchmarks that often focus on static images or generic video content. The choice of professional basketball as a testbed is also well-justified, given the sport's structured, multi-agent nature and the need for models to understand spatial, temporal, and strategic elements. The authors' decision to evaluate models across seven distinct dimensions, including scene comprehension, object detection, spatial localization, event analysis, context understanding, tracking and trajectory analysis, and reasoning and strategy analysis, provides a comprehensive assessment of MLLM capabilities. This multi-faceted approach allows for a more nuanced understanding of the strengths and weaknesses of different models. Furthermore, the development of an automated data generation pipeline is a significant technical innovation. This pipeline, which includes court recognition, perspective transformation, and player tracking, enables the creation of a large-scale dataset with fine-grained precision. The use of both image and video data further enhances the complexity and realism of the benchmark. The paper also presents a thorough evaluation of 23 MLLMs, providing valuable insights into their performance across the different evaluation dimensions. The results clearly demonstrate the challenges that current MLLMs face in spatial reasoning and strategic understanding, highlighting the need for further research in these areas. The paper's clear and well-structured presentation also contributes to its strengths. The authors effectively communicate their methodology, findings, and implications, making the paper accessible to a broad audience. 
The inclusion of detailed experimental results and analysis further strengthens the paper's credibility. The authors' work provides a valuable resource for the community, offering a challenging benchmark that can drive future research in MLLM development and evaluation. The paper's focus on a specific, complex domain like basketball also allows for a more targeted analysis of model capabilities, which is a significant advantage over more general benchmarks. The authors have successfully created a valuable tool for assessing and improving the capabilities of MLLMs in understanding dynamic visual environments.
Despite the strengths of this paper, I have identified several weaknesses that warrant careful consideration. First, the paper's reliance on a template-based approach for question generation is a significant limitation. While the authors mention using "domain-specific templates" (Section 3.4.5), they do not provide details on the diversity of these templates or methods to ensure the questions are not simply variations of a few base patterns. This raises concerns about the potential for the benchmark to be gamed by models that learn to recognize and respond to these templates rather than demonstrating a genuine understanding of the underlying basketball dynamics. The lack of detail on template creation and validation further exacerbates this concern. This weakness is supported by the absence of any discussion on template design or validation in the method section, and the statement about using "domain-specific templates" without further elaboration. This limitation has a high confidence level, as it is directly observable from the paper's content. The impact of this weakness is that the benchmark may not fully assess the models' understanding of complex basketball scenarios and may lead to inflated performance metrics. Second, the paper's evaluation methodology, while comprehensive, has some limitations. The use of multiple-choice questions, while mitigating numerical hallucination, may not fully capture the models' understanding. The paper does not provide examples of multiple-choice options, making it difficult to assess the difficulty and potential biases in the questions. Additionally, the evaluation of open-ended questions relies on human annotators, but the paper lacks details on the annotation process, such as the specific rubric used and the measures taken to ensure inter-annotator agreement. This lack of transparency raises concerns about the subjectivity and consistency of the evaluation. 
The paper mentions that "Two trained annotators...evaluated each generated response...In cases of disagreement, a third senior annotator made the final decision" (Section 4.1), but it does not provide details on the training of the annotators, the specific rubric used, or any quantitative measures of inter-annotator agreement. This weakness has a medium confidence level, as the paper does mention human evaluation, but lacks crucial details. The impact of this weakness is that the evaluation results may be subjective and not fully representative of the models' true capabilities. Third, the paper's analysis of the results, while insightful, could be more in-depth. The authors identify spatial reasoning as a significant bottleneck and observe a trade-off between perceptual acuity and general reasoning, but they do not delve deeply into the underlying causes of these issues. The paper does not explore the specific types of spatial reasoning tasks that are most challenging for the models, nor does it investigate the impact of different model architectures or training strategies on spatial reasoning performance. The paper states, "Spatial Localization emerges as the most significant bottleneck, with the top score being a mere 53.86%" (Section 4.2.6) and "The performance trade-off between google/gemini-2.5-flash and its vision-optimized variant (-image-preview) perfectly illustrates the specialization-generalization dilemma" (Section 4.2.5), but it does not provide a detailed analysis of why these issues occur. This weakness has a medium confidence level, as the paper does identify these issues, but lacks a deeper investigation. The impact of this weakness is that the paper does not provide a complete understanding of the models' limitations and does not offer clear guidance for future research. Fourth, the paper's focus on basketball, while providing a controlled environment, limits the generalizability of the findings.
The authors acknowledge this limitation, stating, "Our benchmark is intentionally focused on professional basketball to create a controlled yet complex environment. While this specialization is a key strength, the findings may not directly generalize to other dynamic systems with different rules, physics, or agent behaviors" (Section 4.3). However, the paper does not explore the potential for adapting the BasketVision pipeline to other sports or structured domains. This weakness has a high confidence level, as the authors explicitly acknowledge the domain specificity. The impact of this weakness is that the benchmark may not be applicable to other dynamic systems, limiting its broader impact. Finally, the paper lacks a detailed discussion of the limitations of the automated data generation pipeline. While the authors mention a "systematic quality control process" (Section 3.3), they do not elaborate on the types of errors that the pipeline is prone to and how these errors might affect the evaluation results. This lack of transparency raises concerns about the reliability of the generated data. This weakness has a medium confidence level, as the paper mentions quality control but lacks details. The impact of this weakness is that the reliability of the generated data and the benchmark as a whole may be compromised.
Based on the identified weaknesses, I propose several concrete suggestions for improving this work. First, to address the limitations of the template-based question generation, I recommend exploring methods to generate more diverse and less predictable questions. This could involve incorporating techniques from natural language generation to create questions that are not based on fixed templates. For example, the authors could explore using large language models to generate questions based on a high-level description of the scene, rather than relying on predefined templates. Additionally, the authors should provide a detailed analysis of the types of questions generated, including the distribution of question types and the complexity of the reasoning required to answer them. This would help to demonstrate that the benchmark is not simply testing pattern matching abilities but is truly evaluating the models' understanding of the basketball domain. Second, to improve the evaluation methodology, I suggest providing more details on the multiple-choice options and the open-ended question evaluation process. The authors should provide examples of multiple-choice options and explain how they are designed to avoid biases. For the open-ended questions, the authors should provide a detailed description of the annotation process, including the specific rubric used, the training of the annotators, and the measures taken to ensure inter-annotator agreement. This would increase the transparency and reliability of the evaluation. Additionally, the authors should consider incorporating more fine-grained evaluation metrics that can capture the models' understanding of the spatial and temporal aspects of the game. For example, they could measure the models' ability to track the movement of players and the ball, or their ability to predict future events based on the current state of the game. 
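The inter-annotator-agreement suggestion above could be made concrete with a standard chance-corrected statistic such as Cohen's κ over the two annotators' judgments; a minimal sketch, with the correct/incorrect labels below being illustrative, not the paper's data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical judgments:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Illustrative judgments from two annotators on six open-ended responses.
ann_a = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
ann_b = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
kappa = cohens_kappa(ann_a, ann_b)
```

Reporting κ (or Krippendorff's α for more than two annotators) alongside the rubric would make the open-ended evaluation auditable.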
Third, to enhance the analysis of the results, I recommend conducting a more in-depth investigation into the underlying causes of the identified limitations. The authors should explore the specific types of spatial reasoning tasks that are most challenging for the models and investigate the impact of different model architectures or training strategies on spatial reasoning performance. This could involve conducting ablation studies to determine which components of the models are most important for spatial reasoning. Additionally, the authors should analyze the types of errors that the models make and identify any patterns or biases in these errors. This would provide a more complete understanding of the models' limitations and would help to guide future research in this area. Fourth, to address the limited generalizability of the benchmark, I suggest exploring the potential for adapting the BasketVision pipeline to other sports or structured domains. The authors could investigate how the computer vision techniques used for court recognition and player detection could be modified to work with different types of environments and objects. This would increase the impact of the work and make the benchmark more broadly applicable. Finally, to address the lack of detail on the automated data generation pipeline, I recommend providing a more thorough discussion of its limitations and potential failure cases. The authors should elaborate on the types of errors that the pipeline is prone to and how these errors might affect the evaluation results. This would increase the transparency and reliability of the benchmark. Additionally, the authors should consider incorporating human verification of a subset of the generated data to ensure its accuracy and consistency. By addressing these weaknesses, the authors can significantly improve the quality and impact of their work.
I have several questions that arise from my analysis of this paper. First, regarding the template-based question generation, I am curious about the specific process used to create the domain-specific templates. What criteria were used to design these templates, and how was their diversity and coverage of the basketball domain ensured? Second, concerning the evaluation of open-ended questions, I would like to know more about the specific rubric used by the human annotators. What were the key criteria used to assess the quality of the responses, and how was inter-annotator agreement measured and ensured? Third, regarding the analysis of the results, I am interested in understanding the specific types of spatial reasoning tasks that were most challenging for the models. Were there any patterns or commonalities among the errors made by the models in these tasks? Fourth, concerning the automated data generation pipeline, I would like to know more about the types of errors that the pipeline is prone to. What are the limitations of the computer vision techniques used, and how might these limitations affect the quality of the generated data? Finally, regarding the generalizability of the benchmark, I am curious about the potential for adapting the BasketVision pipeline to other sports or structured domains. What modifications would be required to apply the pipeline to different types of environments and objects, and what challenges might arise in this process? These questions are aimed at clarifying key methodological choices and assumptions, and I believe that addressing them would further strengthen the paper's contributions.