MeniOmni: A Structured Multimodal Benchmark for Holistic Meniscus Injury Assessment
Pith reviewed 2026-06-29 13:23 UTC · model grok-4.3
The pith
A benchmark pairing MRI scans with patient details shows that clinical context improves meniscus injury grading and cuts severe mistakes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MeniOmni supplies 746 tri-planar volumetric MRI studies, Clinical Priors such as sex age and BMI, and expert-annotated clinical text. The benchmark supports Stoller severity grading and diagnostic report generation, evaluated with risk-aware ordinal metrics and a semantic consistency score called Meni-Score. Experiments demonstrate that models given the Clinical Priors outperform image-only baselines and produce fewer severe grading errors.
What carries the argument
MeniOmni benchmark that supplies tri-planar MRI volumes together with Clinical Priors for joint evaluation of grading accuracy and report quality.
If this is right
- Models supplied with Clinical Priors achieve higher accuracy on Stoller grading than image-only models.
- The rate of severe grading errors drops when patient context is included.
- Multimodal inputs enable evaluation of clinical reasoning that combines image and non-image evidence.
- The two tasks and new metrics provide a structured way to compare systems on report generation quality.
Where Pith is reading between the lines
- Similar multimodal pairing of scans and priors could be applied to other joint structures or to ligament and cartilage assessment.
- The reported gains would need confirmation on data from additional scanner vendors or patient populations.
- If the pattern holds, clinical systems might reduce diagnostic variability by routinely surfacing patient context alongside images.
Load-bearing premise
The expert-provided Stoller grades and clinical text annotations on these 746 studies serve as reliable ground truth for measuring holistic reasoning performance.
What would settle it
Re-running the baselines on a fresh set of cases where final clinical outcomes or independent radiologist re-reads replace the original annotations would show no accuracy gain or error reduction from adding Clinical Priors.
Figures
read the original abstract
Clinical diagnosis of meniscus injuries requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports. Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, limiting their ability to evaluate holistic clinical reasoning. We introduce MeniOmni, a structured multimodal benchmark for meniscus injury assessment, consisting of 746 multi-center MRI studies with tri-planar volumetric inputs, Clinical Priors, and expert-annotated clinical text. MeniOmni supports two tasks: (1) fine-grained Stoller severity grading and (2) diagnostic report generation. We further propose risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) to better reflect clinical relevance. Baseline experiments show that incorporating Clinical Priors improves grading performance and reduces severe errors, highlighting the value of multimodal context for safer assessment. Code and data are available at https://github.com/ShuruiXu/MeniOmni.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MeniOmni, a structured multimodal benchmark consisting of 746 multi-center knee MRI studies with tri-planar volumetric inputs, clinical priors (e.g., sex, age, BMI), and expert-annotated clinical text. It defines two tasks—fine-grained Stoller severity grading and diagnostic report generation—along with risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score). Baseline experiments indicate that incorporating clinical priors improves grading performance and reduces severe errors.
Significance. If the annotations are shown to be reliable, the benchmark could advance multimodal medical AI by supplying structured data and clinically aligned metrics that go beyond existing unimodal knee MRI datasets. The explicit release of code and data is a clear strength supporting reproducibility.
major comments (2)
- Dataset construction (as described in the abstract and implied methods): No inter-annotator agreement, annotation protocol details, or correlation with surgical findings are reported for the expert-annotated Stoller grades and clinical text on the 746 studies. This directly undermines the central claim that 'incorporating Clinical Priors improves grading performance and reduces severe errors,' because measured gains could arise from fitting to unvalidated or inconsistent labels rather than genuine multimodal reasoning.
- Baseline experiments (abstract): The description provides no information on model architectures, data splits, statistical significance tests, or controls for selection biases. These omissions are load-bearing for interpreting the reported improvements from clinical priors and prevent verification of the soundness of the empirical results.
minor comments (2)
- Clarify the exact variables included in 'Clinical Priors' beyond the examples given, and provide the precise definition or formula for the Meni-Score metric to support reproducibility.
- Add a dedicated comparison table against existing knee MRI benchmarks to highlight differences in modality, label granularity, and evaluation approach.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comments point-by-point below, clarifying the manuscript content and indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: Dataset construction (as described in the abstract and implied methods): No inter-annotator agreement, annotation protocol details, or correlation with surgical findings are reported for the expert-annotated Stoller grades and clinical text on the 746 studies. This directly undermines the central claim that 'incorporating Clinical Priors improves grading performance and reduces severe errors,' because measured gains could arise from fitting to unvalidated or inconsistent labels rather than genuine multimodal reasoning.
Authors: We agree that explicit documentation of the annotation process is essential for a benchmark paper. The full manuscript states that Stoller grades and clinical text were produced by board-certified radiologists following established clinical criteria, with a high-level description of the protocol. However, inter-annotator agreement statistics and correlation with surgical findings were not collected during dataset creation. We will add a dedicated subsection detailing the annotation workflow and will revise the discussion to explicitly note these validation limitations. The reported improvements from clinical priors are empirical observations on the given labels; we will qualify the central claim to reflect the current level of label validation. revision: partial
-
Referee: Baseline experiments (abstract): The description provides no information on model architectures, data splits, statistical significance tests, or controls for selection biases. These omissions are load-bearing for interpreting the reported improvements from clinical priors and prevent verification of the soundness of the empirical results.
Authors: The full manuscript contains dedicated experimental sections that specify the multimodal model architectures, the train/validation/test splits (stratified by center to address selection bias), and the statistical tests used to assess improvements. These details were omitted from the abstract for brevity. We will expand the abstract and methods to include this information explicitly and will add a short paragraph on bias controls. revision: yes
- Inter-annotator agreement scores and correlation with surgical findings were never collected, so numerical values for these cannot be supplied.
Circularity Check
No circularity: benchmark paper with no derivations or fitted predictions
full rationale
The manuscript introduces MeniOmni as a dataset and benchmark supporting two tasks (Stoller grading and report generation) together with baseline experiments and new metrics. No equations, parameter fits, uniqueness theorems, or predictions appear anywhere in the provided text. All reported results are direct empirical measurements on the released 746-study collection; the central claim that Clinical Priors improve performance is therefore an observable outcome rather than a quantity derived from itself. No self-citations are invoked to justify any load-bearing step.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Meniscal tears: current understanding, diagnosis, and management,
Kavyansh Bhan, “Meniscal tears: current understanding, diagnosis, and management,”Cureus, vol. 12, no. 6, pp. e8590, 2020
2020
-
[2]
Trends in meniscus repair and meniscectomy in the United States, 2005-2011,
Geoffrey D Abrams, Rachel M Frank, et al., “Trends in meniscus repair and meniscectomy in the United States, 2005-2011,”The American Journal of Sports Medicine, vol. 41, no. 10, pp. 2333–2339, 2013
2005
-
[3]
MRI versus arthroscopy in the diagnosis of knee pathology, concentrating on meniscal lesions and ACL tears: a systematic review,
Ruth Crawford, Gayle Walley, et al., “MRI versus arthroscopy in the diagnosis of knee pathology, concentrating on meniscal lesions and ACL tears: a systematic review,”British Medical Bulletin, vol. 84, no. 1, pp. 5–23, 2007
2007
-
[4]
Can we trust AI doctors? A survey of medical hallucination in large language and large vision-language models,
Zhihong Zhu, Yunyan Zhang, et al., “Can we trust AI doctors? A survey of medical hallucination in large language and large vision-language models,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 6748–6769
2025
-
[5]
Deep-learning-assisted diagno- sis for knee MRI: development and retrospective validation of MRNet,
Nicholas Bien, Pranav Rajpurkar, et al., “Deep-learning-assisted diagno- sis for knee MRI: development and retrospective validation of MRNet,” PLoS Medicine, vol. 15, no. 11, pp. e1002699, 2018
2018
-
[6]
fastMRI: An Open Dataset and Benchmarks for Accelerated MRI
Jure Zbontar, Florian Knoll, et al., “fastMRI: An open dataset and benchmarks for accelerated MRI,”arXiv preprint arXiv:1811.08839, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[7]
fastMRI+, clinical pathol- ogy annotations for knee and brain fully sampled MRI data,
Ruiyang Zhao, Burhaneddin Yaman, et al., “fastMRI+, clinical pathol- ogy annotations for knee and brain fully sampled MRI data,”Scientific Data, vol. 9, no. 1, pp. 152, 2022
2022
-
[8]
Semi-automated detection of anterior cruciate ligament injury from MRI,
Ivan ˇStajduhar, Mihaela Mamula, et al., “Semi-automated detection of anterior cruciate ligament injury from MRI,”Computer Methods and Programs in Biomedicine, vol. 140, pp. 151–164, 2017
2017
-
[9]
MeniMV: A multi-view benchmark for meniscus injury severity grading,
Shurui Xu, Siqi Yang, et al., “MeniMV: A multi-view benchmark for meniscus injury severity grading,”arXiv preprint arXiv:2505.00000, 2025
-
[10]
Deep feature learning for knee cartilage segmentation using a triplanar CNN,
Adhish Prasoon, Kersten Petersen, et al., “Deep feature learning for knee cartilage segmentation using a triplanar CNN,” inProceedings of the International Conference on Medical Image Computing and Computer- Assisted Intervention (MICCAI), 2013, pp. 246–253
2013
-
[11]
nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation,
Fabian Isensee, Paul F Jaeger, et al., “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation,”Nature Methods, vol. 18, no. 2, pp. 203–211, 2021
2021
-
[12]
Quality-driven deep active learning method for 3D brain MRI segmentation,
Zhenxi Zhang, Jie Li, et al., “Quality-driven deep active learning method for 3D brain MRI segmentation,”Neurocomputing, vol. 446, pp. 106– 117, 2021
2021
-
[13]
Llava-med: Training a large language- and-vision assistant for biomedicine in one day,
Chunyuan Li, Cliff Wong, et al., “Llava-med: Training a large language- and-vision assistant for biomedicine in one day,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, vol. 36, pp. 28541– 28564
2023
-
[14]
Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data,
Chaoyi Wu, Xiaoman Zhang, et al., “Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data,” Nature Communications, vol. 16, no. 1, pp. 7866, 2025
2025
-
[15]
Beyond the hype: A dispassionate look at vision-language models in medical scenario,
Yang Nan, Huichi Zhou, et al., “Beyond the hype: A dispassionate look at vision-language models in medical scenario,”IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 10, pp. 17623– 17634, 2025
2025
-
[16]
Junlong Cheng, Jin Ye, et al., “SAM-Med2D,”arXiv preprint arXiv:2308.16184, 2023
-
[17]
Image quality assessment: from error visibility to structural similarity,
Zhou Wang, Alan C Bovik, et al., “Image quality assessment: from error visibility to structural similarity,”IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004
2004
-
[18]
X3D: Expanding architectures for efficient video recognition,
Christoph Feichtenhofer, “X3D: Expanding architectures for efficient video recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 203–213
2020
-
[19]
ViViT: A video vision transformer,
Anurag Arnab, Mostafa Dehghani, et al., “ViViT: A video vision transformer,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 6836–6846
2021
-
[20]
Video Swin transformer,
Ze Liu, Jia Ning, et al., “Video Swin transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 3202–3211
2022
-
[21]
Video-llava: Learning united visual rep- resentation by alignment before projection,
Bin Lin, Yang Ye, et al., “Video-llava: Learning united visual rep- resentation by alignment before projection,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 5971–5984
2024
-
[22]
InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,
Zhe Chen, Jiannan Wu, et al., “InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2024, pp. 24185–24198
2024
-
[23]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, et al., “Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.