MeniOmni: A Structured Multimodal Benchmark for Holistic Meniscus Injury Assessment

Hui Wang; Mengzhen Fan; Shurui Xu; Shuyan Li; Siqi Yang; Weiping Ding; Yuyu Sun

arxiv: 2605.28161 · v1 · pith:SJVHERAJnew · submitted 2026-05-27 · 💻 cs.CV

Shurui Xu, Siqi Yang, Weiping Ding, Hui Wang, Mengzhen Fan, Yuyu Sun, Shuyan Li

Pith reviewed 2026-06-29 13:23 UTC · model grok-4.3

keywords: meniscus injury · multimodal benchmark · knee MRI · Stoller grading · clinical priors · diagnostic report generation · medical imaging evaluation · AI for radiology

The pith

A benchmark pairing MRI scans with patient details shows that clinical context improves meniscus injury grading and cuts severe mistakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MeniOmni, a collection of 746 multi-center knee MRI studies that supplies volumetric images together with patient information and expert text to test integrated diagnosis of meniscus tears. It defines two concrete tasks—fine-grained Stoller severity grading and structured report generation—plus evaluation metrics that treat clinically dangerous errors as more costly. Baseline runs indicate that feeding the patient context into models raises grading accuracy and lowers the rate of high-severity misclassifications compared with image-only inputs.

Core claim

MeniOmni supplies 746 tri-planar volumetric MRI studies, Clinical Priors such as sex, age, and BMI, and expert-annotated clinical text. The benchmark supports Stoller severity grading and diagnostic report generation, evaluated with risk-aware ordinal metrics and a semantic consistency score called Meni-Score. Experiments demonstrate that models given the Clinical Priors outperform image-only baselines and produce fewer severe grading errors.
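To make the idea of a risk-aware ordinal metric concrete, here is a minimal sketch. It is not the paper's metric: the formula is not given in this summary, so the asymmetric cost (under-grading counted twice as heavily as over-grading), the integer grade encoding 0 to 3, and the function names `ordinal_cost`, `risk_aware_error`, and `severe_error_rate` are all assumptions made for illustration.

```python
from typing import Sequence


def ordinal_cost(true_grade: int, pred_grade: int, under_penalty: float = 2.0) -> float:
    """Cost grows with grade distance; under-grading is scaled by under_penalty.

    Assumption: grades are integers (e.g. Stoller 0-3) and missing a more
    severe injury is the riskier direction. The paper may weight differently.
    """
    distance = abs(true_grade - pred_grade)
    if pred_grade < true_grade:
        return under_penalty * distance
    return float(distance)


def risk_aware_error(y_true: Sequence[int], y_pred: Sequence[int],
                     under_penalty: float = 2.0) -> float:
    """Mean asymmetric ordinal cost over all cases (0.0 means perfect grading)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    total = sum(ordinal_cost(t, p, under_penalty) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


def severe_error_rate(y_true: Sequence[int], y_pred: Sequence[int],
                      min_gap: int = 2) -> float:
    """Fraction of cases whose predicted grade is off by at least min_gap grades."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    severe = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) >= min_gap)
    return severe / len(y_true)


# Toy example: one over-grade by 1 and one under-grade by 2.
truth = [0, 1, 2, 3]
preds = [0, 1, 3, 1]
print(risk_aware_error(truth, preds))   # 1.25
print(severe_error_rate(truth, preds))  # 0.25
```

Plain accuracy would score both mistakes identically; a cost of this shape separates a one-grade slip from a missed high-grade injury, which is the distinction the benchmark's evaluation is said to care about.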

What carries the argument

The MeniOmni benchmark itself: tri-planar MRI volumes paired with Clinical Priors, so that grading accuracy and report quality can be evaluated jointly.

If this is right

Models supplied with Clinical Priors achieve higher accuracy on Stoller grading than image-only models.
The rate of severe grading errors drops when patient context is included.
Multimodal inputs enable evaluation of clinical reasoning that combines image and non-image evidence.
The two tasks and new metrics provide a structured way to compare systems on report generation quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar multimodal pairing of scans and priors could be applied to other joint structures or to ligament and cartilage assessment.
The reported gains would need confirmation on data from additional scanner vendors or patient populations.
If the pattern holds, clinical systems might reduce diagnostic variability by routinely surfacing patient context alongside images.

Load-bearing premise

The expert-provided Stoller grades and clinical text annotations on these 746 studies serve as reliable ground truth for measuring holistic reasoning performance.

What would settle it

If re-running the baselines on a fresh set of cases, with final clinical outcomes or independent radiologist re-reads replacing the original annotations, showed no accuracy gain or error reduction from adding Clinical Priors, the central claim would not hold.
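One way such a re-run could be scored is a paired test on the same cases, comparing the model with Clinical Priors against the image-only model. The sketch below uses an exact McNemar test on discordant cases. This is a swapped-in standard technique, not something the paper is known to use, and the function name `mcnemar_exact` and the toy labels are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(y_true: Sequence[int], pred_a: Sequence[int],
                  pred_b: Sequence[int]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired predictions.

    Returns (cases only model A got right, cases only model B got right, p-value).
    Cases where both models agree in correctness carry no information and are ignored.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all three sequences must have the same length")
    a_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if b == t and a != t)
    n = a_only + b_only
    if n == 0:
        return 0, 0, 1.0  # no discordant cases, so no evidence either way
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Toy example: A = with priors, B = image-only, graded against re-read labels.
labels = [2, 2, 2, 2, 2, 2]
with_priors = [2, 2, 2, 2, 2, 0]
image_only = [0, 0, 0, 0, 2, 0]
print(mcnemar_exact(labels, with_priors, image_only))  # (4, 0, 0.125)
```

A large p-value on independently re-read cases would be the "no gain" outcome described above; with only a handful of discordant cases, as in the toy data, the test has little power either way.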

Figures

Figures reproduced from arXiv: 2605.28161 by Hui Wang, Mengzhen Fan, Shurui Xu, Shuyan Li, Siqi Yang, Weiping Ding, Yuyu Sun.

**Figure 2.** Schematic illustration of the proposed content-adaptive MRI slice

**Figure 3.** Qualitative Analysis. The integration of clinical priors (Age: 26)

Original abstract

Clinical diagnosis of meniscus injuries requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports. Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, limiting their ability to evaluate holistic clinical reasoning. We introduce MeniOmni, a structured multimodal benchmark for meniscus injury assessment, consisting of 746 multi-center MRI studies with tri-planar volumetric inputs, Clinical Priors, and expert-annotated clinical text. MeniOmni supports two tasks: (1) fine-grained Stoller severity grading and (2) diagnostic report generation. We further propose risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) to better reflect clinical relevance. Baseline experiments show that incorporating Clinical Priors improves grading performance and reduces severe errors, highlighting the value of multimodal context for safer assessment. Code and data are available at https://github.com/ShuruiXu/MeniOmni.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MeniOmni releases a new multimodal knee MRI benchmark with clinical priors and custom metrics, but the ground truth labels lack any reported validation.

The main thing to know is that this paper puts out a dataset of 746 multi-center MRI studies with tri-planar volumes, patient context like age and BMI, and expert text, aimed at two tasks: fine-grained Stoller grading and report generation. It also defines risk-aware ordinal scoring and a Meni-Score for semantic consistency. That fills a stated gap in existing unimodal knee benchmarks.

The work does a straightforward job of making the data and code public and showing that adding clinical priors to baselines cuts severe errors in grading. Releasing the resource itself is the concrete step forward.

The soft spots sit in the evidence and the labels. The abstract claims the priors help, yet gives no model details, splits, or statistical tests, so the size of the gain stays unclear. More directly, the Stoller grades and expert text are treated as ground truth without any mention of inter-annotator agreement, annotation protocol, or correlation to surgical findings. If those labels contain noise or inconsistency, the measured improvements could simply reflect fitting to the provided annotations rather than better multimodal reasoning.

This is for researchers building or testing multimodal models on knee imaging who need a structured benchmark with patient context. A reader looking for ready-to-use data and tasks would find it useful.

It deserves peer review because the dataset is new and the multimodal framing is reasonable; referees can check the annotation quality and baseline details that the abstract leaves out.

Referee Report

2 major / 2 minor

Summary. The paper introduces MeniOmni, a structured multimodal benchmark consisting of 746 multi-center knee MRI studies with tri-planar volumetric inputs, clinical priors (e.g., sex, age, BMI), and expert-annotated clinical text. It defines two tasks—fine-grained Stoller severity grading and diagnostic report generation—along with risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score). Baseline experiments indicate that incorporating clinical priors improves grading performance and reduces severe errors.

Significance. If the annotations are shown to be reliable, the benchmark could advance multimodal medical AI by supplying structured data and clinically aligned metrics that go beyond existing unimodal knee MRI datasets. The explicit release of code and data is a clear strength supporting reproducibility.

major comments (2)

Dataset construction (as described in the abstract and implied methods): No inter-annotator agreement, annotation protocol details, or correlation with surgical findings are reported for the expert-annotated Stoller grades and clinical text on the 746 studies. This directly undermines the central claim that 'incorporating Clinical Priors improves grading performance and reduces severe errors,' because measured gains could arise from fitting to unvalidated or inconsistent labels rather than genuine multimodal reasoning.
Baseline experiments (abstract): The description provides no information on model architectures, data splits, statistical significance tests, or controls for selection biases. These omissions are load-bearing for interpreting the reported improvements from clinical priors and prevent verification of the soundness of the empirical results.
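For reference, the inter-annotator agreement requested in the first major comment is commonly reported for ordinal labels as a quadratic weighted Cohen's kappa. The sketch below shows how it could be computed if two independent radiologist readings were available. The function name, the grade encoding 0 to 3, and the toy readings are assumptions; the paper reports no such statistic.

```python
from collections import Counter
from typing import Sequence


def quadratic_weighted_kappa(rater_a: Sequence[int], rater_b: Sequence[int],
                             num_grades: int = 4) -> float:
    """Quadratic weighted Cohen's kappa for two raters on integer grades 0..num_grades-1."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("ratings must be non-empty and of equal length")
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (grade_a, grade_b)
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    max_sq = (num_grades - 1) ** 2
    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            weight = (i - j) ** 2 / max_sq     # 0 on the diagonal, 1 at maximum distance
            obs_disagreement += weight * observed[(i, j)] / n
            exp_disagreement += weight * hist_a[i] * hist_b[j] / (n * n)
    if exp_disagreement == 0:
        # Both raters used one and the same grade throughout; kappa is
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return 1.0 - obs_disagreement / exp_disagreement


# Toy readings: identical raters give 1.0, chance-level raters give 0.0.
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3]))  # 1.0
print(quadratic_weighted_kappa([0, 0, 3, 3], [0, 3, 0, 3]))  # 0.0
```

The quadratic weights penalise a grade 0 versus grade 3 disagreement far more than an adjacent-grade one, which matches the ordinal nature of Stoller grading.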

minor comments (2)

Clarify the exact variables included in 'Clinical Priors' beyond the examples given, and provide the precise definition or formula for the Meni-Score metric to support reproducibility.
Add a dedicated comparison table against existing knee MRI benchmarks to highlight differences in modality, label granularity, and evaluation approach.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback. We address the major comments point-by-point below, clarifying the manuscript content and indicating planned revisions where appropriate.

Referee: Dataset construction (as described in the abstract and implied methods): No inter-annotator agreement, annotation protocol details, or correlation with surgical findings are reported for the expert-annotated Stoller grades and clinical text on the 746 studies. This directly undermines the central claim that 'incorporating Clinical Priors improves grading performance and reduces severe errors,' because measured gains could arise from fitting to unvalidated or inconsistent labels rather than genuine multimodal reasoning.

Authors: We agree that explicit documentation of the annotation process is essential for a benchmark paper. The full manuscript states that Stoller grades and clinical text were produced by board-certified radiologists following established clinical criteria, with a high-level description of the protocol. However, inter-annotator agreement statistics and correlation with surgical findings were not collected during dataset creation. We will add a dedicated subsection detailing the annotation workflow and will revise the discussion to explicitly note these validation limitations. The reported improvements from clinical priors are empirical observations on the given labels; we will qualify the central claim to reflect the current level of label validation. revision: partial
Referee: Baseline experiments (abstract): The description provides no information on model architectures, data splits, statistical significance tests, or controls for selection biases. These omissions are load-bearing for interpreting the reported improvements from clinical priors and prevent verification of the soundness of the empirical results.

Authors: The full manuscript contains dedicated experimental sections that specify the multimodal model architectures, the train/validation/test splits (stratified by center to address selection bias), and the statistical tests used to assess improvements. These details were omitted from the abstract for brevity. We will expand the abstract and methods to include this information explicitly and will add a short paragraph on bias controls. revision: yes
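The simulated response mentions splits "stratified by center". One plausible reading is that each center contributes cases to train, validation, and test in similar proportions. The sketch below illustrates that reading only; the function name `split_by_center`, the 70/10/20 fractions, and the study identifiers are hypothetical, and the actual splits are whatever the released code and data define.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def split_by_center(studies: Sequence[Tuple[str, str]],
                    fractions: Tuple[float, float, float] = (0.7, 0.1, 0.2),
                    seed: int = 0) -> Dict[str, List[str]]:
    """Split (study_id, center) pairs so every center is spread across all splits.

    fractions = (train, val, test); the test split receives the remainder per center.
    """
    by_center: Dict[str, List[str]] = defaultdict(list)
    for study_id, center in studies:
        by_center[center].append(study_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for center in sorted(by_center):
        ids = sorted(by_center[center])
        rng.shuffle(ids)
        n_train = int(len(ids) * fractions[0])
        n_val = int(len(ids) * fractions[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits


# Toy data: two hypothetical centers with ten studies each.
toy = [(f"A{i:02d}", "center_A") for i in range(10)] + \
      [(f"B{i:02d}", "center_B") for i in range(10)]
result = split_by_center(toy)
print({name: len(ids) for name, ids in result.items()})
```

A stricter alternative is to hold out entire centers for testing, which measures generalisation to unseen sites rather than balancing sites across splits. The two designs answer different questions, so a benchmark should state which one it uses.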

standing simulated objections not resolved

Inter-annotator agreement scores and correlation with surgical findings were never collected, so numerical values for these cannot be supplied.

Circularity Check

0 steps flagged

No circularity: benchmark paper with no derivations or fitted predictions

The manuscript introduces MeniOmni as a dataset and benchmark supporting two tasks (Stoller grading and report generation) together with baseline experiments and new metrics. No equations, parameter fits, uniqueness theorems, or predictions appear anywhere in the provided text. All reported results are direct empirical measurements on the released 746-study collection; the central claim that Clinical Priors improve performance is therefore an observable outcome rather than a quantity derived from itself. No self-citations are invoked to justify any load-bearing step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark creation paper; no free parameters, axioms, or invented entities are introduced beyond standard practices in dataset curation and evaluation metric design.

