EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Ma, F · 2024 · arXiv 2408.11795

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

cs.CV · 2025-02-06 · unverdicted · novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

The Hidden Evolution of Disguised Visual Context inside the VLM

cs.CV · 2026-06-18 · unverdicted · novelty 5.0

Visual tokens enter VLMs as raw signals and are reshaped differently by in-context versus layer-injection paradigms, each capturing distinct frequency characteristics that drive task performance.

citing papers explorer

Showing 2 of 2 citing papers.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
cs.CV · 2025-02-06 · unverdicted · none · ref 47
The Hidden Evolution of Disguised Visual Context inside the VLM
cs.CV · 2026-06-18 · unverdicted · none · ref 49
