Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.
Embodied-reasoner: Synergizing visual search, reasoning, and action for embodied interactive tasks
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
The Estimate-Verify-Update (EVU) mechanism reduces belief inertia in embodied agents and raises task success rates on three benchmarks.
RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.
CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
citing papers explorer
-
Why MLLMs Struggle to Determine Object Orientations
Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.
-
Seeing Isn't Believing: Mitigating Belief Inertia via Active Intervention in Embodied Agents
The Estimate-Verify-Update (EVU) mechanism reduces belief inertia in embodied agents and raises task success rates on three benchmarks.
-
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.
-
Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations
CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.
-
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.