Planning with Reasoning using Vision Language World Model
Abstract
The Vision Language World Model (VLWM) achieves state-of-the-art performance in visual planning by integrating language-based world modeling, action policy learning, and dynamics modeling with semantic and temporal abstraction.
Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievements, then predicts a trajectory composed of interleaved actions and world state changes. These targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented as a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model trained in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where system-2 improves the Elo score by +27% over system-1. The VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
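To make the system-2 procedure described in the abstract concrete, here is a minimal Python sketch of planning via cost minimization: sample several candidate roll-outs from the world model, score each predicted end state with a critic that measures semantic distance to the goal state, and keep the lowest-cost plan. All names here (`Plan`, `system2_plan`, `rollout`, `cost`, and the toy stubs) are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of system-2 planning via cost minimization.
# The rollout and cost callables stand in for the VLWM roll-out and the
# learned critic; their names and signatures are assumptions, not the paper's API.

from dataclasses import dataclass
from typing import Callable, List, Tuple
import random


@dataclass
class Plan:
    actions: List[str]        # interleaved action descriptions
    predicted_state: str      # world state predicted at the end of the roll-out


def system2_plan(
    goal: str,
    rollout: Callable[[str, int], Plan],   # world-model roll-out: (goal, seed) -> candidate plan
    cost: Callable[[str, str], float],     # critic: semantic distance(predicted state, goal state)
    num_candidates: int = 8,
) -> Tuple[Plan, float]:
    """Sample candidate plans and keep the one whose predicted final state
    minimizes the critic's semantic distance to the goal state."""
    best_plan, best_cost = None, float("inf")
    for seed in range(num_candidates):
        candidate = rollout(goal, seed)            # system-1 style roll-out
        c = cost(candidate.predicted_state, goal)  # critic-evaluated cost
        if c < best_cost:
            best_plan, best_cost = candidate, c
    return best_plan, best_cost


if __name__ == "__main__":
    # Toy stubs so the sketch runs end-to-end; real roll-outs and costs
    # would come from the trained world model and critic.
    states = ["onions chopped", "pan heated", "soup served"]

    def toy_rollout(goal: str, seed: int) -> Plan:
        rng = random.Random(seed)
        state = rng.choice(states)
        return Plan(actions=[f"step towards {state}"], predicted_state=state)

    def toy_cost(state: str, goal: str) -> float:
        # Crude stand-in for semantic distance (lower = closer to the goal).
        return 0.0 if state == goal else 1.0

    plan, c = system2_plan("soup served", toy_rollout, toy_cost)
    print(plan, c)
```

In this framing, system-1 corresponds to decoding a single roll-out directly, while system-2 spends extra compute searching over candidate futures and deferring to the critic's cost to choose among them.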
Community
Seems like, the way we are heading right now, levelling up to world models or world state models is the goal for us humans...
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MolmoAct: Action Reasoning Models that can Reason in Space (2025)
- OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving (2025)
- ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving (2025)
- GoViG: Goal-Conditioned Visual Navigation Instruction Generation (2025)
- Ego-centric Predictive Model Conditioned on Hand Trajectories (2025)
- Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction (2025)
- EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control (2025)