Robix: A Unified Model for Robot Interaction, Reasoning and Planning
Abstract
Robix is a unified vision-language model that integrates robot reasoning, task planning, and natural language interaction; through chain-of-thought reasoning and a three-stage training strategy, it delivers superior performance in interactive task execution.
We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities, including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, showing strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.
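To make the hierarchical setup described above concrete, the following is a minimal Python sketch of the interaction loop, assuming a high-level planner that alternates between issuing atomic commands and replying to the user. All class and method names (HighLevelPlanner, LowLevelController, step, execute, run_episode) are hypothetical illustrations, not the released Robix interface.

```python
# Minimal sketch of the hierarchical interaction loop described in the abstract:
# a high-level planner (the Robix role) alternates between emitting atomic
# commands for a low-level controller and verbal replies to the user.
# All names below are illustrative assumptions, not the actual Robix API.

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class AtomicCommand:
    action: str   # e.g. "pick up the red cup and place it in the bin"


@dataclass
class VerbalResponse:
    text: str     # natural-language reply, clarification, or confirmation


class HighLevelPlanner:
    """Stand-in for the Robix VLM: reasons over the current image, the
    dialogue history, and task memory, then decides what to do next."""

    def step(self, image, dialogue, memory) -> Union[AtomicCommand, VerbalResponse]:
        raise NotImplementedError   # backed by the vision-language model


class LowLevelController:
    """Stand-in for the low-level controller (human teleoperation or a VLA model)."""

    def execute(self, command: AtomicCommand) -> str:
        raise NotImplementedError   # returns an execution outcome, e.g. "success"


def run_episode(planner: HighLevelPlanner,
                controller: LowLevelController,
                get_image: Callable[[], object],
                get_user_utterance: Callable[[], Optional[str]],
                max_steps: int = 50) -> List[Tuple[str, str]]:
    """Drive the closed loop: observe, fold in user speech, plan, then act or speak."""
    dialogue: List[Tuple[str, str]] = []
    memory: dict = {}
    for _ in range(max_steps):
        utterance = get_user_utterance()
        if utterance is not None:
            dialogue.append(("user", utterance))

        output = planner.step(get_image(), dialogue, memory)

        if isinstance(output, VerbalResponse):
            dialogue.append(("robot", output.text))              # talk to the user
        else:
            memory[output.action] = controller.execute(output)   # act in the world
    return dialogue
```

In a real deployment, `planner.step` would be backed by the Robix VLM and `controller.execute` by teleoperation or a VLA policy, as described in the abstract.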
Community
Video DEMO: https://www.youtube.com/embed/-uEDN31Ne_Y
The main features of Robix are summarized as follows:
🌟 Unified model. Robix is a single vision-language model that unifies robot reasoning, task planning, and human-robot interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally in an end-to-end manner.
🌟 Flexible interaction. Within this unified framework, Robix supports proactive dialogue to clarify ambiguity and infer user intent, real-time interruption handling that seamlessly incorporates user feedback, and context-aware commonsense reasoning for complex, open-ended tasks (a minimal sketch of the interruption-handling loop follows this list).
🌟 Robust performance. We assess Robix in two setups: (i) a curated interactive-task benchmark covering both in- and out-of-distribution scenarios with diverse instruction types, and (ii) five real-world scenarios in a hierarchical robot system with both human teleoperation and an autonomous vision-language-action (VLA) model as the low-level controller. These evaluations show that Robix consistently delivers strong performance across all settings.
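As a companion to the interaction-loop sketch above, the snippet below illustrates one way real-time interruption handling could be wired, assuming a simple queue of user utterances and a placeholder replan() helper; neither reflects Robix's actual mechanism.

```python
# Hypothetical illustration of real-time interruption handling: user speech
# arriving mid-task preempts the remaining plan and triggers a replan.
# The queue-based interface and the replan() placeholder are assumptions
# made for this sketch, not Robix's actual implementation.

import queue
from collections import deque


def replan(current_plan: deque, utterance: str) -> deque:
    """Placeholder: a real system would re-run the VLM with the new
    utterance in context and return an updated list of atomic commands;
    it would also drop steps that conflict with the request."""
    if "stop" in utterance.lower():
        return deque()                                 # user cancelled the task
    updated = deque([f"handle user request: {utterance}"])
    updated.extend(current_plan)                       # keep the rest of the plan
    return updated


def execute_with_interruptions(plan, interruptions: "queue.Queue[str]") -> None:
    pending = deque(plan)
    while pending:
        try:                                           # check for feedback first
            utterance = interruptions.get_nowait()
        except queue.Empty:
            utterance = None
        if utterance is not None:
            pending = replan(pending, utterance)       # fold the feedback in
            continue
        command = pending.popleft()
        print(f"executing: {command}")                 # low-level controller call


# Usage: the user interrupts while the robot is bussing the table.
feedback: "queue.Queue[str]" = queue.Queue()
feedback.put("leave the mug, I'm still using it")
execute_with_interruptions(["clear the plate", "clear the mug", "wipe the table"], feedback)
```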

Is it a VLA or a VLM?
Also, why is it being compared to GPT-4o and Gemini 2.5 Pro instead of robot-specific models?
Robix is a vision-language model (VLM) designed for unified robotic task planning and natural human interaction. In our experiments, we compare it against recent embodied models such as Cosmos-Reason1 and RoboBrain-2.0. Since our focus is on modeling complex interactive processes, there are currently no other open-source models that serve as suitable baselines. Our results show, however, that large commercial models such as Gemini 2.5 Pro and GPT-4o are stronger than these embodied baselines at capturing complex multimodal interactions, which makes them the more competitive references.
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control (2025)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025)
- MolmoAct: Action Reasoning Models that can Reason in Space (2025)
- Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey (2025)
- Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning (2025)
- ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models (2025)
- Foundation Model Driven Robotics: A Comprehensive Review (2025)