# WALL-OSS

[![Paper](https://img.shields.io/badge/📄%20Paper-PDF-EA1B22?style=for-the-badge&logo=adobeacrobatreader&logoColor=fff)](https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf)    [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-x--square--robot-FFB000?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/x-square-robot)    [![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=fff)](https://github.com/X-Square-Robot/wall-x)    [![Project Page](https://img.shields.io/badge/Project-1E90FF?style=for-the-badge&logo=google-chrome&logoColor=fff)](https://x2robot.com/en/research/68bc2cde8497d7f238dde690)
## WALL-OSS: Igniting VLMs toward the Embodied Space

We introduce **WALL-OSS**, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability. Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enables Unified Cross-Level CoT, seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework. Our results show that WALL-OSS attains high success rates on complex long-horizon manipulation tasks, demonstrates strong instruction following, complex understanding, and reasoning, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.

## 🎬 Video Demos

*WALL-OSS in action: demonstrating advanced manipulation capabilities and embodied AI performance.*
## 🚀 Quick Start

### Installation

```bash
# Create conda environment
conda create --name wallx python=3.10
conda activate wallx

# Install base requirements
pip install torch torchvision transformers
pip install huggingface_hub

# Install Wall-X from GitHub
git clone https://github.com/X-Square-Robot/wall-x.git
cd wall-x
pip install -e .
```

### Basic Usage

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load the model
model_path = "X-Square-Robot/wall-oss-flow"  # or your local path
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Configuration
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Your inference code here...
```

## 🎯 Supervised Fine-Tuning (SFT)

For training Wall-X on your robotics datasets, please refer to our comprehensive training guide:

**📖 [Training Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/workspace/README.md)**

The training process includes:

- **Dataset Preparation**: How to prepare your robotics datasets in LeRobot format (see the sketch after this list)
- **Configuration Setup**: Detailed configuration for GPU setup, model paths, and robot DOF settings
- **Training Scripts**: Ready-to-use training scripts with proper hyperparameters
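Before kicking off a run, it can help to confirm that your dataset loads cleanly in LeRobot format. Below is a minimal sketch, assuming the `lerobot` package is installed; the import path and the `lerobot/pusht` repo id are placeholders that may differ across lerobot versions and for your own dataset.

```python
# Sanity-check a LeRobot-format dataset before training.
# NOTE: the import path below may vary with your lerobot version,
# and "lerobot/pusht" is a placeholder repo id.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/pusht")
print(f"Total frames: {len(dataset)}")

# Each item is a dict of tensors (camera frames, proprioceptive state, actions).
sample = dataset[0]
for key, value in sample.items():
    print(key, getattr(value, "shape", type(value)))
```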
### Quick Training Start

```bash
# Run training (see workspace/README.md for detailed configuration)
bash ./workspace/lerobot_example/run.sh
```

## 🔮 Inference

For detailed inference examples and model evaluation:

**📖 [Inference Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/scripts/)**

### Basic Inference Example

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load model
model_path = "X-Square-Robot/wall-x"
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Setup
batch_size = 1
seq_length = 50
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Prepare inputs (example with synthetic data)
torch.manual_seed(0)
input_ids = torch.randint(0, len(model.processor.tokenizer), (batch_size, seq_length), dtype=torch.long)
attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
moe_token_types = torch.zeros((batch_size, seq_length), dtype=torch.long)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0).expand(batch_size, -1)

# Robotics-specific inputs
proprioception = torch.randn((batch_size, 1, 20), dtype=torch.float32)  # Joint states
agent_pos_mask = torch.ones((batch_size, 1, 20), dtype=torch.float32)
dof_mask = torch.ones((batch_size, 32, 20), dtype=torch.float32)  # DOF mask
dataset_names = ["x2_normal"]

# Move to device
inputs = {
    "input_ids": input_ids.to(device),
    "attention_mask": attention_mask.to(device),
    "moe_token_types": moe_token_types.to(device),
    "position_ids": position_ids.to(device),
    "proprioception": proprioception.to(device).bfloat16(),
    "agent_pos_mask": agent_pos_mask.to(device).bfloat16(),
    "dof_mask": dof_mask.to(device).bfloat16(),
    "dataset_names": dataset_names,
    "mode": "validate",
}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

print(f"Output logits shape: {outputs.logits.shape}")
```
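The synthetic inputs above use a padded state width of 20 and an action-chunk length of 32 (the second dimension of `dof_mask`). If your robot has fewer degrees of freedom than the padded width, one plausible pattern is to zero-pad the proprioceptive state and mark only the valid slots in the masks. The sketch below illustrates that idea and is an assumption, not the documented convention; consult the training documentation and your dataset configuration for the exact mask semantics.

```python
import torch

# Hypothetical robot: 14 real DOFs (e.g., two 7-DOF arms), padded to width 20.
NUM_DOF = 14
PAD_WIDTH = 20
CHUNK_LEN = 32  # matches dof_mask's second dimension in the example above
batch_size = 1

# Zero-pad the proprioceptive state to the padded width.
raw_state = torch.randn(batch_size, 1, NUM_DOF)  # stand-in for real joint states
proprioception = torch.zeros(batch_size, 1, PAD_WIDTH)
proprioception[..., :NUM_DOF] = raw_state

# Assumption: 1.0 marks a valid DOF slot, 0.0 marks padding.
agent_pos_mask = torch.zeros(batch_size, 1, PAD_WIDTH)
agent_pos_mask[..., :NUM_DOF] = 1.0

dof_mask = torch.zeros(batch_size, CHUNK_LEN, PAD_WIDTH)
dof_mask[..., :NUM_DOF] = 1.0
```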
### Advanced Inference Scripts

For production-ready inference and evaluation scripts:

```bash
# Basic inference test
python ./scripts/fake_inference.py

# Generate open-loop comparison plots
python ./scripts/draw_openloop_plot.py
```

**📁 [View all inference scripts](https://github.com/X-Square-Robot/wall-x/tree/main/scripts)**

## 📚 Complete Documentation

For comprehensive setup, training, and inference instructions:

### 🚀 **[Visit our GitHub Repository](https://github.com/X-Square-Robot/wall-x)**

The repository contains:
- **Detailed Installation Guide**: Complete environment setup with all dependencies
- **Training Tutorials**: Step-by-step SFT process with LeRobot datasets
- **Inference Examples**: Multiple inference scripts and evaluation tools
- **Configuration Templates**: Ready-to-use configs for different robot setups
- **Troubleshooting Guide**: Common issues and solutions

## 📄 Cite Us

If you find WALL-OSS models useful, please cite:

```bibtex
@misc{walloss_paper_2025,
  title        = {WALL-OSS: Igniting VLMs toward the Embodied Space},
  author       = {X Square Robot},
  year         = {2025},
  howpublished = {\url{https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf}},
  note         = {White paper}
}
```