# WALL-OSS

[![Paper](https://img.shields.io/badge/📄%20Paper-PDF-EA1B22?style=for-the-badge&logo=adobeacrobatreader&logoColor=fff)](https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf)    [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-x--square--robot-FFB000?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/x-square-robot)    [![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=fff)](https://github.com/X-Square-Robot/wall-x)    [![Project Page](https://img.shields.io/badge/Project-1E90FF?style=for-the-badge&logo=google-chrome&logoColor=fff)](https://x2robot.com/en/research/68bc2cde8497d7f238dde690)
## WALL-OSS: Igniting VLMs toward the Embodied Space

We introduce **WALL-OSS**, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability. Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enables Unified Cross-Level CoT, seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework. Our results show that WALL-OSS attains high success rates on complex long-horizon manipulation tasks, demonstrates strong instruction following, complex understanding, and reasoning, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.

## 🎬 Video Demos

*WALL-OSS in action: demonstrating advanced manipulation capabilities and embodied AI performance.*
## 🚀 Quick Start

### Installation

```bash
# Create conda environment
conda create --name wallx python=3.10
conda activate wallx

# Install base requirements
pip install torch torchvision transformers
pip install huggingface_hub

# Install Wall-X from GitHub
git clone https://github.com/X-Square-Robot/wall-x.git
cd wall-x
pip install -e .
```

### Basic Usage

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load the model
model_path = "X-Square-Robot/wall-oss-flow"  # or your local path
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Configuration
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Your inference code here...
```

## 🎯 Supervised Fine-Tuning (SFT)

For training Wall-X on your robotics datasets, please refer to our comprehensive training guide:

**📖 [Training Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/workspace/README.md)**

The training process includes:

- **Dataset Preparation**: How to prepare your robotics datasets in LeRobot format (see the sketch after this list)
- **Configuration Setup**: Detailed configuration for GPU setup, model paths, and robot DOF settings
- **Training Scripts**: Ready-to-use training scripts with proper hyperparameters
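Before kicking off a run, it can help to confirm that your dataset loads cleanly in LeRobot format. Below is a minimal sketch, assuming the `lerobot` package is installed; the import path and the `lerobot/pusht` repo id are placeholders that may differ across lerobot versions and for your own dataset.

```python
# Sanity-check a LeRobot-format dataset before training.
# NOTE: the import path below may vary with your lerobot version,
# and "lerobot/pusht" is a placeholder repo id.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/pusht")
print(f"Total frames: {len(dataset)}")

# Each item is a dict of tensors (camera frames, proprioceptive state, actions).
sample = dataset[0]
for key, value in sample.items():
    print(key, getattr(value, "shape", type(value)))
```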
### Quick Training Start

```bash
# Run training (see workspace/README.md for detailed configuration)
bash ./workspace/lerobot_example/run.sh
```

## 🔮 Inference

For detailed inference examples and model evaluation:

**📖 [Inference Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/scripts/)**

### Basic Inference Example

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load model
model_path = "X-Square-Robot/wall-x"
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Setup
batch_size = 1
seq_length = 50
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Prepare inputs (example with synthetic data)
torch.manual_seed(0)
input_ids = torch.randint(0, len(model.processor.tokenizer), (batch_size, seq_length), dtype=torch.long)
attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
moe_token_types = torch.zeros((batch_size, seq_length), dtype=torch.long)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0).expand(batch_size, -1)

# Robotics-specific inputs
proprioception = torch.randn((batch_size, 1, 20), dtype=torch.float32)  # Joint states
agent_pos_mask = torch.ones((batch_size, 1, 20), dtype=torch.float32)
dof_mask = torch.ones((batch_size, 32, 20), dtype=torch.float32)  # DOF mask
dataset_names = ["x2_normal"]

# Move to device
inputs = {
    "input_ids": input_ids.to(device),
    "attention_mask": attention_mask.to(device),
    "moe_token_types": moe_token_types.to(device),
    "position_ids": position_ids.to(device),
    "proprioception": proprioception.to(device).bfloat16(),
    "agent_pos_mask": agent_pos_mask.to(device).bfloat16(),
    "dof_mask": dof_mask.to(device).bfloat16(),
    "dataset_names": dataset_names,
    "mode": "validate",
}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

print(f"Output logits shape: {outputs.logits.shape}")
```
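The synthetic inputs above use a padded state width of 20 and an action-chunk length of 32 (the second dimension of `dof_mask`). If your robot has fewer degrees of freedom than the padded width, one plausible pattern is to zero-pad the proprioceptive state and mark only the valid slots in the masks. The sketch below illustrates that idea and is an assumption, not the documented convention; consult the training documentation and your dataset configuration for the exact mask semantics.

```python
import torch

# Hypothetical robot: 14 real DOFs (e.g., two 7-DOF arms), padded to width 20.
NUM_DOF = 14
PAD_WIDTH = 20
CHUNK_LEN = 32  # matches dof_mask's second dimension in the example above
batch_size = 1

# Zero-pad the proprioceptive state to the padded width.
raw_state = torch.randn(batch_size, 1, NUM_DOF)  # stand-in for real joint states
proprioception = torch.zeros(batch_size, 1, PAD_WIDTH)
proprioception[..., :NUM_DOF] = raw_state

# Assumption: 1.0 marks a valid DOF slot, 0.0 marks padding.
agent_pos_mask = torch.zeros(batch_size, 1, PAD_WIDTH)
agent_pos_mask[..., :NUM_DOF] = 1.0

dof_mask = torch.zeros(batch_size, CHUNK_LEN, PAD_WIDTH)
dof_mask[..., :NUM_DOF] = 1.0
```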
### Advanced Inference Scripts

For production-ready inference and evaluation scripts:

```bash
# Basic inference test
python ./scripts/fake_inference.py

# Generate open-loop comparison plots
python ./scripts/draw_openloop_plot.py
```

**📁 [View all inference scripts](https://github.com/X-Square-Robot/wall-x/tree/main/scripts)**

## 📚 Complete Documentation

For comprehensive setup, training, and inference instructions:

### 🚀 **[Visit our GitHub Repository](https://github.com/X-Square-Robot/wall-x)**

The repository contains:
- **Detailed Installation Guide**: Complete environment setup with all dependencies
- **Training Tutorials**: Step-by-step SFT process with LeRobot datasets
- **Inference Examples**: Multiple inference scripts and evaluation tools
- **Configuration Templates**: Ready-to-use configs for different robot setups
- **Troubleshooting Guide**: Common issues and solutions

## 📄 Cite Us

If you find WALL-OSS models useful, please cite:

```bibtex
@misc{walloss_paper_2025,
  title        = {WALL-OSS: Igniting VLMs toward the Embodied Space},
  author       = {X Square Robot},
  year         = {2025},
  howpublished = {\url{https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf}},
  note         = {White paper}
}
```