Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation
This repository contains trained Colonel Blotto agents developed for the NeurIPS 2025 MindGames Workshop.
The system integrates a compact graph-based reinforcement learning policy with LLM-guided preference learning and distillation, enabling improved strategic adaptation without increasing policy capacity.
Overview
The approach combines:
- Graph Attention Networks for structured game-state encoding
- Proximal Policy Optimization (PPO) as the core learning algorithm
- FiLM-based opponent adaptation for fast response to opponent behavior
- Rollout-grounded preference learning using two large language models
- Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for teacher alignment
- Knowledge distillation from the aligned teacher into an efficient policy
The goal is not to replace RL with language models, but to inject strategic priors learned by LLMs back into a lightweight, fast policy suitable for competitive play.
Game Configuration
- Game: Colonel Blotto
- Battlefields: 3
- Units per round: 20
- Rounds per game: 5
- Action space size: 231 valid allocations (see the enumeration sketch after this list)
- Evaluation protocol: Fixed scripted and adaptive opponent pool
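As a sanity check on the action-space size, the sketch below enumerates every ordered split of 20 units across 3 battlefields; the count is the stars-and-bars value C(22, 2) = 231. The function name is illustrative and not part of the repository.

```python
def enumerate_allocations(units: int = 20) -> list[tuple[int, int, int]]:
    """All ordered ways to split `units` across 3 battlefields."""
    return [
        (a, b, units - a - b)
        for a in range(units + 1)
        for b in range(units - a + 1)
    ]

actions = enumerate_allocations(20)
print(len(actions))  # 231 == C(22, 2): stars-and-bars for 20 units over 3 fields
```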
Policy Architecture
Graph-Based State Encoder
- Heterogeneous graph with 25–40 nodes
- Node types include:
  - Battlefield nodes
  - Recent round summary nodes
  - Global state node
- Node feature dimension: 32
- Encoder:
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192
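The repository's encoder code is not reproduced here, but the following minimal sketch shows how the listed sizes fit together using PyTorch Geometric's `GATConv` (an assumption; the actual implementation may use a different graph library and handles heterogeneous node types, which this homogeneous sketch ignores). Six heads of 32 dimensions each concatenate to the 192-dimensional hidden size.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv  # assumes PyTorch Geometric is available

class BlottoGraphEncoder(nn.Module):
    """Illustrative 3-layer GAT encoder matching the listed sizes:
    32-dim node features, 6 heads x 32 dims per head = 192-dim hidden."""

    def __init__(self, in_dim: int = 32, hidden: int = 192, heads: int = 6):
        super().__init__()
        per_head = hidden // heads  # 32 dims per attention head
        self.layers = nn.ModuleList([
            GATConv(in_dim, per_head, heads=heads),   # 32 -> 192
            GATConv(hidden, per_head, heads=heads),   # 192 -> 192
            GATConv(hidden, per_head, heads=heads),   # 192 -> 192
        ])

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, 32] node features, edge_index: [2, num_edges]
        for layer in self.layers:
            x = torch.relu(layer(x, edge_index))
        return x  # [num_nodes, 192] node embeddings
```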
Opponent Modeling and Adaptation
- Opponent history encoded via a lightweight MLP
- FiLM adaptation layers modulate policy activations based on opponent embedding
- Enables rapid adjustment to non-stationary strategies
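A minimal sketch of the FiLM mechanism described above, assuming a 64-dimensional opponent embedding and 192-dimensional policy features (both dimensions are illustrative):

```python
import torch
import torch.nn as nn

class FiLMAdapter(nn.Module):
    """FiLM: the opponent embedding produces per-channel scale (gamma)
    and shift (beta) that modulate intermediate policy activations."""

    def __init__(self, opp_dim: int = 64, feat_dim: int = 192):
        super().__init__()
        self.to_gamma_beta = nn.Linear(opp_dim, 2 * feat_dim)

    def forward(self, features: torch.Tensor, opp_embedding: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(opp_embedding).chunk(2, dim=-1)
        # (1 + gamma) keeps the modulation close to identity early in training
        return (1.0 + gamma) * features + beta
```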
Action Head
- Portfolio-based action head with 6 latent strategies
- Strategies mixed via learned attention
- Total policy parameters: ~6.8M
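The sketch below illustrates one way a portfolio head with 6 latent strategies and attention-based mixing can be wired over the 231-way action space; the class and layer choices are assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn

class PortfolioActionHead(nn.Module):
    """Each latent strategy scores all allocations; a learned attention over
    strategies mixes the per-strategy logits into a single action distribution."""

    def __init__(self, feat_dim: int = 192, num_strategies: int = 6, num_actions: int = 231):
        super().__init__()
        self.strategy_heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_actions) for _ in range(num_strategies)]
        )
        self.mixer = nn.Linear(feat_dim, num_strategies)  # attention weights over strategies

    def forward(self, state_embedding: torch.Tensor) -> torch.Tensor:
        # state_embedding: [batch, feat_dim]
        weights = torch.softmax(self.mixer(state_embedding), dim=-1)              # [B, K]
        per_strategy = torch.stack(
            [head(state_embedding) for head in self.strategy_heads], dim=1        # [B, K, A]
        )
        return (weights.unsqueeze(-1) * per_strategy).sum(dim=1)                  # [B, A] logits
```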
Training Pipeline
Training follows a multi-stage curriculum:
Graph PPO Pretraining
- PPO with clip ratio 0.2
- Discount factor γ = 0.99
- GAE λ = 0.95
- Trained against a diverse scripted opponent pool
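For reference, a minimal sketch of GAE and the clipped PPO surrogate using the hyperparameters listed above (single trajectory, no terminal-state handling, for brevity):

```python
import torch

GAMMA, LAM, CLIP = 0.99, 0.95, 0.2  # hyperparameters listed above

def gae_advantages(rewards, values, last_value, gamma=GAMMA, lam=LAM):
    """Generalized Advantage Estimation over a single trajectory."""
    values = list(values) + [last_value]
    advantages, running = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages.insert(0, running)
    return torch.tensor(advantages)

def ppo_clip_loss(new_logp, old_logp, advantages, clip=CLIP):
    """Clipped surrogate objective (negated so it can be minimized)."""
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```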
Preference Generation via Rollouts
- ~800 intermediate states sampled
- Candidate actions proposed by:
  - Llama 3.1 Instruct
  - Qwen 2.5 Instruct
- Each proposal evaluated with 4 stochastic rollouts
- Higher-return actions labeled preferred
- ~2,300 preference pairs generated
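A minimal sketch of the rollout-grounded labeling step, assuming a hypothetical `rollout_fn(state, action)` hook that plays out the game from `state` and returns a scalar return:

```python
import statistics

def label_preference(state, proposals, rollout_fn, n_rollouts: int = 4):
    """Score each LLM-proposed allocation by its mean return over stochastic
    rollouts from `state`; the best becomes `chosen`, the worst `rejected`."""
    scores = {
        action: statistics.mean(rollout_fn(state, action) for _ in range(n_rollouts))
        for action in proposals
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {"state": state, "chosen": ranked[0], "rejected": ranked[-1]}
```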
Teacher Alignment
- Supervised Fine Tuning on chosen actions
- Direct Preference Optimization using frozen reference model
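The DPO objective used for teacher alignment can be summarized as below; `beta` and the per-sequence log-probability inputs are assumptions about the setup (in practice a library such as TRL can compute these quantities):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """DPO: increase the policy's preference margin for the chosen completion
    relative to the frozen reference model's margin."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    reference_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()
```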
Policy Distillation
- Aligned teacher generates state-to-action labels
- Graph policy trained via cross-entropy imitation
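A sketch of one imitation step, assuming the graph policy's forward pass returns logits over the 231 actions and that teacher labels are stored as action indices (both assumptions):

```python
import torch.nn.functional as F

def distillation_step(graph_policy, optimizer, batch):
    """One imitation step: fit the graph policy's action logits to the
    aligned teacher's action labels with cross-entropy."""
    logits = graph_policy(batch["graph"])                      # assumed: [batch, 231] logits
    loss = F.cross_entropy(logits, batch["teacher_actions"])   # teacher action indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```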
Final PPO Refinement
- PPO resumes using environment rewards
- Stabilizes behavior after distillation
Evaluation Results
Evaluation uses 1,000 games against a mixture of scripted and adaptive opponents.
| Agent | Win Rate | Risk Metric |
|---|---|---|
| PPO only | 58.4% ± 2.1 | Allocation collapse 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse 8.8% |
| Full curriculum | 78.4% | Exploitability proxy 0.48 |
- Allocation collapse: fraction of rounds placing >60% units on one field
- Distillation yields a +9.5-point win-rate gain over the PPO-only baseline
- The full curriculum yields a +20-point gain with reduced over-specialization
These improvements arise from risk calibration and opponent-aware adaptation, not brute-force exploitation.
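A minimal sketch of the allocation-collapse metric as defined above (the data layout, a list of games each holding per-round allocation tuples, is an assumption):

```python
def allocation_collapse_rate(games, threshold: float = 0.6):
    """Fraction of rounds in which one battlefield received more than
    `threshold` of that round's units."""
    rounds = [alloc for game in games for alloc in game]
    collapsed = sum(max(alloc) > threshold * sum(alloc) for alloc in rounds)
    return collapsed / len(rounds)

# (14, 3, 3) puts 70% of 20 units on one field and counts as a collapse
print(allocation_collapse_rate([[(14, 3, 3), (7, 7, 6)]]))  # 0.5
```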
Repository Contents
Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
- `policy_models/policy_final.pt`
LLM Teacher Models
- `sft_model/` – supervised fine-tuned model
- `dpo_model/` – preference-aligned model
Configuration and Logs
- `master_config.json` – training configuration
- `battleground_eval.json` – evaluation summaries
Usage
### Load Policy
```python
import torch
from policy import GraphPolicy

policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
policy.eval()
```
### Loading Fine-tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load SFT or DPO model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")
# Use for inference
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
```
🎓 Research Context
This work targets the NeurIPS 2025 MindGames Workshop and centers on three claims:
- Language models function effectively as strategic prior generators when grounded by rollouts
- Graph-based representations enable cross-strategy generalization under compact policies
- Distillation transfers high-level reasoning into fast, deployable agents
Key Innovations
- Heterogeneous Graph Representation: Novel graph structure for Blotto game states
- Ground-truth Counterfactual Learning: Exploiting game determinism
- Multi-scale Representation: Field-level, round-level, and game-level embeddings
- LLM-to-RL Distillation: Transferring strategic reasoning to efficient policies
📄 License
MIT License - See LICENSE file for details
🙏 Acknowledgments
- Built for NeurIPS 2025 MindGames Workshop
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU