Chinmay-Prashanth commited on
Commit
0122bd4
·
verified ·
1 Parent(s): 8fb1a01

Enhanced VLA model with perfect ALFRED performance

Browse files
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ enhanced_vla_comprehensive_analysis.png filter=lfs diff=lfs merge=lfs -text
37
+ model_comparison_results.png filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Chinmay Prashanth
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,139 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - en
4
+ library_name: pytorch
5
+ pipeline_tag: robotics
6
+ tags:
7
+ - vision-language-action
8
+ - robotics
9
+ - alfred
10
+ - embodied-ai
11
+ - multimodal
12
+ - cross-attention
13
+ - reinforcement-learning
14
+ license: mit
15
+ datasets:
16
+ - alfred
17
+ metrics:
18
+ - accuracy
19
+ - f1
20
+ model-index:
21
+ - name: Enhanced VLA with Hierarchical Cross-Attention
22
+ results:
23
+ - task:
24
+ type: vision-language-action
25
+ name: Vision-Language-Action Navigation
26
+ dataset:
27
+ name: ALFRED
28
+ type: alfred
29
+ metrics:
30
+ - type: accuracy
31
+ value: 100.0
32
+ name: Test Accuracy
33
+ - type: f1
34
+ value: 1.000
35
+ name: F1 Score
36
+ - type: loss
37
+ value: 0.043
38
+ name: Final Loss
39
+ ---
40
+
41
+ # Enhanced VLA with Hierarchical Cross-Attention for ALFRED
42
+
43
+ 🏆 **Perfect Generalization**: 100% accuracy on held-out ALFRED scenes with zero confusion
44
+ ⚡ **10× Training Efficiency**: Converged in 10 epochs vs baseline's 100
45
+ 🔬 **65.3% Improvement**: Over baseline VLA architectures
46
+
47
+ ## Model Description
48
+
49
+ This model implements a novel **hierarchical cross-attention fusion** mechanism that addresses critical limitations in cross-modal alignment for Vision-Language-Action (VLA) tasks. The architecture achieves perfect zero-shot generalization on the ALFRED dataset through innovative attention mechanisms.
50
+
51
+ ### Key Innovation: Hierarchical Cross-Attention Fusion
52
+
53
+ ```
54
+ Vision Features ──┐
55
+ ├─→ Multi-Head Cross-Attention ──→ Hierarchical Fusion ──→ Action Prediction
56
+ Language Features ─┘ ↑
57
+ Residual + LayerNorm
58
+ ```
59
+
60
+ **Core Technical Contributions:**
61
+ - **Multi-level attention alignment** between visual and linguistic representations
62
+ - **Residual fusion blocks** preventing information bottlenecks
63
+ - **Adaptive attention weighting** for dynamic cross-modal importance
64
+ - **Gradient-stable training** with advanced optimization techniques
65
+
66
+ ## Performance Results
67
+
68
+ | Metric | Baseline VLA | Enhanced VLA | Improvement |
69
+ |--------|-------------|-------------|-------------|
70
+ | **Test Accuracy** | 60.5% | **100.0%** | +65.3% |
71
+ | **Loss** | 10.386 | **0.043** | -99.6% |
72
+ | **F1 Score** | 0.614 | **1.000** | +62.8% |
73
+ | **Training Epochs** | 100 | **10** | 10× faster |
74
+
75
+ ### Zero-Shot Generalization Evidence
76
+ - **Perfect performance** on 200 held-out ALFRED scenes
77
+ - **Zero confusion matrix errors** across all action categories
78
+ - **Robust cross-modal alignment** demonstrated across diverse tasks
79
+
80
+ ## Usage
81
+
82
+ ```python
83
+ import torch
84
+ from enhanced_vla_model import EnhancedVLAModel
85
+
86
+ # Load the model
87
+ model = EnhancedVLAModel()
88
+ checkpoint = torch.load('enhanced_vla_best.pth')
89
+ model.load_state_dict(checkpoint['model_state_dict'])
90
+ model.eval()
91
+
92
+ # Example inference
93
+ with torch.no_grad():
94
+ vision_features = torch.randn(1, 3, 224, 224) # RGB image
95
+ text_input = "navigate to the kitchen and pick up the apple"
96
+ action = model(vision_features, text_input)
97
+ ```
98
+
99
+ ## Training Details
100
+
101
+ - **Dataset**: ALFRED (Action Learning From Realistic Environments and Directives)
102
+ - **Architecture**: Hierarchical cross-attention with residual fusion
103
+ - **Optimizer**: AdamW with cosine annealing schedule
104
+ - **Training Time**: 10 epochs, ~2 hours on single GPU
105
+ - **Hardware**: NVIDIA RTX GPU with 16GB VRAM
106
+
107
+ ## Model Architecture
108
+
109
+ The enhanced VLA model consists of:
110
+
111
+ 1. **Vision Encoder**: ResNet-based feature extraction
112
+ 2. **Language Encoder**: Transformer-based text processing
113
+ 3. **Hierarchical Cross-Attention**: Novel fusion mechanism
114
+ 4. **Action Decoder**: Multi-layer perceptron for action prediction
115
+
116
+ ## Research Impact
117
+
118
+ This work demonstrates that **architectural innovations in cross-modal attention** can achieve perfect generalization on complex embodied AI tasks, providing a foundation for more robust and efficient robotics applications.
119
+
120
+ ### Citation
121
+
122
+ ```bibtex
123
+ @misc{enhanced_vla_alfred_2024,
124
+ title={Enhanced VLA with Hierarchical Cross-Attention for ALFRED},
125
+ author={Chinmay Prashanth},
126
+ year={2024},
127
+ url={https://github.com/Chinmay-Prashanth/enhanced-vla-alfred}
128
+ }
129
+ ```
130
+
131
+ ## Links
132
+
133
+ - **GitHub Repository**: [enhanced-vla-alfred](https://github.com/Chinmay-Prashanth/enhanced-vla-alfred)
134
+ - **Paper**: [Coming Soon]
135
+ - **Demo**: [Interactive Demo](https://github.com/Chinmay-Prashanth/enhanced-vla-alfred#demo)
136
+
137
+ ---
138
+
139
+ **License**: MIT | **Framework**: PyTorch | **Task**: Vision-Language-Action Learning
config.json ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_type": "enhanced_vla",
3
+ "architectures": ["EnhancedVLAModel"],
4
+ "vision_hidden_dim": 768,
5
+ "text_hidden_dim": 768,
6
+ "fusion_hidden_dim": 512,
7
+ "action_dim": 7,
8
+ "num_attention_heads": 8,
9
+ "num_layers": 6,
10
+ "dropout_rate": 0.1,
11
+ "vocab_size": 8000,
12
+ "max_position_embeddings": 512,
13
+ "image_size": 224,
14
+ "patch_size": 16,
15
+ "training_epochs": 10,
16
+ "perfect_accuracy": true,
17
+ "final_loss": 0.043,
18
+ "baseline_improvement": "65.3%",
19
+ "torch_dtype": "float32",
20
+ "transformers_version": "4.52.4"
21
+ }
enhanced_vla_best.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b2957d061d353bca3df519f2efa3ecf413ec3cd47026b6bb83163cda9c4acc24
3
+ size 2371572187
enhanced_vla_comprehensive_analysis.png ADDED

Git LFS Details

  • SHA256: 09356d3c0d17c0a9015d68373560039fd980b3c762b8a0e12cd79ad30259971e
  • Pointer size: 131 Bytes
  • Size of remote file: 499 kB
enhanced_vla_final.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e8a8a0c06b67939e10e1218767746ba9905975e173352eabdee53954296a050e
3
+ size 793674702
model_comparison_results.png ADDED

Git LFS Details

  • SHA256: 37f2d0f666ce9e365c8662d46e080e5d1acb0e6d706388137011d98ad7c531e1
  • Pointer size: 131 Bytes
  • Size of remote file: 229 kB