# 🚀 MiniCoderX: A Lightweight Transformer for Code Generation

**MiniCoderX** is a structure-aware, transformer-based small language model (SLM) for code generation. It combines modern architectural techniques with lightweight local deployment through **LangChain** and **Ollama**, making it well suited to rapid local experimentation.

---

## ✨ Features

- 🧠 Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
- 🌲 AST/CFG-aware encoding for code structure understanding
- 💾 Syntax-constrained decoding using grammar rules and trees
- 🔁 Multi-task heads: generation, summarization, translation, bug fixing
- ⚙️ LangChain + Ollama integration for fast local deployment
- 🧪 Evaluated on HumanEval, CodeXGLUE, and MBPP

---

## 🏗️ Model Architecture

| Component       | Description                                               |
|-----------------|-----------------------------------------------------------|
| Base            | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5)     |
| Structure-aware | AST and Control Flow Graph embeddings + positional masks  |
| Heads           | Multi-task heads for flexible downstream use              |
| Decoder         | Syntax-aware beam search (grammar constraints)            |
| Tokenizer       | BPE or SentencePiece trained on code + comments           |
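
A minimal sketch of standing up the base with Hugging Face Transformers. The public `Salesforce/codet5-small` checkpoint is used here only as a stand-in for the project's own TinyCodeT5 weights:

```python
# Sketch: load a small encoder-decoder as the MiniCoderX base.
# "Salesforce/codet5-small" is a public stand-in; swap in the trained
# TinyCodeT5 checkpoint once it exists.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small")

inputs = tokenizer("Generate Python code for the task: reverse a string",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```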

---

## 🔧 Architectural Additions (SOTA Techniques)

### 🌲 AST/CFG Embeddings
Enhances understanding of code structure by:
- Adding AST node/edge embeddings to token inputs
- Including path embeddings between syntactic elements
- Applying graph-aware position encodings

Inspired by: **StructCoder**, **AST-T5**, **Code4Struct**
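
A minimal PyTorch sketch of the first idea; the token-to-node alignment is a simplifying assumption (each token carries one node-type id), and a full implementation would add edge/path embeddings as well:

```python
# Sketch: structure-aware input embeddings. Each token is assumed to be
# annotated with the type id of its enclosing AST node.
import torch
import torch.nn as nn

class StructureAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, num_node_types: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.node = nn.Embedding(num_node_types, d_model)  # AST node-type table

    def forward(self, token_ids, node_type_ids):
        # Token embedding + structure embedding, summed elementwise.
        return self.tok(token_ids) + self.node(node_type_ids)

emb = StructureAwareEmbedding(vocab_size=32000, num_node_types=128, d_model=256)
tokens = torch.tensor([[5, 17, 42]])   # token ids
nodes = torch.tensor([[3, 3, 9]])      # e.g., FunctionDef, FunctionDef, Return
print(emb(tokens, nodes).shape)        # torch.Size([1, 3, 256])
```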

### 💾 Syntax-Constrained Decoding
Improves generation accuracy and reduces syntactically invalid output by:
- Restricting token outputs using grammar constraints (BNF/PEG)
- Applying custom decoding logic (e.g., tree traversal)
- Updating decoding masks dynamically based on the current decoding state

Inspired by: **TreeGen**, **Code4Struct**
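
A minimal sketch of the masking step; the set of grammar-legal next tokens is hard-coded here, where a real decoder would query a BNF/PEG parser state at every step:

```python
# Sketch: one step of grammar-constrained decoding. Logits for tokens the
# grammar forbids are pushed to -inf, so they can never be selected.
import torch

def constrained_step(logits: torch.Tensor, allowed_ids: torch.Tensor) -> torch.Tensor:
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0            # leave only grammar-legal tokens finite
    return torch.argmax(logits + mask)

logits = torch.randn(32000)                # one decoding step's raw scores
allowed = torch.tensor([11, 42, 907])      # ids a (hypothetical) grammar allows
print(constrained_step(logits, allowed))   # always one of the allowed ids
```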

### 🔁 Multi-Task Learning Heads
A single shared backbone supports multiple tasks:
- Code generation (NL → Code)
- Summarization (Code → NL)
- Translation (Java ⇄ Python)
- Code repair and completion

Inspired by: **CodeT5+**, **CoTexT**
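
A minimal sketch of the CodeT5-style task-prefix pattern that routes one shared model across tasks; the prefix strings are illustrative, not a fixed API:

```python
# Sketch: route one shared encoder-decoder across tasks via input prefixes.
TASK_PREFIXES = {
    "generate": "generate python: ",
    "summarize": "summarize: ",
    "translate": "translate java to python: ",
    "repair": "fix bug: ",
}

def build_input(task: str, text: str) -> str:
    # The prefix tells the shared model which behavior is expected.
    return TASK_PREFIXES[task] + text

print(build_input("summarize", "def add(a, b): return a + b"))
# -> "summarize: def add(a, b): return a + b"
```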

---

## ⚡ LangChain + Ollama Integration

### 💡 Why?
To enable:
- 🧪 Local testing and chaining of models via **LangChain**
- 🦮 Fast prototyping with **Ollama** for custom transformer backends
- 🔄 Easy switching between small local models and larger remote APIs

### 🔌 Integration Plan
```python
# Classic LangChain imports; on newer LangChain versions the Ollama wrapper
# lives in langchain_community.llms instead.
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX served locally via Ollama
llm = Ollama(model="minicoderx")

# Define the code-generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")

print(result)
```

> ✅ Ollama will be used to serve the fine-tuned SLM locally
> ✅ LangChain will wrap it with prompts, chains, and memory features for interactivity
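
Memory can be layered onto the same local model so follow-up instructions refer back to earlier generations. A sketch using the classic LangChain memory API (class locations differ in newer releases; `minicoderx` is the same placeholder model name as above):

```python
# Sketch: conversational memory around the local model, so the second request
# can refer to the function produced by the first.
from langchain.llms import Ollama
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = Ollama(model="minicoderx")
chat = ConversationChain(llm=llm, memory=ConversationBufferMemory())

print(chat.run("Write a Python function that checks if a number is prime."))
print(chat.run("Now add type hints to that function."))
```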

---

## 📦 Datasets

| Dataset            | Use                        |
|--------------------|----------------------------|
| The Stack (subset) | Pretraining corpus         |
| CodeSearchNet      | Summarization, search      |
| HumanEval          | Code generation benchmark  |
| MBPP               | Python programming prompts |
| Bugs2Fix           | Code repair                |
| Java-Python        | Cross-language translation |
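
Most of these are available through the `datasets` library; a minimal sketch for two of the evaluation sets (dataset ids are the public Hugging Face ones; pretraining on a subset of The Stack would additionally need auth and streaming):

```python
# Sketch: pull two benchmark sets from the Hugging Face Hub.
from datasets import load_dataset

mbpp = load_dataset("mbpp", split="test")
humaneval = load_dataset("openai_humaneval", split="test")

print(mbpp[0]["text"])         # natural-language programming task
print(humaneval[0]["prompt"])  # function signature + docstring to complete
```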

---

## 🔬 Training Objectives

- ✅ Span masking (CodeT5-style)
- ✅ Contrastive pretraining
- ✅ Instruction tuning (natural prompt formatting)
- ✅ Auto-regressive generation
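
To make the first objective concrete, a minimal sketch of CodeT5-style span masking; sentinel tokens follow the T5 convention, and the span positions are hard-coded for illustration:

```python
# Sketch: replace chosen spans with sentinels in the input; the target lists
# each sentinel followed by the tokens it hides.
def span_mask(tokens, spans):
    inp, tgt, last = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp += tokens[last:start] + [sentinel]
        tgt += [sentinel] + tokens[start:end]
        last = end
    inp += tokens[last:]
    return inp, tgt

tokens = "def add ( a , b ) : return a + b".split()
inp, tgt = span_mask(tokens, [(1, 2), (8, 12)])
print(" ".join(inp))  # def <extra_id_0> ( a , b ) : <extra_id_1>
print(" ".join(tgt))  # <extra_id_0> add <extra_id_1> return a + b
```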

---

## 📊 Evaluation Benchmarks

| Benchmark  | Metric       |
|------------|--------------|
| HumanEval  | Pass@1, BLEU |
| MBPP       | Accuracy     |
| CodeXGLUE  | CodeBLEU, EM |
| Unit Tests | Pass rate    |
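
For reference, Pass@k on HumanEval is conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021), where n samples are drawn per problem and c of them pass the unit tests:

```python
# Sketch: unbiased pass@k estimator,
# pass@k = 1 - C(n-c, k) / C(n, k), computed stably as a running product.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

print(pass_at_k(n=20, c=3, k=1))  # 0.15
```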

---

## 🧪 Project Roadmap

### ✅ Phase 1: MVP Model
- Train the TinyCodeT5 model with span masking
- Evaluate on MBPP and HumanEval-lite
- Serve via Ollama + a LangChain prompt chain

### 🔁 Phase 2: Structural Learning
- Add AST/CFG encodings
- Introduce grammar-constrained decoding
- Multi-task training (generation, summarization, repair)

### 📦 Phase 3: Optimization & Packaging
- Distill from a larger model (e.g., StarCoder)
- Add reinforcement fine-tuning driven by test cases
- Export to Hugging Face Hub + Ollama integration

---

## 🛠️ Tools & Frameworks

- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [LangChain](https://github.com/langchain-ai/langchain)
- [Ollama](https://ollama.com/)
- SentencePiece / BPE tokenizers
- NetworkX for AST/CFG graph construction (see the sketch below)
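
A minimal sketch of that last item: turning a Python AST into a NetworkX graph, the kind of structure the AST/CFG embeddings above would consume:

```python
# Sketch: build a parent->child graph over AST nodes; node ids are Python
# object ids, with each node's class name stored as a "type" attribute.
import ast
import networkx as nx

def ast_to_graph(source: str) -> nx.DiGraph:
    tree = ast.parse(source)
    g = nx.DiGraph()
    g.add_node(id(tree), type=type(tree).__name__)
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            g.add_node(id(child), type=type(child).__name__)
            g.add_edge(id(parent), id(child))
    return g

g = ast_to_graph("def add(a, b):\n    return a + b")
print([g.nodes[n]["type"] for n in g.nodes])
# ['Module', 'FunctionDef', 'arguments', 'Return', 'arg', 'arg', 'BinOp', ...]
```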

---

## 🤝 Contributing

Want to help with grammar-constrained decoders, AST integration, or evaluation? PRs are welcome!

---

## 📜 License

MIT License. Built for research and open experimentation.

---

## 📧 Contact

Open an issue or start a discussion on GitHub!