# 🚀 MiniCoderX: A Lightweight Transformer for Code Generation

MiniCoderX is a structure-aware, transformer-based small language model (SLM) for code generation. It blends modern architectural techniques with efficient deployment using tools like LangChain and Ollama, making it ideal for rapid local experimentation.

Live demo: https://v0-mini-coder-x.vercel.app/
## ✨ Features
- 🧠 Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
- 🌲 AST/CFG-aware encoding for code structure understanding
- 💾 Syntax-constrained decoding using grammar rules and trees
- 🔁 Multi-task heads: generation, summarization, translation, bug fixing
- ⚙️ LangChain + Ollama integration for fast local deployment
- 🧪 Evaluated on HumanEval, CodeXGLUE, MBPP
## 🏗️ Model Architecture

| Component | Description |
|---|---|
| Base | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5) |
| Structure-aware | AST and Control Flow Graph embeddings + positional masks |
| Heads | Multi-task heads for flexible downstream use |
| Decoder | Syntax-aware beam search (grammar constraints) |
| Tokenizer | BPE or SentencePiece trained on code + comments |
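The tokenizer row above can be illustrated with a toy BPE merge loop. This is a minimal, stdlib-only sketch of the merge step that BPE training repeats, not MiniCoderX's actual tokenizer (which would be trained with SentencePiece or a Hugging Face tokenizer on a real code corpus):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from characters, as BPE training does, on a code-like snippet
tokens = list("def add(a, b): return a + b")
for _ in range(5):  # five merge iterations; real training runs thousands
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

Because each merge only concatenates adjacent symbols, joining the tokens always reproduces the original text — the property that makes BPE lossless.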
## 🔧 Architectural Additions (SOTA Techniques)

### 🌲 AST/CFG Embeddings
Enhances understanding of code structure by:
- Adding AST node/edge embeddings to token inputs
- Including path embeddings between syntactic elements
- Graph-aware position encoding
Inspired by: StructCoder, AST-T5, Code4Struct
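As a concrete illustration of the first bullet, the sketch below derives per-node AST type ids using only Python's built-in `ast` module; in the model, these ids would index an extra embedding table whose rows are added to the token embeddings. This is a simplified sketch — a real encoder also needs token-to-node alignment and the edge/path features listed above:

```python
import ast

def ast_node_type_ids(source):
    """Map each AST node to an integer id for its node type.

    In the model, these ids would index an additional embedding table
    whose vectors are summed with the ordinary token embeddings.
    """
    tree = ast.parse(source)
    type_vocab = {}  # node-type name -> id, built on the fly
    ids = []
    for node in ast.walk(tree):
        name = type(node).__name__
        if name not in type_vocab:
            type_vocab[name] = len(type_vocab)
        ids.append(type_vocab[name])
    return ids, type_vocab

ids, vocab = ast_node_type_ids("def add(a, b):\n    return a + b")
print(vocab)
```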
### 💾 Syntax-Constrained Decoding
Improves generation accuracy and reduces invalid code by:
- Restricting token outputs using grammar constraints (BNF/PEG)
- Custom decoding logic (e.g., tree traversal)
- Dynamic decoding masks based on token state
Inspired by: TreeGen, Code4Struct
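The dynamic-mask idea can be sketched with a toy grammar (`expr := NUM (OP NUM)*`). Everything here — the vocabulary, the grammar, and the scores — is made up for illustration; a real implementation would apply a mask like this to the logits at each step of beam search:

```python
# Toy grammar for arithmetic expressions: expr := NUM (OP NUM)*
VOCAB = ["1", "2", "+", "*", "<eos>"]
NUM, OP, EOS = {"1", "2"}, {"+", "*"}, "<eos>"

def allowed_mask(prev):
    """Return a 0/1 mask over VOCAB for the grammar-valid next tokens."""
    if prev is None or prev in OP:  # expecting a number
        return [1 if t in NUM else 0 for t in VOCAB]
    # after a number: either an operator or end of sequence
    return [1 if t in OP or t == EOS else 0 for t in VOCAB]

def constrained_decode(scores_per_step):
    """Greedy decoding that masks grammar-invalid tokens at each step."""
    out, prev = [], None
    for scores in scores_per_step:
        mask = allowed_mask(prev)
        best = max(range(len(VOCAB)),
                   key=lambda i: scores[i] if mask[i] else float("-inf"))
        prev = VOCAB[best]
        out.append(prev)
        if prev == EOS:
            break
    return out

# Unconstrained argmax would pick "+" first; the mask forces a number.
print(constrained_decode([[0.1, 0.2, 0.9, 0.0, 0.0],
                          [0.0, 0.0, 0.8, 0.1, 0.1],
                          [0.9, 0.1, 0.0, 0.0, 0.0],
                          [0.0, 0.0, 0.0, 0.0, 0.9]]))
```

The same pattern generalizes to BNF/PEG grammars: the decoder keeps a parser state instead of just the previous token, and the mask is computed from the set of terminals the parser can currently accept.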
### 🔁 Multi-Task Learning Heads
Supports multiple tasks:
- Code generation (NL → Code)
- Summarization (Code → NL)
- Translation (Java ⇄ Python)
- Code repair and completion
Inspired by: CodeT5+, CoTexT
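A lightweight complement to dedicated heads, used by CodeT5-style models, is routing tasks through a shared seq2seq model via task prefixes. The sketch below shows that formulation; the prefix strings are illustrative, not MiniCoderX's actual ones:

```python
# CodeT5-style task prefixes: one shared model, the prefix selects the task.
# These exact strings are hypothetical examples.
TASK_PREFIXES = {
    "generate":  "Generate Python: ",
    "summarize": "Summarize: ",
    "translate": "Translate Java to Python: ",
    "repair":    "Fix bug: ",
}

def format_example(task, text):
    """Prepend the task prefix so a single seq2seq model can route tasks."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + text

print(format_example("summarize", "def add(a, b): return a + b"))
```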
## ⚡ LangChain + Ollama Integration

### 💡 Why?
To enable:
- 🧪 Local testing and chaining of models via LangChain
- 🦮 Fast prototyping with Ollama for custom transformer backends
- 🔄 Easy switch between small local models and larger remote APIs
### 🔌 Integration Plan

```python
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX with Ollama
llm = Ollama(model="minicoderx")  # Local model served via Ollama

# Define the code generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")
print(result)
```
✅ Ollama will be used to serve your fine-tuned SLM locally
✅ LangChain will wrap it with prompts, chains, and memory features for interactivity
## 📦 Datasets

| Dataset | Use |
|---|---|
| The Stack (subset) | Pretraining corpus |
| CodeSearchNet | Summarization, search |
| HumanEval | Code generation benchmark |
| MBPP | Python programming prompts |
| Bugs2Fix | Code repair |
| Java-Python | Cross-language translation |
## 🔬 Training Objectives
- ✅ Span Masking (CodeT5-style)
- ✅ Contrastive pretraining
- ✅ Instruction tuning (natural prompt formatting)
- ✅ Auto-regressive generation
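The span-masking objective can be sketched as follows. This is a simplified, stdlib-only version of CodeT5-style span corruption — real implementations sample span lengths from a distribution and operate on subword ids rather than whitespace tokens:

```python
import random

def span_mask(tokens, mask_ratio=0.3, mean_span=2, seed=0):
    """CodeT5-style span corruption (simplified sketch).

    Contiguous spans are replaced with sentinel tokens <extra_id_0>, ...;
    the target reproduces each sentinel followed by the span it hid.
    """
    rng = random.Random(seed)
    src, tgt, i, sid = [], [], 0, 0
    while i < len(tokens):
        if rng.random() < mask_ratio:
            span = max(1, min(mean_span, len(tokens) - i))
            sentinel = f"<extra_id_{sid}>"
            src.append(sentinel)
            tgt.append(sentinel)
            tgt.extend(tokens[i:i + span])
            i += span
            sid += 1
        else:
            src.append(tokens[i])
            i += 1
    return src, tgt

src, tgt = span_mask("def add ( a , b ) : return a + b".split())
print(src)
print(tgt)
```

By construction, substituting each sentinel in `src` with its span from `tgt` recovers the original sequence — the invariant the model is trained to exploit.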
## 📊 Evaluation Benchmarks

| Benchmark | Metric |
|---|---|
| HumanEval | Pass@1, BLEU |
| MBPP | Accuracy |
| CodeXGLUE | CodeBLEU, EM |
| Unit Tests | Pass Rate |
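For reference, Pass@1 (and Pass@k generally) is typically computed with the unbiased estimator from the HumanEval paper, 1 − C(n−c, k)/C(n, k), where n samples are generated per problem and c of them pass the tests:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the HumanEval paper.

    n: samples generated per problem
    c: samples that pass the unit tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 3 pass, pass@1 reduces to c/n = 0.3
print(pass_at_k(10, 3, 1))
```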
## 🧪 Project Roadmap

### ✅ Phase 1: MVP Model
- Train TinyCodeT5 model with span masking
- Evaluate on MBPP and HumanEval-lite
- Serve via Ollama + LangChain prompt chain
### 🔁 Phase 2: Structural Learning
- Add AST/CFG encodings
- Introduce grammar-constrained decoding
- Multi-task training (gen, sum, repair)
### 📦 Phase 3: Optimization & Packaging

- Distill from a larger model (e.g., StarCoder)
- Add reinforcement fine-tuning via test cases
- Export to Hugging Face + Ollama integration
## 🛠️ Tools & Frameworks
- Hugging Face Transformers
- LangChain
- Ollama
- SentencePiece / BPE
- NetworkX for AST/CFG parsing
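As a sketch of the AST side of the last item: Python's `ast` module yields the parent→child edges that would be loaded into a `networkx.DiGraph` (via `nx.DiGraph()` plus `add_edges_from`). The snippet below sticks to the stdlib and returns edges as node-type pairs; CFG extraction would need an additional pass over the control-flow statements:

```python
import ast

def ast_edges(source):
    """Extract parent -> child edges from the AST as node-type name pairs.

    In the project, these edges would populate a networkx.DiGraph for
    downstream AST/CFG analysis.
    """
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges

edges = ast_edges("x = 1\nif x:\n    x += 1")
print(edges)
```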
## 🤝 Contributing
Want to help with grammar decoders, AST integration, or evaluation? PRs welcome!
## 📜 License
MIT License. Built for research and open experimentation.
## 📧 Contact
Drop an issue or discussion on GitHub!