🚀 MiniCoderX: A Lightweight Transformer for Code Generation

MiniCoderX is a structure-aware, transformer-based small language model (SLM) for code generation. It blends modern architectural techniques with efficient deployment using tools like LangChain and Ollama, making it ideal for rapid local experimentation.

Demo: https://v0-mini-coder-x.vercel.app/


✨ Features

  • 🧠 Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
  • 🌲 AST/CFG-aware encoding for code structure understanding
  • 💾 Syntax-constrained decoding using grammar rules and trees
  • 🔁 Multi-task heads: generation, summarization, translation, bug fixing
  • ⚙️ LangChain + Ollama integration for fast local deployment
  • 🧪 Evaluated on HumanEval, CodeXGLUE, MBPP

🏗️ Model Architecture

Component         Description
Base              Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5)
Structure-aware   AST and Control Flow Graph embeddings + positional masks
Heads             Multi-task heads for flexible downstream use
Decoder           Syntax-aware beam search (grammar constraints)
Tokenizer         BPE or SentencePiece trained on code + comments
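
The tokenizer row above could be produced with a standard SentencePiece/BPE training run. A minimal sketch, assuming a hypothetical corpus path, vocab size, and code-layout symbols (none of these are the project's actual settings):

import sentencepiece as spm

# Train a BPE tokenizer on a code + comments corpus (path and settings are assumptions)
spm.SentencePieceTrainer.train(
    input="corpus/code_and_comments.txt",            # one sample per line (hypothetical path)
    model_prefix="minicoderx_bpe",
    model_type="bpe",
    vocab_size=32000,
    character_coverage=1.0,
    user_defined_symbols="<NL>,<INDENT>,<DEDENT>",   # optional code-layout symbols (assumption)
)

# Load the trained model and tokenize a small code snippet
sp = spm.SentencePieceProcessor(model_file="minicoderx_bpe.model")
print(sp.encode("def quicksort(arr):", out_type=str))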

🔧 Architectural Additions (SOTA Techniques)

🌲 AST/CFG Embeddings

Enhances understanding of code structure by:

  • Adding AST node/edge embeddings to token inputs
  • Including path embeddings between syntactic elements
  • Graph-aware position encoding

Inspired by: StructCoder, AST-T5, Code4Struct
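
As a rough illustration of the bullets above, structural signals (AST node types and depths) can be summed directly into the token embeddings. A minimal PyTorch sketch, assuming tokens have already been aligned to AST nodes; class and variable names are hypothetical, not the project's code:

import torch
import torch.nn as nn

class StructureAwareEmbedding(nn.Module):
    """Token embeddings enriched with AST node-type and depth signals."""
    def __init__(self, vocab_size, num_node_types, max_depth, d_model):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.node_type = nn.Embedding(num_node_types, d_model)  # e.g. FunctionDef, If, Call
        self.depth = nn.Embedding(max_depth, d_model)            # AST depth as a structural position signal

    def forward(self, token_ids, node_type_ids, depth_ids):
        # Each token is aligned to the AST node that produced it; the structural
        # embeddings are simply added to the token embedding.
        return self.tok(token_ids) + self.node_type(node_type_ids) + self.depth(depth_ids)

emb = StructureAwareEmbedding(vocab_size=32000, num_node_types=128, max_depth=64, d_model=256)
x = emb(torch.tensor([[5, 9, 42]]), torch.tensor([[3, 3, 7]]), torch.tensor([[1, 2, 2]]))
print(x.shape)  # torch.Size([1, 3, 256])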

💾 Syntax-Constrained Decoding

Improves generation accuracy and reduces invalid code by:

  • Restricting token outputs using grammar constraints (BNF/PEG)
  • Custom decoding logic (e.g., tree traversal)
  • Dynamic decoding masks based on the current parser state (see the sketch below)

Inspired by: TreeGen, Code4Struct
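
A minimal sketch of dynamic decoding masks, using the prefix_allowed_tokens_fn hook in Hugging Face transformers. The model (distilgpt2) and the toy parenthesis rule are stand-ins for illustration only, not MiniCoderX's actual grammar engine:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Toy constraint standing in for real BNF/PEG rules: while a "(" is still open,
# forbid tokens that would open another one.
paren_open_ids = {i for tok, i in tokenizer.get_vocab().items() if "(" in tok}

def allowed_tokens(batch_id, input_ids):
    text = tokenizer.decode(input_ids)
    if text.count("(") > text.count(")"):
        return [i for i in range(len(tokenizer)) if i not in paren_open_ids]
    return list(range(len(tokenizer)))

prompt = tokenizer("def quicksort(", return_tensors="pt")
out = model.generate(
    **prompt,
    max_new_tokens=20,
    num_beams=4,                            # syntax-aware beam search
    prefix_allowed_tokens_fn=allowed_tokens,  # dynamic mask applied at every step
)
print(tokenizer.decode(out[0]))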

🔁 Multi-Task Learning Heads

Supports multiple tasks:

  • Code generation (NL → Code)
  • Summarization (Code → NL)
  • Translation (Java ⇄ Python)
  • Code repair and completion

Inspired by: CodeT5+, CoTexT
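
In practice these tasks can share one encoder-decoder by tagging each example with a task prefix, CodeT5-style, in addition to the task heads. A minimal sketch (the prefix strings are assumptions):

# Hypothetical task prefixes routing one shared model across tasks
TASK_PREFIXES = {
    "generate":  "generate python: ",
    "summarize": "summarize: ",
    "translate": "translate java to python: ",
    "repair":    "fix bug: ",
}

def format_example(task: str, source: str, target: str) -> dict:
    """Prepend a task tag so the same encoder-decoder serves every task."""
    return {"input": TASK_PREFIXES[task] + source, "target": target}

print(format_example("summarize",
                     "def add(a, b): return a + b",
                     "Adds two numbers."))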


⚡ LangChain + Ollama Integration

💡 Why?

To enable:

  • 🧪 Local testing and chaining of models via LangChain
  • 🦮 Fast prototyping with Ollama for custom transformer backends
  • 🔄 Easy switch between small local models and larger remote APIs

🔌 Integration Plan

from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX with Ollama
llm = Ollama(model="minicoderx")  # Local model via Ollama

# Define code generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")

print(result)

✅ Ollama will be used to serve your fine-tuned SLM locally
✅ LangChain will wrap it with prompts, chains, and memory features for interactivity


📦 Datasets

Dataset              Use
The Stack (subset)   Pretraining corpus
CodeSearchNet        Summarization, search
HumanEval            Code generation benchmark
MBPP                 Python programming prompts
Bugs2Fix             Code repair
Java-Python          Cross-language translation

🔬 Training Objectives

  • ✅ Span Masking (CodeT5-style)
  • ✅ Contrastive pretraining
  • ✅ Instruction tuning (natural prompt formatting)
  • ✅ Auto-regressive generation
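
A simplified sketch of the span-masking objective (CodeT5/T5-style span corruption): contiguous spans are hidden behind sentinel tokens in the source, and the target reproduces the hidden spans. The helper below is illustrative only; real pipelines work on token IDs and also de-overlap spans:

import random

def span_corrupt(tokens, mask_ratio=0.15, mean_span=3, seed=0):
    """Replace roughly mask_ratio of the tokens with sentinel spans (simplified)."""
    rng = random.Random(seed)
    n_spans = max(1, round(len(tokens) * mask_ratio / mean_span))
    starts = sorted(rng.sample(range(len(tokens) - mean_span), n_spans))
    source, target = [], []
    i = sentinel = 0
    while i < len(tokens):
        if sentinel < n_spans and i == starts[sentinel]:
            source.append(f"<extra_id_{sentinel}>")   # sentinel replaces the span in the source
            target.append(f"<extra_id_{sentinel}>")   # target reproduces the span after its sentinel
            target.extend(tokens[i:i + mean_span])
            i += mean_span
            sentinel += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target

src, tgt = span_corrupt("def add ( a , b ) : return a + b".split())
print(src)  # source sequence with a span replaced by <extra_id_0>
print(tgt)  # the removed span, prefixed by its sentinel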

📊 Evaluation Benchmarks

Benchmark    Metric
HumanEval    Pass@1, BLEU
MBPP         Accuracy
CodeXGLUE    CodeBLEU, EM
Unit tests   Pass rate
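
For reference, Pass@1 on HumanEval is normally reported with the unbiased pass@k estimator from the HumanEval paper (n samples per problem, c of which pass the unit tests). A small sketch:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n = generated samples, c = samples passing all unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples for one problem, 37 of them pass the tests -> estimated pass@1
print(pass_at_k(200, 37, 1))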

🧪 Project Roadmap

✅ Phase 1: MVP Model

  • Train TinyCodeT5 model with span masking
  • Evaluate on MBPP and HumanEval-lite
  • Serve via Ollama + LangChain prompt chain

🔁 Phase 2: Structural Learning

  • Add AST/CFG encodings
  • Introduce grammar-constrained decoding
  • Multi-task training (gen, sum, repair)

📦 Phase 3: Optimization & Packaging

  • Distill from larger model (e.g., StarCoder)
  • Add reinforcement fine-tuning via test cases
  • Export to Hugging Face + Ollama integration

🛠️ Tools & Frameworks


🤝 Contributing

Want to help with grammar decoders, AST integration, or evaluation? PRs welcome!


📜 License

MIT License. Built for research and open experimentation.


📧 Contact

Drop an issue or discussion on GitHub!
