Git Diff -> Commit Message (Gemma 3 270M IT + LoRA)

A small, fast model specialized to turn a git diff into a concise, English commit message. Built on top of google/gemma-3-270m-it and fine-tuned with LoRA using MLX on macOS.

Requirements

  • macOS with Apple Silicon (for MLX)
  • Python 3.8+
  • Required packages:
    pip install mlx-lm transformers
    

What this model expects (most important)

  • Input type: a unified git diff as plain text.
  • Wrap the diff in a Markdown code fence labeled diff for best results.
  • The diff should look like the output of git diff --no-color (hunk headers like @@, +/- line prefixes, file headers, etc.).
  • Keep diffs reasonably sized. The training/CLI path truncates diffs to ~3,000 characters and trains/infers with a context window of ~2,048 tokens, so extremely large diffs should be summarized or sampled (see the sketch after this list).
  • Language of response: English only. The system prompt enforces English output.
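
A minimal pre-processing sketch along these lines; the ~3,000-character cap matches the pipeline described above, but the helper name and the naive cut-off strategy are illustrative assumptions rather than code from this repository:

```python
MAX_DIFF_CHARS = 3000  # the ~3,000-character cap described above

def truncate_diff(diff_text: str, max_chars: int = MAX_DIFF_CHARS) -> str:
    """Naively truncate an oversized diff; summarizing or sampling hunks may work better."""
    diff_text = diff_text.strip()
    return diff_text if len(diff_text) <= max_chars else diff_text[:max_chars]
```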

Training Data Format

This model was trained on the data/train_gpt-oss-20b.jsonl dataset in this repository. The training data uses Gemma's chat template format with the following exact structure:

User prompt format (as seen in training data):

Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
     cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
     cutout.zmag = new_zp
 
+    if math.fabs(new_zp - old_zp) > 0.3:
+        logging.warning("Large change in zeropoint detected: {}  -> {}".format(old_zp, new_zp))
+
     try:
-        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
         (x, y) = cutout.get_observed_coordinates((x, y))
     except:
         logging.warn("Failed to do photometry.")
```

**Important:** To get the best results, match this exact format including:
- The instruction text: "Generate a concise and descriptive commit message for this git diff:"
- The double newline after the instruction
- The diff wrapped in triple backticks with `diff` language tag
- Hash placeholders shown as `<HASH>..<HASH>` in the diff headers (one way to normalize real hashes is sketched below)
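
If your raw diffs still contain real blob hashes, you may want to normalize the `index` lines to the `<HASH>..<HASH>` placeholder used in the training data. A minimal sketch, where the regex and helper name are illustrative rather than code from this repository:

```python
import re

def mask_index_hashes(diff_text: str) -> str:
    """Replace blob hashes on `index` lines with the <HASH>..<HASH> placeholder."""
    return re.sub(
        r"^index [0-9a-f]+\.\.[0-9a-f]+",
        "index <HASH>..<HASH>",
        diff_text,
        flags=re.MULTILINE,
    )
```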

### Chat template (Gemma 3)
The model was trained and run for inference using Gemma's chat template, with the system prompt enforcing English-only responses. Conceptually:

- system: "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
- user: "Generate a concise and descriptive commit message for this git diff:" + the diff wrapped in ```diff fences
- assistant: single-line commit message (the target during training)

Training data (chat format) examples were stored like:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."},
    {"role": "user", "content": "Generate a concise and descriptive commit message for this git diff:\n\n```diff\n<diff text>\n```"},
    {"role": "assistant", "content": "<single-line commit message>"}
  ]
}
```
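
For reference, one way such JSONL records could be produced from (diff, commit message) pairs; this is a hedged sketch of the format above, not the conversion script used to build the dataset:

```python
import json

SYSTEM_PROMPT = (
    "You are a helpful assistant that generates git commit messages. "
    "Always respond in English only. Do not use any other language."
)

def to_training_record(diff_text: str, commit_message: str) -> str:
    """Serialize one (diff, commit message) pair as a chat-format JSONL line."""
    user_message = (
        "Generate a concise and descriptive commit message for this git diff:\n\n"
        "```diff\n" + diff_text + "\n```"
    )
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": commit_message},
        ]
    }
    return json.dumps(record, ensure_ascii=False)
```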

Output

  • A single-line commit subject, in English.
  • The CLI post-processes the generation and returns the first non-empty line.
  • Keep it concise and descriptive; optionally target ~50–72 characters where possible (a minimal post-processing sketch follows this list).
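
A minimal sketch of that post-processing, assuming you keep the first non-empty line and softly trim to a conventional subject length; the 72-character cut-off is an illustrative choice, not something the model enforces:

```python
def extract_subject(generated_text: str, max_len: int = 72) -> str:
    """Return the first non-empty line of the generation, softly capped at max_len characters."""
    for line in generated_text.splitlines():
        line = line.strip()
        if line:
            return line if len(line) <= max_len else line[:max_len].rstrip()
    return ""
```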

Quick usage

Python Script (MLX)

Here's a complete standalone script to generate commit messages using this model:

```python
#!/usr/bin/env python3
"""
Standalone script to generate git commit messages using the fine-tuned Gemma model.
Requires: mlx-lm, transformers
Install with: pip install mlx-lm transformers
"""

import subprocess
import sys
from mlx_lm import load, generate
from transformers import AutoTokenizer

def get_staged_diff():
    """Get the staged git diff from the current repository."""
    try:
        result = subprocess.run(
            ['git', 'diff', '--staged', '--no-color'], 
            capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError:
        print("Error: Could not get git diff. Make sure you're in a git repository with staged changes.")
        return None

def format_prompt(diff_text, tokenizer):
    """Format the diff into the exact training data format."""
    system_prompt = "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
    user_message = f"Generate a concise and descriptive commit message for this git diff:\n\n```diff\n{diff_text}\n```"
    
    # Format using Gemma chat template
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]
    
    prompt = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    return prompt

def generate_commit_message(diff_text, model_path="hks350d/git-diff-to-commit-gemma-3-270m"):
    """Generate a commit message from a git diff."""
    
    # Load model and tokenizer
    print("Loading model...")
    model, mlx_tokenizer = load(model_path)
    hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
    
    # Format the prompt
    prompt = format_prompt(diff_text, hf_tokenizer)
    
    # Generate response
    print("Generating commit message...")
    response = generate(
        model, 
        mlx_tokenizer, 
        prompt=prompt, 
        max_tokens=100,
        temp=0.7,
        top_p=0.9,
        verbose=False
    )
    
    # mlx_lm's generate() typically returns only the completion (not the prompt),
    # so only strip the prompt if your version actually echoes it back.
    generated_text = response[len(prompt):].strip() if response.startswith(prompt) else response.strip()
    
    # Return the first non-empty line
    lines = [line.strip() for line in generated_text.split('\n') if line.strip()]
    return lines[0] if lines else "Unable to generate commit message"

def main():
    """Main function - can be used with staged diff or provided diff text."""
    
    if len(sys.argv) > 1:
        # Use provided diff file
        diff_file = sys.argv[1]
        try:
            with open(diff_file, 'r') as f:
                diff_text = f.read().strip()
        except FileNotFoundError:
            print(f"Error: File {diff_file} not found.")
            return
    else:
        # Get staged diff from git
        diff_text = get_staged_diff()
        if not diff_text:
            print("No staged changes found. Stage some changes with 'git add' first.")
            return
    
    if not diff_text:
        print("No diff content to process.")
        return
    
    # Generate and print commit message
    commit_message = generate_commit_message(diff_text)
    print(f"\nSuggested commit message:")
    print(f"  {commit_message}")

if __name__ == "__main__":
    main()
```

Usage Examples

  1. Generate from staged git changes:

    python generate_commit.py
    
  2. Generate from a diff file:

    python generate_commit.py my_changes.diff
    
  3. Use in your own code:

    from generate_commit import generate_commit_message

    # Keep the diff at column 0 inside the string so the +/- and context-line
    # prefixes are preserved exactly (extra indentation would corrupt them).
    diff = """diff --git a/app.py b/app.py
index e69de29..f4c3b4a 100644
--- a/app.py
+++ b/app.py
@@ -0,0 +1,3 @@
+def add(a, b):
+    return a + b
"""

    message = generate_commit_message(diff)
    print(message)
    

Examples

Input (user message content as formatted in training data):

Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
     cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
     cutout.zmag = new_zp
 
+    if math.fabs(new_zp - old_zp) > 0.3:
+        logging.warning("Large change in zeropoint detected: {}  -> {}".format(old_zp, new_zp))
+
     try:
-        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
         (x, y) = cutout.get_observed_coordinates((x, y))
     except:
         logging.warn("Failed to do photometry.")
```

Possible outputs:

  • fix: use new_zp instead of old_zp for magnitude calculation and add zeropoint change warning
  • fix: correct zeropoint usage in photometry and add warning for large zeropoint changes
  • refactor: update magnitude calculation to use new zeropoint and add change detection

Training summary

  • Base model: google/gemma-3-270m-it (Gemma 3, 270M, instruction-tuned).
  • Method: LoRA fine-tuning with MLX (mlx_lm lora). Prompt masking was enabled so the loss is computed only on the assistant response (an example invocation is sketched after this list).
  • Training data: data/train_gpt-oss-20b.jsonl in this repository - a dataset converted to chat format with diffs fenced as ```diff and English, single-line commit messages as targets. This dataset is Python-focused.
  • Data format: Each training example uses the exact user prompt format shown above in the chat template structure.
  • Context/config highlights: max sequence length ~2048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to be model-friendly.
  • Important: To achieve best results, match the exact input format used in the training data.
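
For reference, a LoRA run of this kind is typically launched with mlx-lm roughly as follows. This is a hedged sketch, not the exact command used for this model: the iteration count and batch size are illustrative, --data expects a directory containing train.jsonl/valid.jsonl, and flag availability varies between mlx-lm releases (prompt masking, for instance, is only exposed as --mask-prompt in newer versions), so check mlx_lm.lora --help for your installed version.

    mlx_lm.lora \
      --model google/gemma-3-270m-it \
      --train \
      --data data \
      --iters 1000 \
      --batch-size 4 \
      --max-seq-length 2048 \
      --adapter-path adapters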

Evaluation

  • The repo includes a lightweight evaluation that compares generated messages to a reference using a simple string similarity (difflib's SequenceMatcher, sketched below) across multiple runs (varying the RNG seed). Results and artifacts are saved under evaluation_results/.
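
A minimal sketch of that similarity measure using Python's standard library; the repository's actual evaluation script may differ in its details:

```python
from difflib import SequenceMatcher

def similarity(generated: str, reference: str) -> float:
    """String similarity ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, generated.strip(), reference.strip()).ratio()
```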

Limitations and risks

  • Diff size sensitivity: Very large diffs may be truncated; consider summarizing large changes.
  • Domain bias: Training set emphasized Python diffs; behavior may be better for Python-heavy repos.
  • Hallucinations: As with any LLM, may produce generic or mismatched messages if the diff is ambiguous.
  • Security: Do not feed secrets; generated text may inadvertently paraphrase sensitive context.
  • Language: System prompt enforces English responses.

Intended use

  • Assist developers by proposing a concise commit subject from a given git diff.
  • Not a replacement for human judgment; review messages before committing.

How to format inputs yourself

If you’re not using the CLI helpers, follow this structure with the Gemma chat template:

  • system: English-only instruction for commit message generation (see above)
  • user: instruction + the diff in ```diff code fences
  • assistant: the target single-line subject (for training) or left empty (for inference)

The repository’s format_commit_message_prompt builds the correct prompt for Gemma 3.

License and credits

  • Base model: Google Gemma 3 (google/gemma-3-270m-it). Use subject to the Gemma license terms.
  • Fine-tuning code: MLX and utilities in this repository. See repository license for details.