Git Diff -> Commit Message (Gemma 3 270M IT + LoRA)

A small, fast model specialized to turn a git diff into a concise, English commit message. Built on top of google/gemma-3-270m-it and fine-tuned with LoRA using MLX on macOS.

Requirements

macOS with Apple Silicon (for MLX)
Python 3.8+
Required packages:
```
pip install mlx-lm transformers
```

What this model expects (most important)

Input type: a unified git diff as plain text.
Wrap the diff in a Markdown code fence labeled diff for best results.
The diff should look like the output of git diff --no-color (hunk headers like @@, +/- line prefixes, file headers, etc.).
Keep diffs reasonably sized. The training/CLI path truncates diffs to ~3,000 characters and trains/infers with a context window of ~2,048 tokens. Extremely large diffs should be summarized or sampled.
Language of response: English only. The system prompt enforces English output.
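
The ~3,000-character cap mentioned above can be approximated before the prompt is built. The sketch below is one possible way to do it, assuming a plain character cut that backs up to the last complete line. The helper name `truncate_diff` and the line-boundary behaviour are illustrative assumptions, not the repository's actual preprocessing.

```python
def truncate_diff(diff_text: str, max_chars: int = 3000) -> str:
    """Cap a diff at max_chars, dropping any partial trailing line.

    Hypothetical helper: the repository's real truncation logic may differ.
    """
    if len(diff_text) <= max_chars:
        return diff_text
    head = diff_text[:max_chars]
    # Back up to the last newline so the diff stays line-aligned.
    cut = head.rfind("\n")
    return head[:cut] if cut > 0 else head


sample = "\n".join(f"+line {i}" for i in range(1000))
short = truncate_diff(sample)
print(len(sample), "->", len(short))
```

A character cap is only a rough proxy for the ~2,048-token window, so a diff with many short tokens can still overflow it.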

Training Data Format

This model was trained on the data/train_gpt-oss-20b.jsonl dataset in this repository. The training data uses Gemma's chat template format with the following exact structure:

User prompt format (as seen in training data):

Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
     cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
     cutout.zmag = new_zp
 
+    if math.fabs(new_zp - old_zp) > 0.3:
+        logging.warning("Large change in zeropoint detected: {}  -> {}".format(old_zp, new_zp))
+
     try:
-        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
         (x, y) = cutout.get_observed_coordinates((x, y))
     except:
         logging.warn("Failed to do photometry.")
```


**Important:** To get the best results, match this exact format including:
- The instruction text: "Generate a concise and descriptive commit message for this git diff:"
- The double newline after the instruction
- The diff wrapped in triple backticks with `diff` language tag
- Hash placeholders shown as `<HASH>..<HASH>` in the diff headers
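
The four points above can be sketched as a small prompt builder. This is a minimal illustration, assuming that replacing the hashes on git's `index` header lines with `<HASH>` is enough to match the training data. The names `normalize_hashes` and `build_user_message` and the regular expression are hypothetical and do not come from the repository.

```python
import re

INSTRUCTION = "Generate a concise and descriptive commit message for this git diff:"

# Matches "index <old>..<new>" header lines with abbreviated or full hex hashes.
INDEX_LINE = re.compile(r"^index [0-9a-f]+\.\.[0-9a-f]+", flags=re.MULTILINE)


def normalize_hashes(diff_text: str) -> str:
    """Replace blob hashes in index lines with the <HASH> placeholder."""
    return INDEX_LINE.sub("index <HASH>..<HASH>", diff_text)


def build_user_message(diff_text: str) -> str:
    """Instruction, blank line, then the diff inside a fence labeled diff."""
    return f"{INSTRUCTION}\n\n```diff\n{normalize_hashes(diff_text)}\n```"


raw = "diff --git a/app.py b/app.py\nindex e69de29..f4c3b4a 100644\n--- a/app.py"
print(build_user_message(raw))
```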

### Chat template (Gemma 3)
The model was trained and run at inference with Gemma's chat template, using a system prompt that enforces English-only responses. Conceptually:

- system: "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
- user: "Generate a concise and descriptive commit message for this git diff:" followed by a blank line and the diff in a fenced block labeled `diff` (the exact format shown above)
- assistant: single-line commit message (target)

Training data (chat format) examples were stored like:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."},
    {"role": "user", "content": "Generate a concise and descriptive commit message for this git diff:\n\n```diff\n<diff text>\n```"},
    {"role": "assistant", "content": "<single-line commit message>"}
  ]
}
```
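
A record in this shape can be produced with the standard `json` module. The sketch below is an assumption about how such a line might be written, not the repository's conversion script. `make_record` is a hypothetical name, and reducing the commit message to its first line is an illustrative choice.

```python
import json

SYSTEM_PROMPT = (
    "You are a helpful assistant that generates git commit messages. "
    "Always respond in English only. Do not use any other language."
)
INSTRUCTION = "Generate a concise and descriptive commit message for this git diff:"


def make_record(diff_text: str, commit_message: str) -> dict:
    """Build one chat-format training example (hypothetical helper)."""
    lines = commit_message.strip().splitlines()
    if not lines:
        raise ValueError("commit message is empty")
    user = f"{INSTRUCTION}\n\n```diff\n{diff_text}\n```"
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user},
            # Targets are single-line subjects, so keep only the first line.
            {"role": "assistant", "content": lines[0].strip()},
        ]
    }


record = make_record("--- a/app.py\n+++ b/app.py", "Add add() helper\n\nLonger body")
# json.dumps escapes newlines, so each record stays on one JSONL line.
jsonl_line = json.dumps(record, ensure_ascii=False)
print(jsonl_line)
```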

Output

A single-line commit subject, in English.
The CLI post-processes the generation and returns the first non-empty line.
Keep it concise and descriptive; optionally target ~50–72 characters where possible.
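
The post-processing described above amounts to a few lines. This is a minimal sketch that assumes the length target is only a soft check. The function names and the default limit of 72 are illustrative and are not taken from the CLI.

```python
def extract_subject(generated_text: str) -> str:
    """Return the first non-empty line of the model output, stripped."""
    for line in generated_text.splitlines():
        line = line.strip()
        if line:
            return line
    return ""


def within_subject_length(subject: str, limit: int = 72) -> bool:
    """Soft check against the upper end of the ~50-72 character target."""
    return len(subject) <= limit


subject = extract_subject("\n\n  fix: handle empty diff  \nextra detail")
print(subject, within_subject_length(subject))
```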

Quick usage

Python Script (MLX)

Here's a complete standalone script to generate commit messages using this model:

```python
#!/usr/bin/env python3
"""
Standalone script to generate git commit messages using the fine-tuned Gemma model.
Requires: mlx-lm, transformers
Install with: pip install mlx-lm transformers
"""

import subprocess
import sys

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
from transformers import AutoTokenizer

def get_staged_diff():
    """Get the staged git diff from the current repository."""
    try:
        result = subprocess.run(
            ['git', 'diff', '--staged', '--no-color'],
            capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError:
        print("Error: Could not get git diff. Make sure you're in a git repository with staged changes.")
        return None

def format_prompt(diff_text, tokenizer):
    """Format the diff into the exact training data format."""
    system_prompt = "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
    user_message = f"Generate a concise and descriptive commit message for this git diff:\n\n```diff\n{diff_text}\n```"

    # Format using Gemma chat template
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return prompt

def generate_commit_message(diff_text, model_path="your-username/git-diff-to-commit-gemma-3-270m"):
    """Generate a commit message from a git diff."""

    # Load model and tokenizer
    print("Loading model...")
    model, mlx_tokenizer = load(model_path)
    hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")

    # Format the prompt
    prompt = format_prompt(diff_text, hf_tokenizer)

    # Recent mlx-lm releases take sampling settings through a sampler object;
    # older releases accepted temp= and top_p= directly on generate().
    sampler = make_sampler(temp=0.7, top_p=0.9)

    # Generate response
    print("Generating commit message...")
    response = generate(
        model,
        mlx_tokenizer,
        prompt=prompt,
        max_tokens=100,
        sampler=sampler,
        verbose=False
    )

    # generate() returns only the newly generated text, not the prompt,
    # so the response must not be sliced by the prompt length.
    generated_text = response.strip()

    # Return the first non-empty line
    lines = [line.strip() for line in generated_text.split('\n') if line.strip()]
    return lines[0] if lines else "Unable to generate commit message"

def main():
    """Main function - can be used with staged diff or provided diff text."""

    if len(sys.argv) > 1:
        # Use provided diff file
        diff_file = sys.argv[1]
        try:
            with open(diff_file, 'r', encoding='utf-8') as f:
                diff_text = f.read().strip()
        except FileNotFoundError:
            print(f"Error: File {diff_file} not found.")
            return
    else:
        # Get staged diff from git
        diff_text = get_staged_diff()
        if not diff_text:
            print("No staged changes found. Stage some changes with 'git add' first.")
            return

    if not diff_text:
        print("No diff content to process.")
        return

    # Generate and print commit message
    commit_message = generate_commit_message(diff_text)
    print("\nSuggested commit message:")
    print(f"  {commit_message}")

if __name__ == "__main__":
    main()
```

Usage Examples

Generate from staged git changes:
```
python generate_commit.py
```

Generate from a diff file:

```
python generate_commit.py my_changes.diff
```

Use in your own code:

```python
from generate_commit import generate_commit_message

diff = """diff --git a/app.py b/app.py
index e69de29..f4c3b4a 100644
--- a/app.py
+++ b/app.py
@@ -0,0 +1,3 @@
+def add(a, b):
+    return a + b
"""

message = generate_commit_message(diff)
print(message)
```

Examples

Input (user message content as formatted in training data):

Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
     cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
     cutout.zmag = new_zp
 
+    if math.fabs(new_zp - old_zp) > 0.3:
+        logging.warning("Large change in zeropoint detected: {}  -> {}".format(old_zp, new_zp))
+
     try:
-        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
         (x, y) = cutout.get_observed_coordinates((x, y))
     except:
         logging.warn("Failed to do photometry.")
```

Possible outputs:

fix: use new_zp instead of old_zp for magnitude calculation and add zeropoint change warning
fix: correct zeropoint usage in photometry and add warning for large zeropoint changes
refactor: update magnitude calculation to use new zeropoint and add change detection

Training summary

Base model: google/gemma-3-270m-it (Gemma 3, 270M, instruction-tuned).
Method: LoRA fine-tuning with MLX (mlx_lm lora). Prompt masking was enabled, so the loss is computed only on the assistant response and not on the system or user prompt tokens.
Training data: data/train_gpt-oss-20b.jsonl in this repository - a dataset converted to chat format with diffs fenced as ```diff and English, single-line commit messages as targets. This dataset is Python-focused.
Data format: Each training example uses the exact user prompt format shown above in the chat template structure.
Context/config highlights: max sequence length ~2048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to be model-friendly.
Important: To achieve best results, match the exact input format used in the training data.
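
The idea behind prompt masking can be shown in plain Python. This sketch illustrates the concept only and is not how mlx_lm implements it internally. The function names and the toy per-token losses are invented for the example.

```python
def loss_mask(prompt_len: int, total_len: int) -> list:
    """0 for prompt tokens (ignored), 1 for assistant tokens (trained on)."""
    return [0] * prompt_len + [1] * (total_len - prompt_len)


def masked_mean(token_losses: list, mask: list) -> float:
    """Average the per-token losses over unmasked positions only."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Toy sequence: 3 prompt tokens followed by 2 assistant tokens.
mask = loss_mask(prompt_len=3, total_len=5)
print(mask)                                            # [0, 0, 0, 1, 1]
print(masked_mean([9.0, 9.0, 9.0, 1.0, 3.0], mask))    # 2.0
```

With masking on, the long diff in the prompt contributes no gradient, so training capacity goes to producing the commit subject rather than to reproducing diffs.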

Evaluation

The repo includes a lightweight evaluation that compares generated messages to a reference using a simple string similarity (SequenceMatcher) across multiple runs (varying the RNG seed). Results and artifacts are saved under evaluation_results/.
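
A comparison of this kind can be sketched with the standard library's `difflib`. The snippet assumes the score is the plain `SequenceMatcher.ratio()` averaged over several generations. The repository's exact normalization and aggregation are not stated here, so treat the helper names and the averaging as assumptions.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(generated: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, generated, reference).ratio()


def mean_similarity(candidates: list, reference: str) -> float:
    """Average similarity across several runs (for example, different seeds)."""
    return mean(similarity(c, reference) for c in candidates)


reference = "fix: use new_zp for magnitude calculation"
runs = [
    "fix: use new_zp instead of old_zp for magnitude calculation",
    "refactor: update magnitude calculation to use new zeropoint",
]
print(round(mean_similarity(runs, reference), 3))
```

String similarity rewards surface overlap, so a correct message worded differently from the reference can still score low.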

Limitations and risks

Diff size sensitivity: Very large diffs may be truncated; consider summarizing large changes.
Domain bias: Training set emphasized Python diffs; behavior may be better for Python-heavy repos.
Hallucinations: As with any LLM, may produce generic or mismatched messages if the diff is ambiguous.
Security: Do not feed secrets; generated text may inadvertently paraphrase sensitive context.
Language: System prompt enforces English responses.
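
For the security point above, a coarse pre-check can flag obviously sensitive lines before a diff is passed to the model. This is only a sketch: the patterns are illustrative assumptions and far from exhaustive, and `find_suspicious_lines` is a hypothetical helper that is not part of this repository.

```python
import re

# Illustrative patterns only; a dedicated secret scanner covers far more cases.
SUSPICIOUS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # shape of an AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)\b(password|passwd|secret|api[_-]?key|token)\b\s*[:=]\s*\S+"),
]


def find_suspicious_lines(diff_text: str) -> list:
    """Return (line_number, line) pairs that match any pattern.

    Every line is scanned, because removed and context lines are also
    part of the text the model would see.
    """
    hits = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SUSPICIOUS):
            hits.append((number, line))
    return hits


diff = '--- a/settings.py\n+API_KEY = "abc123"\n+tokenizer = load()'
print(find_suspicious_lines(diff))
```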

Intended use

Assist developers by proposing a concise commit subject from a given git diff.
Not a replacement for human judgment; review messages before committing.

How to format inputs yourself

If you’re not using the CLI helpers, follow this structure with the Gemma chat template:

system: English-only instruction for commit message generation (see above)
user: instruction + the diff in ```diff code fences
assistant: the target single-line subject (for training) or left empty (for inference)

The repository’s format_commit_message_prompt builds the correct prompt for Gemma 3.
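
To make the structure concrete, the sketch below approximates the text that Gemma 3's chat template produces for an inference prompt: the system text is folded into the first user turn, and the prompt ends with an open model turn. It is a hand-written approximation for illustration (`render_gemma_prompt` is a hypothetical name). In real code, prefer `tokenizer.apply_chat_template`, which is authoritative.

```python
def render_gemma_prompt(system: str, user: str, bos: str = "<bos>") -> str:
    """Approximate Gemma 3 chat formatting for one system + user exchange."""
    return (
        f"{bos}<start_of_turn>user\n"
        # Gemma has no separate system turn; the system text prefixes the user turn.
        f"{system}\n\n{user.strip()}<end_of_turn>\n"
        # Leaving the model turn open is what add_generation_prompt=True does.
        "<start_of_turn>model\n"
    )


system = (
    "You are a helpful assistant that generates git commit messages. "
    "Always respond in English only. Do not use any other language."
)
user = (
    "Generate a concise and descriptive commit message for this git diff:"
    "\n\n```diff\n+print('hi')\n```"
)
print(render_gemma_prompt(system, user))
```

For a training example, the assistant subject would follow the open model turn and be closed with `<end_of_turn>`.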

License and credits

Base model: Google Gemma 3 (google/gemma-3-270m-it). Use subject to the Gemma license terms.
Fine-tuning code: MLX and utilities in this repository. See repository license for details.
