---
license: gemma
datasets:
- hks350d/commit-message-generation
- Maxscha/commitbench
language:
- en
base_model:
- google/gemma-3-270m-it
---
# Git Diff -> Commit Message (Gemma 3 270M IT + LoRA)

A small, fast model specialized to turn a git diff into a concise, English commit message. Built on top of `google/gemma-3-270m-it` and fine-tuned with LoRA using MLX on macOS.

## Requirements

- macOS with Apple Silicon (for MLX)
- Python 3.8+
- Required packages:
  ```bash
  pip install mlx-lm transformers
  ```

## What this model expects (most important)

- Input type: a unified git diff as plain text.
- Wrap the diff in a Markdown code fence labeled `diff` for best results.
- The diff should look like the output of `git diff --no-color` (hunk headers like `@@`, `+`/`-` line prefixes, file headers, etc.).
- Keep diffs reasonably sized. The training/CLI path truncates diffs to ~3,000 characters and trains/infers with a context window of ~2,048 tokens. Extremely large diffs should be summarized or sampled.
- Language of response: English only. The system prompt enforces English output.
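
The size limit above can be applied before the prompt is built. The following is a minimal sketch, assuming a plain character cut is acceptable. The helper name `truncate_diff` and the constant `MAX_DIFF_CHARS` are illustrative and not part of this repository, whose own truncation logic may differ.

```python
MAX_DIFF_CHARS = 3000  # assumption: mirrors the ~3,000-character limit described above


def truncate_diff(diff_text: str, max_chars: int = MAX_DIFF_CHARS) -> str:
    """Cut a diff to at most max_chars characters, ending on a whole line when possible."""
    if len(diff_text) <= max_chars:
        return diff_text
    clipped = diff_text[:max_chars]
    # Drop the trailing (possibly partial) line so the model never sees a half-written diff line.
    last_newline = clipped.rfind("\n")
    if last_newline > 0:
        clipped = clipped[:last_newline]
    return clipped
```

Cutting at a line boundary keeps every remaining line a valid diff line, at the cost of losing up to one extra line of context.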

### Training Data Format

This model was trained on the `data/train_gpt-oss-20b.jsonl` dataset in this repository. The training data uses Gemma's chat template format with the following exact structure:

**User prompt format (as seen in training data):**
````
Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
     cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
     cutout.zmag = new_zp
 
+    if math.fabs(new_zp - old_zp) > 0.3:
+        logging.warning("Large change in zeropoint detected: {}  -> {}".format(old_zp, new_zp))
+
     try:
-        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
         (x, y) = cutout.get_observed_coordinates((x, y))
     except:
         logging.warn("Failed to do photometry.")
```
````

**Important:** To get the best results, match this exact format including:
- The instruction text: "Generate a concise and descriptive commit message for this git diff:"
- The double newline after the instruction
- The diff wrapped in triple backticks with `diff` language tag
- Hash placeholders shown as `<HASH>..<HASH>` in the diff headers
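
Real `git diff` output carries object IDs in the `index` header, while the training format shows `<HASH>..<HASH>`. The sketch below shows one way to mask them before prompting. The function name `mask_hashes` and the regular expression are assumptions made for illustration, not utilities from this repository, and the effect of masking on output quality has not been measured here.

```python
import re

# Matches the "index <old>..<new>" header that git prints for each file.
# Assumption: hexadecimal object IDs (abbreviated or full), optionally followed by a file mode.
_INDEX_LINE = re.compile(r"^index [0-9a-f]+\.\.[0-9a-f]+", flags=re.MULTILINE)


def mask_hashes(diff_text: str) -> str:
    """Replace object IDs in `index` headers with the <HASH> placeholder."""
    return _INDEX_LINE.sub("index <HASH>..<HASH>", diff_text)
```

Because the pattern is anchored to the start of a line, added or removed source lines (which begin with `+`, `-`, or a space) are left untouched.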

### Chat template (Gemma 3)
The model was trained using Gemma's chat template with the system prompt enforcing English-only responses. The conceptual structure is:

- system: "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
- user: The exact format shown above
- assistant: single-line commit message (target)

Training data (chat format) examples were stored like:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."},
    {"role": "user", "content": "Generate a concise and descriptive commit message for this git diff:\n\n```diff\n<diff text>\n```"},
    {"role": "assistant", "content": "<single-line commit message>"}
  ]
}
```
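
For anyone preparing additional examples in the same shape, a record like the one above could be produced as follows. This is a sketch only: `to_chat_record` is a hypothetical helper, not the conversion script used for this repository's dataset.

```python
import json

SYSTEM_PROMPT = (
    "You are a helpful assistant that generates git commit messages. "
    "Always respond in English only. Do not use any other language."
)
INSTRUCTION = "Generate a concise and descriptive commit message for this git diff:"


def to_chat_record(diff_text: str, commit_message: str) -> str:
    """Serialize one (diff, message) pair as a single JSONL line."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{INSTRUCTION}\n\n```diff\n{diff_text}\n```"},
            {"role": "assistant", "content": commit_message},
        ]
    }
    # json.dumps escapes newlines, so each record stays on one physical line.
    return json.dumps(record, ensure_ascii=False)
```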

## Output

- A single-line commit subject, in English.
- The CLI post-processes the generation and returns the first non-empty line.
- Keep it concise and descriptive; optionally target ~50–72 characters where possible.
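
The post-processing described above can be approximated in a few lines. This sketch assumes a 72-character soft limit. `extract_subject` is an illustrative name and not the CLI's actual function.

```python
from typing import Tuple


def extract_subject(generated_text: str, max_len: int = 72) -> Tuple[str, bool]:
    """Return (subject, within_limit) for the first non-empty line of the output."""
    for line in generated_text.splitlines():
        subject = line.strip()
        if subject:
            return subject, len(subject) <= max_len
    # No non-empty line was generated.
    return "", True
```

The boolean lets a caller decide whether to warn about, regenerate, or accept an over-long subject instead of truncating it silently.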

## Quick usage

### Python Script (MLX)

Here's a complete standalone script to generate commit messages using this model:

```python
#!/usr/bin/env python3
"""
Standalone script to generate git commit messages using the fine-tuned Gemma model.
Requires: mlx-lm, transformers
Install with: pip install mlx-lm transformers
"""

import subprocess
import sys
from mlx_lm import load, generate
from transformers import AutoTokenizer

def get_staged_diff():
    """Get the staged git diff from the current repository."""
    try:
        result = subprocess.run(
            ['git', 'diff', '--staged', '--no-color'], 
            capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError:
        print("Error: Could not get git diff. Make sure you're in a git repository with staged changes.")
        return None

def format_prompt(diff_text, tokenizer):
    """Format the diff into the exact training data format."""
    system_prompt = "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
    user_message = f"Generate a concise and descriptive commit message for this git diff:\n\n```diff\n{diff_text}\n```"
    
    # Format using Gemma chat template
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]
    
    prompt = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    return prompt

def generate_commit_message(diff_text, model_path="your-username/git-diff-to-commit-gemma-3-270m"):
    """Generate a commit message from a git diff."""
    
    # Load model and tokenizer
    print("Loading model...")
    model, mlx_tokenizer = load(model_path)
    hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
    
    # Format the prompt
    prompt = format_prompt(diff_text, hf_tokenizer)
    
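    # Generate response
    print("Generating commit message...")
    # Recent mlx-lm releases take sampling settings through a sampler object
    # instead of temp/top_p keyword arguments on generate().
    from mlx_lm.sample_utils import make_sampler
    response = generate(
        model, 
        mlx_tokenizer, 
        prompt=prompt, 
        max_tokens=100,
        sampler=make_sampler(temp=0.7, top_p=0.9),
        verbose=False
    )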
    
    # mlx_lm.generate returns only the newly generated text (the prompt is not echoed)
    generated_text = response.strip()
    
    # Return the first non-empty line
    lines = [line.strip() for line in generated_text.split('\n') if line.strip()]
    return lines[0] if lines else "Unable to generate commit message"

def main():
    """Main function - can be used with staged diff or provided diff text."""
    
    if len(sys.argv) > 1:
        # Use provided diff file
        diff_file = sys.argv[1]
        try:
            with open(diff_file, 'r') as f:
                diff_text = f.read().strip()
        except FileNotFoundError:
            print(f"Error: File {diff_file} not found.")
            return
    else:
        # Get staged diff from git
        diff_text = get_staged_diff()
        if not diff_text:
            print("No staged changes found. Stage some changes with 'git add' first.")
            return
    
    if not diff_text:
        print("No diff content to process.")
        return
    
    # Generate and print commit message
    commit_message = generate_commit_message(diff_text)
    print("\nSuggested commit message:")
    print(f"  {commit_message}")

if __name__ == "__main__":
    main()
```

### Usage Examples

1. **Generate from staged git changes:**
   ```bash
   python generate_commit.py
   ```

2. **Generate from a diff file:**
   ```bash
   python generate_commit.py my_changes.diff
   ```

3. **Use in your own code:**
   ```python
   from generate_commit import generate_commit_message
   
   diff = """diff --git a/app.py b/app.py
   index e69de29..f4c3b4a 100644
   --- a/app.py
   +++ b/app.py
   @@ -0,0 +1,3 @@
   +def add(a, b):
   +    return a + b
   """
   
   message = generate_commit_message(diff)
   print(message)
   ```

## Examples

Input (user message content as formatted in training data):

````
Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
     cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
     cutout.zmag = new_zp
 
+    if math.fabs(new_zp - old_zp) > 0.3:
+        logging.warning("Large change in zeropoint detected: {}  -> {}".format(old_zp, new_zp))
+
     try:
-        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
         (x, y) = cutout.get_observed_coordinates((x, y))
     except:
         logging.warn("Failed to do photometry.")
```
````


Possible outputs:
- fix: use new_zp instead of old_zp for magnitude calculation and add zeropoint change warning
- fix: correct zeropoint usage in photometry and add warning for large zeropoint changes
- refactor: update magnitude calculation to use new zeropoint and add change detection

## Training summary

- Base model: `google/gemma-3-270m-it` (Gemma 3, 270M, instruction-tuned).
- Method: LoRA fine-tuning with MLX (`mlx_lm lora`). Prompt masking was enabled, so the loss is computed only on the assistant response and not on the system or user turns.
- **Training data**: `data/train_gpt-oss-20b.jsonl` in this repository, a dataset converted to chat format with diffs wrapped in `diff`-labeled code fences and single-line English commit messages as targets. This dataset is Python-focused.
- Data format: Each training example uses the exact user prompt format shown above in the chat template structure.
- Context/config highlights: max sequence length ~2048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to be model-friendly.
- **Important**: To achieve best results, match the exact input format used in the training data.

## Evaluation

- The repo includes a lightweight evaluation that compares generated messages to a reference using a simple string similarity (SequenceMatcher) across multiple runs (varying the RNG seed). Results and artifacts are saved under `evaluation_results/`.
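
The similarity measure mentioned above can be reproduced with the standard library. The sketch below is an approximation under stated assumptions: the function names are illustrative, lowercasing before comparison is an added choice, and the repository's evaluation script may normalize or aggregate differently.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Iterable


def similarity(generated: str, reference: str) -> float:
    """Character-level similarity ratio in [0, 1]; 1.0 means identical strings."""
    # Assumption: case-insensitive comparison.
    return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()


def mean_similarity(generations: Iterable[str], reference: str) -> float:
    """Average similarity of several generations (e.g. one per seed) to one reference."""
    return mean(similarity(g, reference) for g in generations)
```

A string-overlap score like this rewards shared wording, not shared meaning, so two equally valid commit messages can still score low against each other.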

## Limitations and risks

- Diff size sensitivity: Very large diffs may be truncated; consider summarizing large changes.
- Domain bias: Training set emphasized Python diffs; behavior may be better for Python-heavy repos.
- Hallucinations: As with any LLM, may produce generic or mismatched messages if the diff is ambiguous.
- Security: Do not feed secrets; generated text may inadvertently paraphrase sensitive context.
- Language: System prompt enforces English responses.

## Intended use

- Assist developers by proposing a concise commit subject from a given git diff.
- Not a replacement for human judgment; review messages before committing.

## How to format inputs yourself

If you’re not using the CLI helpers, follow this structure with the Gemma chat template:

- system: English-only instruction for commit message generation (see above)
- user: instruction + the diff in a `diff`-labeled code fence
- assistant: the target single-line subject (for training) or left empty (for inference)

The repository’s `format_commit_message_prompt` builds the correct prompt for Gemma 3.

## License and credits

- Base model: Google Gemma 3 (`google/gemma-3-270m-it`). Use subject to the Gemma license terms.
- Fine-tuning code: MLX and utilities in this repository. See repository license for details.