Git Diff -> Commit Message (Gemma 3 270M IT + LoRA)
A small, fast model specialized to turn a git diff into a concise, English commit message. Built on top of google/gemma-3-270m-it
and fine-tuned with LoRA using MLX on macOS.
Requirements
- macOS with Apple Silicon (for MLX)
- Python 3.8+
- Required packages: `pip install mlx-lm transformers`
What this model expects (most important)
- Input type: a unified git diff as plain text.
- Wrap the diff in a Markdown code fence labeled `diff` for best results.
- The diff should look like the output of `git diff --no-color` (hunk headers like `@@`, `+`/`-` line prefixes, file headers, etc.).
- Keep diffs reasonably sized. The training/CLI path truncates diffs to ~3,000 characters and trains/infers with a context window of ~2,048 tokens. Extremely large diffs should be summarized or sampled.
- Language of response: English only. The system prompt enforces English output.
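The size limit above can be approximated before a diff is sent to the model. The following is a minimal sketch, not the repository's own preprocessing: the helper name `truncate_diff` and the choice to cut at the last complete line are assumptions, and the 3,000-character figure is taken from the approximate limit quoted above.

```python
MAX_DIFF_CHARS = 3000  # approximate limit quoted above; the exact cutoff is an assumption


def truncate_diff(diff_text: str, max_chars: int = MAX_DIFF_CHARS) -> str:
    """Cut a diff to at most max_chars, preferring to end on a complete line."""
    if len(diff_text) <= max_chars:
        return diff_text
    clipped = diff_text[:max_chars]
    last_newline = clipped.rfind("\n")
    # Drop the trailing partial line so the model never sees a half-written change.
    return clipped[:last_newline] if last_newline > 0 else clipped
```

Cutting on a line boundary keeps every remaining `+`/`-` line intact, which is closer to what a real `git diff` looks like than a cut in the middle of a line.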
Training Data Format
This model was trained on the `data/train_gpt-oss-20b.jsonl` dataset in this repository. The training data uses Gemma's chat template format with the following exact structure:
User prompt format (as seen in training data):
Generate a concise and descriptive commit message for this git diff:
```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
cutout.zmag = new_zp
+ if math.fabs(new_zp - old_zp) > 0.3:
+ logging.warning("Large change in zeropoint detected: {} -> {}".format(old_zp, new_zp))
+
try:
- (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+ (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
(x, y) = cutout.get_observed_coordinates((x, y))
except:
logging.warn("Failed to do photometry.")
```
**Important:** To get the best results, match this exact format including:
- The instruction text: "Generate a concise and descriptive commit message for this git diff:"
- The double newline after the instruction
- The diff wrapped in triple backticks with `diff` language tag
- Hash placeholders shown as `<HASH>..<HASH>` in the diff headers
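Real `git diff` output contains concrete blob hashes on its `index` lines, while the training examples show `<HASH>..<HASH>`. One way to bring a live diff closer to that format is sketched below. The helper name `mask_hashes` and the regular expression are assumptions made for illustration; the repository may normalize hashes differently.

```python
import re

# Matches header lines such as "index e69de29..f4c3b4a 100644".
# Anchoring at the start of the line avoids touching "+"/"-" content lines.
INDEX_LINE = re.compile(r"^index [0-9a-f]+\.\.[0-9a-f]+", flags=re.MULTILINE)


def mask_hashes(diff_text: str) -> str:
    """Replace blob hashes on index lines with the <HASH>..<HASH> placeholder."""
    return INDEX_LINE.sub("index <HASH>..<HASH>", diff_text)
```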
### Chat template (Gemma 3)
The model was trained using Gemma's chat template with the system prompt enforcing English-only responses. The conceptual structure is:
- system: "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
- user: The exact format shown above
- assistant: single-line commit message (target)
Training data (chat format) examples were stored like:
```json
{
"messages": [
{"role": "system", "content": "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."},
{"role": "user", "content": "Generate a concise and descriptive commit message for this git diff:\n\n```diff\n<diff text>\n```"},
{"role": "assistant", "content": "<single-line commit message>"}
]
}
```
Output
- A single-line commit subject, in English.
- The CLI post-processes the generation and returns the first non-empty line.
- Keep it concise and descriptive; optionally target ~50–72 characters where possible.
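The post-processing described above (return the first non-empty line) is small enough to sketch directly. The function name `first_nonempty_line` and the empty-string fallback are assumptions; the CLI's actual helper may differ.

```python
def first_nonempty_line(generated_text: str, fallback: str = "") -> str:
    """Return the first line that contains non-whitespace text, stripped."""
    for line in generated_text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    # Nothing usable was generated (empty or whitespace-only output).
    return fallback
```

Taking only the first line discards any explanation the model might append after the subject, which matches the single-line target used in training.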
Quick usage
Python Script (MLX)
Here's a complete standalone script to generate commit messages using this model:
```python
#!/usr/bin/env python3
"""
Standalone script to generate git commit messages using the fine-tuned Gemma model.

Requires: mlx-lm, transformers
Install with: pip install mlx-lm transformers
"""

import subprocess
import sys

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
from transformers import AutoTokenizer


def get_staged_diff():
    """Get the staged git diff from the current repository."""
    try:
        result = subprocess.run(
            ['git', 'diff', '--staged', '--no-color'],
            capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # CalledProcessError: not a git repository; FileNotFoundError: git is not installed
        print("Error: Could not get git diff. Make sure git is installed and you're in a git repository.")
        return None


def format_prompt(diff_text, tokenizer):
    """Format the diff into the exact training data format."""
    system_prompt = "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
    user_message = f"Generate a concise and descriptive commit message for this git diff:\n\n```diff\n{diff_text}\n```"

    # Format using Gemma chat template
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return prompt


def generate_commit_message(diff_text, model_path="your-username/git-diff-to-commit-gemma-3-270m"):
    """Generate a commit message from a git diff."""
    # Load model and tokenizer
    print("Loading model...")
    model, mlx_tokenizer = load(model_path)
    hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")

    # Format the prompt
    prompt = format_prompt(diff_text, hf_tokenizer)

    # Recent mlx-lm releases take sampling settings through a sampler object
    # rather than temp/top_p keyword arguments on generate().
    sampler = make_sampler(temp=0.7, top_p=0.9)

    # Generate response
    print("Generating commit message...")
    response = generate(
        model,
        mlx_tokenizer,
        prompt=prompt,
        max_tokens=100,
        sampler=sampler,
        verbose=False
    )

    # generate() returns only the newly generated text, so the prompt
    # does not need to be sliced off.
    generated_text = response.strip()

    # Return the first non-empty line
    lines = [line.strip() for line in generated_text.split('\n') if line.strip()]
    return lines[0] if lines else "Unable to generate commit message"


def main():
    """Main function - can be used with staged diff or provided diff text."""
    if len(sys.argv) > 1:
        # Use provided diff file
        diff_file = sys.argv[1]
        try:
            with open(diff_file, 'r', encoding='utf-8') as f:
                diff_text = f.read().strip()
        except FileNotFoundError:
            print(f"Error: File {diff_file} not found.")
            return
    else:
        # Get staged diff from git
        diff_text = get_staged_diff()
        if not diff_text:
            print("No staged changes found. Stage some changes with 'git add' first.")
            return

    if not diff_text:
        print("No diff content to process.")
        return

    # Generate and print commit message
    commit_message = generate_commit_message(diff_text)
    print("\nSuggested commit message:")
    print(f"  {commit_message}")


if __name__ == "__main__":
    main()
```
Usage Examples
Generate from staged git changes:

```bash
python generate_commit.py
```

Generate from a diff file:

```bash
python generate_commit.py my_changes.diff
```

Use in your own code:

```python
from generate_commit import generate_commit_message

diff = """diff --git a/app.py b/app.py
index e69de29..f4c3b4a 100644
--- a/app.py
+++ b/app.py
@@ -0,0 +1,3 @@
+def add(a, b):
+    return a + b
"""

message = generate_commit_message(diff)
print(message)
```
Examples
Input (user message content as formatted in training data):
Generate a concise and descriptive commit message for this git diff:
```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
cutout.zmag = new_zp
+ if math.fabs(new_zp - old_zp) > 0.3:
+ logging.warning("Large change in zeropoint detected: {} -> {}".format(old_zp, new_zp))
+
try:
- (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+ (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
(x, y) = cutout.get_observed_coordinates((x, y))
except:
logging.warn("Failed to do photometry.")
```
Possible outputs:
- fix: use new_zp instead of old_zp for magnitude calculation and add zeropoint change warning
- fix: correct zeropoint usage in photometry and add warning for large zeropoint changes
- refactor: update magnitude calculation to use new zeropoint and add change detection
Training summary
- Base model: `google/gemma-3-270m-it` (Gemma 3, 270M, instruction-tuned).
- Method: LoRA fine-tuning with MLX (`mlx_lm lora`). Prompt masking was enabled so the model learns only from the assistant response.
- Training data: `data/train_gpt-oss-20b.jsonl` in this repository, a dataset converted to chat format with diffs fenced using the `diff` language tag and English, single-line commit messages as targets. This dataset is Python-focused.
- Data format: Each training example uses the exact user prompt format shown above in the chat template structure.
- Context/config highlights: max sequence length ~2048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to be model-friendly.
- Important: To achieve best results, match the exact input format used in the training data.
Evaluation
- The repo includes a lightweight evaluation that compares generated messages to a reference using a simple string similarity (SequenceMatcher) across multiple runs (varying the RNG seed). Results and artifacts are saved under `evaluation_results/`.
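To make the metric concrete, here is a minimal sketch of a SequenceMatcher-based comparison using Python's standard `difflib`. The function names, the lowercasing step, and averaging over runs are assumptions for illustration; the repository's evaluation script may score and aggregate differently.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(generated: str, reference: str) -> float:
    """Return a 0..1 similarity ratio between two commit messages."""
    # Lowercasing is an assumption so that case differences are not penalized.
    return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()


def mean_similarity(candidates, reference: str) -> float:
    """Average the similarity over several generations (e.g. one per seed)."""
    return mean(similarity(candidate, reference) for candidate in candidates)
```

A ratio of 1.0 means the strings are identical after lowercasing. Because this is a character-level measure, two messages that describe the same change in different words can still score low.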
Limitations and risks
- Diff size sensitivity: Very large diffs may be truncated; consider summarizing large changes.
- Domain bias: Training set emphasized Python diffs; behavior may be better for Python-heavy repos.
- Hallucinations: As with any LLM, may produce generic or mismatched messages if the diff is ambiguous.
- Security: Do not feed secrets; generated text may inadvertently paraphrase sensitive context.
- Language: System prompt enforces English responses.
Intended use
- Assist developers by proposing a concise commit subject from a given git diff.
- Not a replacement for human judgment; review messages before committing.
How to format inputs yourself
If you’re not using the CLI helpers, follow this structure with the Gemma chat template:
- system: English-only instruction for commit message generation (see above)
- user: instruction + the diff in ```diff code fences
- assistant: the target single-line subject (for training) or left empty (for inference)
The repository’s `format_commit_message_prompt` builds the correct prompt for Gemma 3.
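For readers assembling the structure by hand, the sketch below builds the message list for inference or for a training record. It is not the repository's `format_commit_message_prompt`; the names `build_messages` and `to_jsonl_line` are hypothetical, and the prompt strings are copied from the training format described earlier.

```python
import json

SYSTEM_PROMPT = (
    "You are a helpful assistant that generates git commit messages. "
    "Always respond in English only. Do not use any other language."
)
INSTRUCTION = "Generate a concise and descriptive commit message for this git diff:"
FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown


def build_messages(diff_text, target=None):
    """Return chat messages; the assistant turn is included only for training."""
    user_content = f"{INSTRUCTION}\n\n{FENCE}diff\n{diff_text}\n{FENCE}"
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]
    if target is not None:
        messages.append({"role": "assistant", "content": target})
    return messages


def to_jsonl_line(diff_text, target):
    """Serialize one training example as a single JSONL line."""
    return json.dumps({"messages": build_messages(diff_text, target)})
```

For inference, the two-message list (no `target`) can be passed to the tokenizer's `apply_chat_template` with `add_generation_prompt=True`, as the standalone script above does.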
License and credits
- Base model: Google Gemma 3 (`google/gemma-3-270m-it`). Use subject to the Gemma license terms.
- Fine-tuning code: MLX and utilities in this repository. See repository license for details.