---
license: gemma
datasets:
- hks350d/commit-message-generation
- Maxscha/commitbench
language:
- en
base_model:
- google/gemma-3-270m-it
---

# Git Diff -> Commit Message (Gemma 3 270M IT + LoRA)

A small, fast model specialized to turn a git diff into a concise English commit message. It is built on `google/gemma-3-270m-it` and fine-tuned with LoRA using MLX on macOS.

## Requirements

- macOS with Apple Silicon (for MLX)
- Python 3.8+
- Required packages:

```bash
pip install mlx-lm transformers
```

## What this model expects (most important)

- Input type: a unified git diff as plain text.
- Wrap the diff in a Markdown code fence labeled `diff` for best results.
- The diff should look like the output of `git diff --no-color` (hunk headers like `@@`, `+`/`-` line prefixes, file headers, etc.).
- Keep diffs reasonably sized. The training/CLI path truncates diffs to ~3,000 characters, and both training and inference use a context window of ~2,048 tokens. Extremely large diffs should be summarized or sampled.
- Language of response: English only. The system prompt enforces English output.

### Training Data Format

This model was trained on the `data/train_gpt-oss-20b.jsonl` dataset in this repository. The training data uses Gemma's chat template format with the following exact structure:

**User prompt format (as seen in training data):**

````
Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index .. 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
         cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
         cutout.zmag = new_zp
 
+        if math.fabs(new_zp - old_zp) > 0.3:
+            logging.warning("Large change in zeropoint detected: {} -> {}".format(old_zp, new_zp))
+
         try:
-            (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+            (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
             (x, y) = cutout.get_observed_coordinates((x, y))
         except:
             logging.warn("Failed to do photometry.")
```
````

**Important:** To get the best results, match this exact format, including:

- The instruction text: "Generate a concise and descriptive commit message for this git diff:"
- The double newline after the instruction
- The diff wrapped in triple backticks with the `diff` language tag
- Hash placeholders shown as `..` in the diff headers

### Chat template (Gemma 3)

The model was trained, and is run at inference time, with Gemma's chat template and a system prompt that enforces English-only responses. Conceptually:

- system: "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
- user: the exact format shown above (the instruction, a blank line, then the diff wrapped in `` ```diff `` fences)
- assistant: single-line commit message (target)

Training data (chat format) examples were stored like:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."},
    {"role": "user", "content": "Generate a concise and descriptive commit message for this git diff:\n\n```diff\n\n```"},
    {"role": "assistant", "content": ""}
  ]
}
```

## Output

- A single-line commit subject, in English.
- The CLI post-processes the generation and returns the first non-empty line.
- Keep it concise and descriptive; optionally target ~50–72 characters where possible.

## Quick usage

### Python Script (MLX)

Here's a complete standalone script to generate commit messages using this model:

```python
#!/usr/bin/env python3
"""
Standalone script to generate git commit messages using the fine-tuned Gemma model.
Requires: mlx-lm, transformers
Install with: pip install mlx-lm transformers
"""

import subprocess
import sys

from mlx_lm import load, generate
from transformers import AutoTokenizer


def get_staged_diff():
    """Get the staged git diff from the current repository."""
    try:
        result = subprocess.run(
            ['git', 'diff', '--staged', '--no-color'],
            capture_output=True,
            text=True,
            check=True
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError:
        print("Error: Could not get git diff. Make sure you're in a git repository with staged changes.")
        return None


def format_prompt(diff_text, tokenizer):
    """Format the diff into the exact training data format."""
    system_prompt = "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
    user_message = f"Generate a concise and descriptive commit message for this git diff:\n\n```diff\n{diff_text}\n```"

    # Format using Gemma chat template
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return prompt


def generate_commit_message(diff_text, model_path="your-username/git-diff-to-commit-gemma-3-270m"):
    """Generate a commit message from a git diff."""
    # Recent mlx-lm releases take sampling settings through a sampler object
    # rather than `temp`/`top_p` keyword arguments on generate().
    from mlx_lm.sample_utils import make_sampler

    # Load model and tokenizer
    print("Loading model...")
    model, mlx_tokenizer = load(model_path)
    hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")

    # Format the prompt
    prompt = format_prompt(diff_text, hf_tokenizer)

    # Generate response
    print("Generating commit message...")
    response = generate(
        model,
        mlx_tokenizer,
        prompt=prompt,
        max_tokens=100,
        sampler=make_sampler(temp=0.7, top_p=0.9),
        verbose=False
    )

    # generate() returns only the newly generated text, so the prompt
    # does not need to be sliced off the front of the response.
    generated_text = response.strip()

    # Return the first non-empty line
    lines = [line.strip() for line in generated_text.split('\n') if line.strip()]
    return lines[0] if lines else "Unable to generate commit message"


def main():
    """Main function - can be used with staged diff or provided diff text."""
    if len(sys.argv) > 1:
        # Use provided diff file
        diff_file = sys.argv[1]
        try:
            with open(diff_file, 'r') as f:
                diff_text = f.read().strip()
        except FileNotFoundError:
            print(f"Error: File {diff_file} not found.")
            return
    else:
        # Get staged diff from git
        diff_text = get_staged_diff()
        if not diff_text:
            print("No staged changes found. Stage some changes with 'git add' first.")
            return

    if not diff_text:
        print("No diff content to process.")
        return

    # Generate and print commit message
    commit_message = generate_commit_message(diff_text)
    print("\nSuggested commit message:")
    print(f"  {commit_message}")


if __name__ == "__main__":
    main()
```

### Usage Examples

The commands below assume the script above is saved as `generate_commit.py`.

1. **Generate from staged git changes:**

```bash
python generate_commit.py
```

2. **Generate from a diff file:**

```bash
python generate_commit.py my_changes.diff
```

3. **Use in your own code:**

```python
from generate_commit import generate_commit_message

diff = """diff --git a/app.py b/app.py
index e69de29..f4c3b4a 100644
--- a/app.py
+++ b/app.py
@@ -0,0 +1,2 @@
+def add(a, b):
+    return a + b
"""

message = generate_commit_message(diff)
print(message)
```

## Examples

Input (user message content as formatted in training data):

````
Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index .. 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
         cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
         cutout.zmag = new_zp
 
+        if math.fabs(new_zp - old_zp) > 0.3:
+            logging.warning("Large change in zeropoint detected: {} -> {}".format(old_zp, new_zp))
+
         try:
-            (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+            (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
             (x, y) = cutout.get_observed_coordinates((x, y))
         except:
             logging.warn("Failed to do photometry.")
```
````

Possible outputs:

- fix: use new_zp instead of old_zp for magnitude calculation and add zeropoint change warning
- fix: correct zeropoint usage in photometry and add warning for large zeropoint changes
- refactor: update magnitude calculation to use new zeropoint and add change detection

## Training summary

- Base model: `google/gemma-3-270m-it` (Gemma 3, 270M, instruction-tuned).
- Method: LoRA fine-tuning with MLX (`mlx_lm lora`). Prompt masking was enabled, so the model learns only from the assistant response and not from the prompt tokens.
- **Training data**: `data/train_gpt-oss-20b.jsonl` in this repository, a dataset converted to chat format with diffs fenced as `` ```diff `` blocks and single-line English commit messages as targets. This dataset is Python-focused.
- Data format: Each training example uses the exact user prompt format shown above in the chat template structure.
- Context/config highlights: max sequence length ~2,048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to be model-friendly.
- **Important**: To achieve best results, match the exact input format used in the training data.

## Evaluation

- The repo includes a lightweight evaluation that compares generated messages to a reference using a simple string similarity (`SequenceMatcher`) across multiple runs (varying the RNG seed). Results and artifacts are saved under `evaluation_results/`.

## Limitations and risks

- Diff size sensitivity: Very large diffs may be truncated; consider summarizing large changes.
- Domain bias: The training set emphasized Python diffs, so results are likely to be best on Python-heavy repos.
- Hallucinations: As with any LLM, the model may produce generic or mismatched messages if the diff is ambiguous.
- Security: Do not feed secrets; generated text may inadvertently paraphrase sensitive context.
- Language: The system prompt enforces English responses.

## Intended use

- Assist developers by proposing a concise commit subject from a given git diff.
- Not a replacement for human judgment; review messages before committing.

## How to format inputs yourself

If you're not using the CLI helpers, follow this structure with the Gemma chat template:

- system: English-only instruction for commit message generation (see above)
- user: instruction + the diff in `` ```diff `` code fences
- assistant: the target single-line subject (for training) or left empty (for inference)

The repository's `format_commit_message_prompt` builds the correct prompt for Gemma 3.
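
If you would rather assemble the messages by hand, the structure above can be sketched roughly as follows. This is a minimal illustration and not the repository's `format_commit_message_prompt`, whose implementation is not shown here: `build_messages`, `MAX_DIFF_CHARS`, and `FENCE` are hypothetical names, and plain character slicing is only an assumption about how the ~3,000-character truncation is applied.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant that generates git commit messages. "
    "Always respond in English only. Do not use any other language."
)
INSTRUCTION = "Generate a concise and descriptive commit message for this git diff:"
MAX_DIFF_CHARS = 3000  # assumption: simple slicing approximates the ~3,000-character cutoff
FENCE = "`" * 3  # built programmatically so this example's own code fence stays intact


def build_messages(diff_text, max_chars=MAX_DIFF_CHARS):
    """Return system + user messages in the documented layout (hypothetical helper)."""
    truncated = diff_text.strip()[:max_chars]
    # Instruction, blank line, then the diff inside a fence tagged `diff`.
    user_content = f"{INSTRUCTION}\n\n{FENCE}diff\n{truncated}\n{FENCE}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


if __name__ == "__main__":
    example = (
        "diff --git a/app.py b/app.py\n"
        "--- a/app.py\n"
        "+++ b/app.py\n"
        "@@ -0,0 +1,2 @@\n"
        "+def add(a, b):\n"
        "+    return a + b\n"
    )
    for message in build_messages(example):
        print(message["role"], "->", repr(message["content"][:60]))
```

The resulting list can then be passed to the tokenizer's `apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, as the standalone script does, leaving the assistant turn empty for inference.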

## License and credits

- Base model: Google Gemma 3 (`google/gemma-3-270m-it`). Use subject to the Gemma license terms.
- Fine-tuning code: MLX and utilities in this repository. See repository license for details.