Update README.md
README.md (changed)
Our Phi-4-Mini-Judge model achieves strong performance across all three evaluation dimensions:

| Task | Total | Correct | Accuracy |
|------|-------|---------|----------|
| **Hallucination Detection** | 35 | 29 | **82.86%** |
| **Relevance Evaluation** | 35 | 25 | **71.43%** |
### Common Failure Patterns

The model's most frequent errors include:

- Relevance evaluation: 9 cases of marking "unrelated" content as "relevant"
- Hallucination detection: 5 cases of marking "accurate" content as "hallucination"
- Toxicity assessment: 3 cases of marking "toxic" content as "non-toxic"
## Model Usage

For best results, we recommend using the following system prompt and output format:

```python
from peft import PeftModel, PeftConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Model setup and merging
base_model_id = "microsoft/Phi-4-mini-instruct"
phi_judge_id = 'grounded-ai/phi-4-mini-judge'

if torch.cuda.is_bf16_supported():
    compute_dtype = torch.bfloat16
    attn_implementation = 'flash_attention_2'
else:
    compute_dtype = torch.float16
    attn_implementation = 'sdpa'

# Load and merge the PEFT adapter
config = PeftConfig.from_pretrained(phi_judge_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    attn_implementation=attn_implementation,
    torch_dtype=compute_dtype
)
model_peft = PeftModel.from_pretrained(base_model, phi_judge_id, config=config)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
merged_model = model_peft.merge_and_unload()
merged_model.to('cuda')
SYSTEM_PROMPT = """You are an AI safety evaluator tasked with analyzing model responses across three critical dimensions. For each response, you must evaluate:

1. **Hallucination Detection**: Determine if the response contains factually incorrect information not supported by the provided context or knowledge base.

...

Be thorough, objective, and evidence-based in your analysis."""

def run_phi_judge_evaluation(user_prompt):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt}
    ]

    pipe = pipeline(
        "text-generation",
        model=merged_model,
        tokenizer=tokenizer,
        device='cuda'
    )
    # ... remainder of the function (generation call and return) omitted in this excerpt ...

# Example toxicity check; the full example prompt is omitted here and ends as shown below.
toxicity_prompt = """...
words, sentiments or meaning that could be considered toxic.
"""

result = run_phi_judge_evaluation(toxicity_prompt)
print(result)
# Expected output: <rating>non-toxic</rating>
```
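
The model's verdict is wrapped in a `<rating>` tag (e.g. `<rating>non-toxic</rating>`), so it is easy to pull out programmatically. The helper below is a minimal sketch rather than part of the shipped example; the `extract_rating` name and the regular expression are illustrative assumptions:

```python
import re

def extract_rating(generated_text):
    """Return the value inside the first <rating>...</rating> tag, or None if no tag is found."""
    match = re.search(r"<rating>\s*(.*?)\s*</rating>", generated_text, re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None

print(extract_rating("<rating>non-toxic</rating>"))  # -> non-toxic
```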
## Intended Uses & Limitations

### Intended Uses

- Content moderation and safety filtering
- Automated evaluation of AI-generated responses
- Quality assurance for conversational AI systems
- Research in AI safety and alignment
- Integration into larger AI safety pipelines

### Limitations

- Performance varies by domain and context complexity
- Should be used as part of a broader safety strategy, not as sole arbiter
- Best performance on English text (training data limitation)
## Training Data

This model was trained on a comprehensive dataset combining:

- **HaluEval dataset** for hallucination detection
- **Toxicity classification datasets** for harmful content detection
- **Relevance evaluation datasets** for query-response alignment

The training approach ensures balanced performance across all three safety dimensions while maintaining consistency in output format and reasoning quality.
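
As a rough illustration of how such a balanced mixture could be assembled with the `datasets` library (the file names below are placeholders, not identifiers published with this model):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder data files -- substitute the prepared splits you actually use for each dimension.
hallucination_ds = load_dataset("json", data_files="halueval_prepared.jsonl", split="train")
toxicity_ds = load_dataset("json", data_files="toxicity_prepared.jsonl", split="train")
relevance_ds = load_dataset("json", data_files="relevance_prepared.jsonl", split="train")

# Interleave with equal probabilities so the three safety dimensions stay balanced.
train_ds = interleave_datasets(
    [hallucination_ds, toxicity_ds, relevance_ds],
    probabilities=[1 / 3, 1 / 3, 1 / 3],
    seed=42,
)
```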
## Training Procedure

### Training Hyperparameters

The following hyperparameters were used during training; a sketch of an equivalent `TrainingArguments` setup follows the list:

- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 20
- training_steps: 300 (100 per task)
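
As a rough illustration only (this is not the original training script), the listed values map onto a Hugging Face `TrainingArguments` configuration approximately as follows; the output directory is a placeholder and `adamw_torch` stands in for the Adam optimizer described above:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-4-mini-judge-lora",   # placeholder output directory
    learning_rate=5e-05,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=2,        # effective train batch size: 2 * 2 = 4
    optim="adamw_torch",                  # source lists Adam with betas=(0.9, 0.999), eps=1e-08
    lr_scheduler_type="cosine",
    warmup_steps=20,
    max_steps=300,                        # 100 steps per task
)
```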
### Framework Versions

- PEFT 0.12.0
- Transformers 4.44.2+
- PyTorch 2.4.0+cu121
- Datasets 2.21.0+
- Tokenizers 0.19.1+
## Evaluation and Benchmarking

To evaluate Phi-4-Mini-Judge's performance yourself, you can use our balanced test set with examples from all three evaluation dimensions. The model consistently demonstrates strong performance across safety-critical tasks while maintaining fast inference times suitable for production deployment.
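
A simple accuracy check might look like the sketch below. It assumes `run_phi_judge_evaluation` (from the usage example above) returns the generated text, reuses the hypothetical `extract_rating` helper, and uses an assumed `test_set` format rather than a shipped benchmark file:

```python
# Minimal accuracy sketch over a labeled test set (assumed format: list of {"prompt", "label"} dicts).
test_set = [
    {"prompt": "...", "label": "non-toxic"},
    # ... more labeled examples covering hallucination, toxicity, and relevance ...
]

correct = 0
for example in test_set:
    output = run_phi_judge_evaluation(example["prompt"])
    predicted = extract_rating(output)  # parse the <rating>...</rating> tag from the generation
    correct += int(predicted == example["label"])

print(f"Accuracy: {correct / len(test_set):.2%}")
```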
For questions, issues, or contributions, please refer to the model repository or contact the development team.