Update README.md
README.md
CHANGED
@@ -34,12 +34,6 @@ Our Phi-4-Mini-Judge model achieves strong performance across all three evaluation
 | **Hallucination Detection** | 35 | 29 | **82.86%** |
 | **Relevance Evaluation** | 35 | 25 | **71.43%** |
 
-### Common Failure Patterns
-The model's most frequent errors include:
-- Relevance evaluation: 9 cases of marking "unrelated" content as "relevant"
-- Hallucination detection: 5 cases of marking "accurate" content as "hallucination"
-- Toxicity assessment: 3 cases of marking "toxic" content as "non-toxic"
-
 ## Model Usage
 
 For best results, we recommend using the following system prompt and output format:
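The recommended system prompt itself sits below the lines shown in this hunk, so it is not reproduced here. Purely as a hedged illustration of the usage pattern, inference with the PEFT adapter might look like the sketch below; the repo IDs, the system-prompt wording, and the input framing are placeholder assumptions, not values taken from this card.

```python
# Hypothetical inference sketch: BASE_MODEL / ADAPTER repo IDs and the
# system-prompt text are placeholders, not values from this README.
import re

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "microsoft/Phi-4-mini-instruct"  # assumed base checkpoint
ADAPTER = "<this-repo-id>"                    # the Phi-4-Mini-Judge PEFT adapter

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the adapter weights

messages = [
    # Substitute the recommended system prompt from this section.
    {"role": "system", "content": "You are a judge model. Reply inside <rating> tags."},
    {"role": "user", "content": "Query: ...\nResponse: ...\nTask: hallucination detection"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=256)
completion = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# The card's output format places the verdict inside <rating> tags.
match = re.search(r"<rating>(.*?)</rating>", completion, re.DOTALL)
print(match.group(1).strip() if match else "no rating found")
```

Rating values vary by task; the failure-pattern bullets removed above reference labels such as "accurate", "hallucination", "relevant", "unrelated", "toxic", and "non-toxic".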
@@ -171,10 +165,9 @@ The model uses a structured output format with `<rating>` tags containing one of
 ## Intended Uses & Limitations
 
 ### Intended Uses
-
+- SLM as a Judge
 - Automated evaluation of AI-generated responses
 - Quality assurance for conversational AI systems
-- Research in AI safety and alignment
 - Integration into larger AI safety pipelines
 
 ### Limitations
@@ -184,31 +177,6 @@ The model uses a structured output format with `<rating>` tags containing one of
 - Should be used as part of a broader safety strategy, not as sole arbiter
 - Best performance on English text (training data limitation)
 
-## Training Data
-
-This model was trained on a comprehensive dataset combining:
-- **HaluEval dataset** for hallucination detection
-- **Toxicity classification datasets** for harmful content detection
-- **Relevance evaluation datasets** for query-response alignment
-
-The training approach ensures balanced performance across all three safety dimensions while maintaining consistency in output format and reasoning quality.
-
-## Training Procedure
-
-### Training Hyperparameters
-
-The following hyperparameters were used during training:
-- learning_rate: 5e-05
-- train_batch_size: 2
-- eval_batch_size: 8
-- seed: 42
-- gradient_accumulation_steps: 2
-- total_train_batch_size: 4
-- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
-- lr_scheduler_type: cosine
-- lr_scheduler_warmup_steps: 20
-- training_steps: 300 (100 per task)
-
 ### Framework Versions
 
 - PEFT 0.12.0
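For context, the training hyperparameters deleted in the hunk above map onto a transformers `TrainingArguments` configuration roughly as sketched here. This is an assumed reconstruction, not the card's training script, and `output_dir` is a placeholder.

```python
# Hedged reconstruction of the deleted hyperparameter list as a
# transformers TrainingArguments object; only the numeric values come
# from the card, everything else is a placeholder assumption.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="phi-4-mini-judge-training",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=2,   # train_batch_size: 2
    per_device_eval_batch_size=8,    # eval_batch_size: 8
    seed=42,
    gradient_accumulation_steps=2,   # 2 x batch size 2 = total train batch size 4
    max_steps=300,                   # 100 steps per task across the three tasks
    lr_scheduler_type="cosine",
    warmup_steps=20,
    adam_beta1=0.9,                  # Adam with betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```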