---
license: llama3.3
language:
- en
base_model:
- meta-llama/Llama-3.3-70B-Instruct
pipeline_tag: text-generation
tags:
- llm-as-judge
- evaluation
---

# Model Card for RootSignals-Judge-Llama-70B

**Root Judge** is a powerful mid-sized LLM that enables reliable and customizable LLM system evaluations.
Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high-quality, human-annotated dataset mix covering pairwise preference judgments and multi-turn instruction following with source citation.
The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

**Root Judge**'s performance surpasses Llama-3.3-70B-Instruct and similarly sized open models on instruction following, and
achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

# 1. Intended Use Cases

**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts, such as:

- Detecting context-grounded hallucinations, e.g. in *Retrieval-Augmented Generation* (RAG) settings, in an explainable manner by providing a justification for the score
- Pairwise preference judgments due to strong evaluation instruction-following capabilities
- Serving as a custom evaluation metric powered by use-case-specific evaluation rubrics
- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions (see the sketch after this list)
- Privacy-focused settings that require local deployments

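To make the Best-of-N use case concrete, here is a minimal sketch that asks the judge to pick the best of several candidate answers through an OpenAI-compatible endpoint (see Section 3.2 for local deployment). The tag-based prompt and endpoint details are illustrative assumptions, not a fixed Root Judge interface.

```python
# Hypothetical Best-of-N sketch: the judge picks the best candidate answer.
# Assumes a local OpenAI-compatible server (see Section 3.2); the tag-based
# prompt below is illustrative, not a fixed Root Judge interface.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

question = "What is the capital of Australia?"
candidates = [
    "Sydney is the capital of Australia.",
    "The capital of Australia is Canberra.",
]

user_prompt = f"<QUESTION>{question}</QUESTION>\n" + "\n".join(
    f"<CANDIDATE_{i}>{c}</CANDIDATE_{i}>" for i, c in enumerate(candidates)
)
best = client.chat.completions.create(
    model="root-signals/RootSignals-Judge-Llama-70B",
    messages=[
        {"role": "system", "content": "Pick the most factually accurate candidate. Reply with its index only."},
        {"role": "user", "content": user_prompt},
    ],
).choices[0].message.content
# A production version should parse the reply defensively.
print(candidates[int(best.strip())])  # expected to select the Canberra answer
```
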
# 2. Performance Summary

**Root Judge** outperforms leading closed models at detecting instruction-following failures,
while providing detailed, structured justifications on long inputs of up to 32k tokens, on both internal benchmarks and the public HaluBench test set.

## 2.1 Hallucination Detection (in RAG setting)

📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench):

| Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($) |
| --- | --- | --- | --- | --- |
| **1** | **Root Judge** | 14900 | **86.3** | **3.98** |
| 2 | GPT-4o | 14900 | 86.1 | 33.12 |
| 3 | o1-preview | 14899 | 85.3 | 1062* |
| 4 | Claude Sonnet-3.5 | 14797 | 85.2 | 42.94 |
| 5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 27.43 |
| 6 | o1-mini | 14655 | 83.7 | 156 |
| 7 | Llama3.1-405b-Instruct | 14881 | 83.6 | 269.82 |

`*` = benchmarked as o1-preview; at current o1 prices, without reasoning tokens, the cost would start at $198.74 instead.
Local costs are based on Lambda Labs instances at January 2025 prices.

[🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)

## 2.2 Instruction Following

📊 Instruction-following performance across diverse benchmarks, compared to other open-weights judge and reward models (higher is better):

| Rank | Model | VRAM (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **1** | **Root Judge** | 70 | **94.6 ± 0.6** | **93.9** | 52.8 ± 3.2 | 24.6 ± 2.7 | **56.8 ± 3.1** | **64.5** | 100 |
| 2 | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.4 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
| 3 | Patronus-70B | 140 | 91.7 ± 0.8 | 83.7 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
| 4 | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.0 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
| 5 | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.5 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
| 6 | Flow Judge | 16 | 78.7 ± 1.1 | 64.6 | **60.8 ± 3.1** | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
| 7 | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | **35.9 ± 3.0** | 43.2 ± 3.1 | 54.7 | 84.8 |

[🔎 Detailed Performance Breakdown | Instruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)

## 2.3 Root Signals Internal Benchmarks

📊 Benchmark: Root Signals Internal Hallucination Detection Benchmark



*Image 1: Total pass@1 rates and consistency (delta), assessed via an ensemble of leading third-party models.*



*Image 2: Custom rubric instruction-following by high-level task.*

**Root Judge** was tested to support complex, user-defined scoring (rating) rubrics over large context sizes. It provides granular qualitative feedback and supports structured evaluation outputs as well as tool calling, as sketched below.

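As a rough sketch of the tool-calling support, the snippet below passes a function schema through an OpenAI-compatible endpoint (the local deployment from Section 3.2). The `submit_verdict` tool and its fields are hypothetical, not part of any fixed Root Judge API.

```python
# Hedged tool-calling sketch against a local OpenAI-compatible server
# (see Section 3.2). The `submit_verdict` tool and its schema are illustrative.
# Note: the server may need tool-call parsing enabled; see its docs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "submit_verdict",
        "description": "Record the evaluation verdict for a candidate response.",
        "parameters": {
            "type": "object",
            "properties": {
                "score": {"type": "number", "description": "Score in [0, 1]"},
                "justification": {"type": "string"},
            },
            "required": ["score", "justification"],
        },
    },
}]

completion = client.chat.completions.create(
    model="root-signals/RootSignals-Judge-Llama-70B",
    messages=[{"role": "user", "content": "Evaluate this claim and submit your verdict: 'The Earth orbits the Sun once a year.'"}],
    tools=tools,
)
print(completion.choices[0].message.tool_calls)  # expect one submit_verdict call
```
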
## 2.4 Other Benchmarks

<details>
<summary>📊 RewardBench</summary>

[RewardBench](https://huggingface.co/spaces/allenai/reward-bench)

| Benchmark Task | Score | Total | Accuracy |
|------------------------|-------|-------|-----------|
| alpacaeval-easy | 99.0 | 100 | 0.99 |
| alpacaeval-hard | 93.0 | 95 | 0.97894737 |
| alpacaeval-length | 86.0 | 95 | 0.90526316 |
| donotanswer | 73.5 | 136 | 0.54044118 |
| hep-cpp | 159.0 | 164 | 0.96951220 |
| hep-go | 159.0 | 164 | 0.96951220 |
| hep-java | 161.0 | 164 | 0.98170732 |
| hep-js | 159.0 | 164 | 0.96951220 |
| hep-python | 158.0 | 164 | 0.96341463 |
| hep-rust | 152.0 | 164 | 0.92682927 |
| llmbar-adver-GPTInst | 69.0 | 92 | 0.75 |
| llmbar-adver-GPTOut | 39.0 | 47 | 0.82978723 |
| llmbar-adver-manual | 32.0 | 46 | 0.69565217 |
| llmbar-adver-neighbor | 74.0 | 134 | 0.55223881 |
| llmbar-natural | 94.0 | 100 | 0.94 |
| math-prm | 357.0 | 447 | 0.79865772 |
| mt-bench-easy | 28.0 | 28 | 1.0 |
| mt-bench-hard | 32.0 | 37 | 0.86486486 |
| mt-bench-med | 40.0 | 40 | 1.0 |
| refusals-dangerous | 73.5 | 100 | 0.735 |
| refusals-offensive | 89.0 | 100 | 0.89 |
| xstest-should-refuse | 140.5 | 154 | 0.91233766 |
| xstest-should-respond | 245.0 | 250 | 0.98 |
| Chat | | | 0.96648045 |
| Chat Hard | | | 0.74561404 |
| Safety | | | 0.83986486 |
| Reasoning | | | 0.88103618 |

</details>

Although our main focus is nuanced and transparent judgment of candidate responses,
we test the judge model checkpoints extensively on public and private benchmarks
to guard against known failure modes such as catastrophic forgetting. We find that the model
preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
while slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

# 3. Getting Started

## 3.1 Via Root Signals Python SDK

The model is available on our [platform](https://app.rootsignals.ai/register?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals) as part of our evaluation suite, at no additional cost.

Install our [python library](https://github.com/root-signals/rs-python-sdk):

```bash
pip install root-signals
```

Import:

```python
from root import RootSignals

client = RootSignals()
```

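The client needs an API key from the platform. A minimal sketch of passing it explicitly (the `api_key` parameter name is an assumption; see the SDK docs linked at the end of this card for the authoritative interface):

```python
# Assumption: the constructor accepts an explicit key; consult the SDK docs
# linked at the end of this card for the authoritative way to authenticate.
client = RootSignals(api_key="your-api-key")
```
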
Create a custom evaluator powered by **Root Judge**:

```python
my_custom_judge = client.evaluators.create(
    name="Political Text Evaluator",
    intent="To measure the politics-relatedness of a given text",
    predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
    model="RootJudge",
)
```

Execute:

```python
result = my_custom_judge.run(
    response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score)  # normalized score in [0, 1]
print(result.justification)  # detailed reasoning for the score
```

## 3.2 Locally

We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use cases, together with *XML tags* for important sections in your prompt. While the model can run on 80GB VRAM, we recommend at least 96GB for evaluating long-context RAG inputs.

SGLang example for a single Nvidia H100 (80GB):

```bash
docker run \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v huggingface:/root/.cache/huggingface \
  --volume /etc/localtime:/etc/localtime:ro \
  -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
  python3 -m sglang.launch_server \
    --model-path root-signals/RootSignals-Judge-Llama-70B \
    --host 0.0.0.0 \
    --port 8000 \
    --mem-fraction-static 0.89 \
    --grammar-backend xgrammar \
    --enable-torch-compile \
    --disable-cuda-graph
```
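Once the container is up, a quick way to sanity-check the server is a minimal chat call against its OpenAI-compatible endpoint. A minimal sketch, assuming the served model name matches `--model-path` (any placeholder API key works for a local server):

```python
# Smoke test for the local SGLang server started above. The model name is
# assumed to match --model-path; any placeholder api_key works locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
reply = client.chat.completions.create(
    model="root-signals/RootSignals-Judge-Llama-70B",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(reply.choices[0].message.content)
```
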
We also validated the model on arm64 with [vLLM](https://github.com/vllm-project/vllm) on an Nvidia GH200, with max outputs up to 64k tokens:

```bash
docker run \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v huggingface:/root/.cache/huggingface \
  --volume /etc/localtime:/etc/localtime:ro \
  -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
  --model root-signals/RootSignals-Judge-Llama-70B \
  --gpu-memory-utilization 0.95 \
  --max-model-len 64k \
  --block_size 16 \
  --enable_prefix_caching
```

Detect hallucinations from context; the example below uses HaluBench:

```python
decompose_system_instruction = """
<TASK>
You are a fair judge that detects hallucinations and unjustified assumptions from question-document-answer triplets provided by the user.
Always follow the instructions below and provide your reasoning and verdict in the format specified.
</TASK>

<INSTRUCTIONS>
#1. Identify key elements in the question.
#2. List all relevant facts provided in the document.
#3. Break down the answer into its component claims.
#4. For each claim in the answer:
#a. Is it explicitly supported by the document? If yes, quote the relevant part.
#b. Is it a reasonable inference from the document? If yes, explain the reasoning.
#c. Is it unsupported or contradicted by the document? If yes, explain why.
#5. Check for any information in the answer that's present in the question but not in the document.
#6. Verify that no additional information is introduced in the answer that isn't in the document or question.
#7. Assess if the answer makes any unjustified connections or assumptions.
</INSTRUCTIONS>

<OUTPUT_EXAMPLE>
{"REASONING": "Your reasoning here where you cite the instruction step by number and provide your reasoning", "VERDICT": "PASS" or "FAIL"}
</OUTPUT_EXAMPLE>
"""

decompose_prompt = """
<QUESTION>: {question} </QUESTION>
<DOCUMENT>: {document} </DOCUMENT>
<ANSWER>: {answer} </ANSWER>
""".strip()

import pandas as pd
from openai import OpenAI
from pprint import pprint
from pydantic import BaseModel

testset_df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")
testset_df = testset_df.sample(frac=1).reset_index(drop=True)  # shuffle the test set
example_row = testset_df.iloc[0]

class DecomposeResponse(BaseModel):
    REASONING: str
    VERDICT: str

# Any placeholder key works for a local server; point base_url at e.g. SGLang, vLLM, or OpenRouter.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.beta.chat.completions.parse(
    model="root-signals/RootSignals-Judge-Llama-70B",  # or `RootJudge` if you are using the RootSignals API
    messages=[
        {"role": "system", "content": decompose_system_instruction},
        {"role": "user", "content": decompose_prompt.format(
            question=example_row["question"],
            document=example_row["passage"],
            answer=example_row["answer"])},
    ],
    response_format=DecomposeResponse,
).choices[0].message.parsed

pprint(response.REASONING)
pprint(response.VERDICT)
```

```
> ('Following the instructions: #1, the key element in the question is the '
"nationality of the magazines. #2, the document states that 'The Woman's "
"Viewpoint was a woman's magazine founded in Texas in 1923' and 'Pick Me Up! "
"is a British weekly women's magazine'. #3, the answer claims both magazines "
'are British. #4, checking each claim in the answer: a) The document does not '
"support the claim that The Woman's Viewpoint is British, instead, it says "
"the magazine was founded in Texas. b) There's no reasonable inference from "
"the document that would suggest The Woman's Viewpoint is British. c) The "
"claim about The Woman's Viewpoint is contradicted by the document. #5, the "
'answer introduces information (both being British) not supported by the '
'document. #6, additional information about both magazines being British is '
'introduced in the answer without being present in the document or question. '
'#7, the answer makes an unjustified assumption by stating both magazines are '
"British despite the document clearly stating The Woman's Viewpoint was "
'founded in Texas, implying it is not British. Therefore, the answer fails to '
'accurately reflect the information provided in the document and makes '
'unjustified assumptions based on the information given in the question and '
"document.')
'FAIL'
```
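To reproduce a pass@1-style number, the same call can be looped over the test set and the verdicts compared against HaluBench's gold labels. A minimal sketch continuing the snippet above, assuming the dataset's `label` column uses the same PASS/FAIL convention:

```python
# Hedged sketch, continuing the example above: judge a sample of HaluBench rows
# and compare verdicts to the gold labels (assumed "PASS"/"FAIL" in `label`).
def judge_row(row) -> str:
    parsed = client.beta.chat.completions.parse(
        model="root-signals/RootSignals-Judge-Llama-70B",
        messages=[
            {"role": "system", "content": decompose_system_instruction},
            {"role": "user", "content": decompose_prompt.format(
                question=row["question"], document=row["passage"], answer=row["answer"])},
        ],
        response_format=DecomposeResponse,
    ).choices[0].message.parsed
    return parsed.VERDICT

sample = testset_df.head(100)  # increase towards the full ~15k rows for a comparable number
correct = sum(judge_row(row) == row["label"] for _, row in sample.iterrows())
print(f"pass@1 on sample: {correct / len(sample):.3f}")
```
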
# 4. Model Details

## 4.1 Overview

- **Developed by:** [Root Signals Inc](https://www.rootsignals.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

## 4.2 Training Details

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland

# 5. Contact

**Links**

- [Root Signals Homepage](https://www.rootsignals.ai/)
- [Root Signals Platform](https://app.rootsignals.ai/?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals)
- [Python SDK Docs](https://sdk.rootsignals.ai/en/latest/quickstart.html)
- [Root Signals GitHub](https://github.com/root-signals/rs-python-sdk)
- [Discord](https://discord.gg/EhazTQsFnj)

**Email**

- [email protected]