HiTZ
/

gemma-2-9b-multi-truth-judge

@@ -27,14 +27,14 @@ This model card is for a judge model fine-tuned to evaluate truthfulness, based
 ### Model Description
-This model is an LLM-as-a-Judge, fine-tuned from `google/gemma-2-9b` to assess the truthfulness of text generated by other language models. The evaluation framework and findings are detailed in the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English." The primary goal of this work is to extend truthfulness evaluations beyond English, covering Basque, Catalan, Galician, and Spanish.
 - **Developed by:** Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri.
 - **Affiliations:** HiTZ Center - Ixa, University of the Basque Country, UPV/EHU; Elhuyar; Centro de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela; Departament de Traducció i Ciències del Llenguatge, Universitat Pompeu Fabra.
 - **Funded by:** MCIN/AEI/10.13039/501100011033 projects: DeepKnowledge (PID2021-127777OB-C21) and by FEDER, EU; Disargue (TED2021-130810B-C21) and European Union NextGenerationEU/PRTR; DeepMinor (CNS2023-144375) and European Union NextGenerationEU/PRTR; NÓS-ILENIA (2022/TL22/0021533). Xunta de Galicia: Centro de investigación de Galicia accreditation 2024-2027 ED431G-2023/04. UPV/EHU PIF22/84 predoc grant (Blanca Calvo Figueras). Basque Government PhD grant PRE_2024_2_0028 (Julen Etxaniz). Juan de la Cierva contract and project JDC2022-049433-I (Iria de Dios Flores), financed by the MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR.
 - **Shared by:** HiTZ Center
 - **Model type:** LLM-as-a-Judge, based on `Gemma2`
-- **Language(s) (NLP):** Fine-tuned to judge outputs in English, Spanish, Catalan, Galician, and Basque. The underlying TruthfulQA-Multi benchmark, used for context, covers these languages.
 - **License:** The base model `google/gemma-2-9b` is governed by the Gemma license. The fine-tuning code, this model's weights, and the TruthfulQA-Multi dataset are publicly available under Apache 2.0.
 - **Finetuned from model:** `google/gemma-2-9b`
@@ -101,9 +101,9 @@ Refer to the project repository (`https://github.com/hitz-zentroa/truthfulqa-mul
 ### Training Data
-The model was fine-tuned on a dataset derived from the professionally translated multilingual extensions (Basque, Catalan, Galician, Spanish) and the original English TruthfulQA benchmark \cite{lin-etal-2022-truthfulqa} created for the "Truth Knows No Language" project.
 - **Dataset Link:** `https://huggingface.co/datasets/HiTZ/truthful_judge`
-- **Training Data Specifics:** Trained on multilingual data (English, Basque, Catalan, Galician, Spanish) for truth judging. Specifically, it was trained on the Machine Translated (MT) version of the data as per Table 3 in the paper for the `gemma-2-9b` base judge.
 ### Training Procedure
@@ -129,11 +129,11 @@ Inputs were formatted to present the judge model with a question, correct answer
 #### Testing Data
-The model's evaluation methodology is described in "Truth Knows No Language: Evaluating Truthfulness Beyond English," using questions from the TruthfulQA-Multi dataset.
 #### Factors
-- **Language:** English, Basque, Catalan, Galician, Spanish.
 - **Model Type (of models being judged):** Base and instruction-tuned LLMs.
 - **Evaluation Metric:** Correlation of LLM-as-a-Judge scores with human judgments on truthfulness; comparison with multiple-choice metrics (MC2).
@@ -146,10 +146,9 @@ The model's evaluation methodology is described in "Truth Knows No Language: Eva
 #### Summary
-As reported in "Truth Knows No Language: Evaluating Truthfulness Beyond English":
-- LLMs generally perform best in English and worst in Basque.
-- LLM-as-a-Judge models demonstrated a stronger correlation with human judgments compared to MC2 metrics.
-- This specific model (`gemma9b_multi_truth_judge`) is one of the judge models fine-tuned for the experiments. Refer to Table 3 in the paper for its correlation scores.
 ## Technical Specifications

 ### Model Description
+This model is an LLM-as-a-Judge, fine-tuned from `google/gemma-2-9b` to assess the truthfulness of text generated by other language models. The evaluation framework and findings are detailed in the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English." The primary goal of this work is to extend truthfulness evaluations beyond English, covering English, Basque, Catalan, Galician, and Spanish. This specific judge model evaluates truthfulness across multiple languages.
 - **Developed by:** Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri.
 - **Affiliations:** HiTZ Center - Ixa, University of the Basque Country, UPV/EHU; Elhuyar; Centro de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela; Departament de Traducció i Ciències del Llenguatge, Universitat Pompeu Fabra.
 - **Funded by:** MCIN/AEI/10.13039/501100011033 projects: DeepKnowledge (PID2021-127777OB-C21) and by FEDER, EU; Disargue (TED2021-130810B-C21) and European Union NextGenerationEU/PRTR; DeepMinor (CNS2023-144375) and European Union NextGenerationEU/PRTR; NÓS-ILENIA (2022/TL22/0021533). Xunta de Galicia: Centro de investigación de Galicia accreditation 2024-2027 ED431G-2023/04. UPV/EHU PIF22/84 predoc grant (Blanca Calvo Figueras). Basque Government PhD grant PRE_2024_2_0028 (Julen Etxaniz). Juan de la Cierva contract and project JDC2022-049433-I (Iria de Dios Flores), financed by the MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR.
 - **Shared by:** HiTZ Center
 - **Model type:** LLM-as-a-Judge, based on `Gemma2`
+- **Language(s) (NLP):** Fine-tuned to judge outputs in multiple languages (English, Basque, Catalan, Galician, Spanish). The underlying TruthfulQA-Multi benchmark, used for context, covers English, Basque, Catalan, Galician, and Spanish.
 - **License:** The base model `google/gemma-2-9b` is governed by the Gemma license. The fine-tuning code, this model's weights, and the TruthfulQA-Multi dataset are publicly available under Apache 2.0.
 - **Finetuned from model:** `google/gemma-2-9b`
 ### Training Data
+The model was fine-tuned on a dataset derived from the TruthfulQA-Multi benchmark \cite{calvo-etal-2025-truthknowsnolanguage}.
 - **Dataset Link:** `https://huggingface.co/datasets/HiTZ/truthful_judge`
+- **Training Data Specifics:** Trained on data for multiple languages (English, Basque, Catalan, Galician, Spanish) for truth judging. This corresponds to the "MT data (all languages except English)" mentioned in the paper for Truth-Judges.
 ### Training Procedure
 #### Testing Data
+The model's evaluation methodology is described in "Truth Knows No Language: Evaluating Truthfulness Beyond English," using questions from the TruthfulQA-Multi dataset (English, Basque, Catalan, Galician, Spanish portions).
 #### Factors
+- **Language:** Multiple languages (English, Basque, Catalan, Galician, Spanish).
 - **Model Type (of models being judged):** Base and instruction-tuned LLMs.
 - **Evaluation Metric:** Correlation of LLM-as-a-Judge scores with human judgments on truthfulness; comparison with multiple-choice metrics (MC2).
 #### Summary
+As reported in "Truth Knows No Language: Evaluating Truthfulness Beyond English" (specifically Table 4 for Truth-Judges):
+- This specific model (`gemma9b_multi_truth_judge`) is the Truth-Judge fine-tuned on `google/gemma-2-9b` using combined multilingual data (English, Basque, Catalan, Galician, Spanish).
+- Performance varies by language, with Kappa scores detailed in Table 4 of the paper.
 ## Technical Specifications