juletxara commited on
Commit
76d82b7
·
verified ·
1 Parent(s): 8c98fac

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +193 -0
README.md ADDED
@@ -0,0 +1,193 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: gemma
3
+ language:
4
+ - en
5
+ tags:
6
+ - truthfulqa
7
+ - llm-judge
8
+ - hitz
9
+ - gemma
10
+ - en
11
+ - truth-judge
12
+ datasets:
13
+ - HiTZ/truthful_judge
14
+ base_model: google/gemma-2-9b-it
15
+ ---
16
+
17
+ # Model Card for HiTZ/gemma-2-9b-it-en-truth-judge
18
+
19
+ This model card is for a judge model fine-tuned to evaluate truthfulness, based on the work "Truth Knows No Language: Evaluating Truthfulness Beyond English".
20
+
21
+ ## Model Details
22
+
23
+ ### Model Description
24
+
25
+ This model is an LLM-as-a-Judge, fine-tuned from `google/gemma-2-9b-it` to assess the truthfulness of text generated by other language models. The evaluation framework and findings are detailed in the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English." The primary goal of this work is to extend truthfulness evaluations beyond English, covering Basque, Catalan, Galician, and Spanish.
26
+
27
+ - **Developed by:** Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri.
28
+ - **Affiliations:** HiTZ Center - Ixa, University of the Basque Country, UPV/EHU; Elhuyar; Centro de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela; Departament de Traducció i Ciències del Llenguatge, Universitat Pompeu Fabra.
29
+ - **Funded by:** MCIN/AEI/10.13039/501100011033 projects: DeepKnowledge (PID2021-127777OB-C21) and by FEDER, EU; Disargue (TED2021-130810B-C21) and European Union NextGenerationEU/PRTR; DeepMinor (CNS2023-144375) and European Union NextGenerationEU/PRTR; NÓS-ILENIA (2022/TL22/0021533). Xunta de Galicia: Centro de investigación de Galicia accreditation 2024-2027 ED431G-2023/04. UPV/EHU PIF22/84 predoc grant (Blanca Calvo Figueras). Basque Government PhD grant PRE_2024_2_0028 (Julen Etxaniz). Juan de la Cierva contract and project JDC2022-049433-I (Iria de Dios Flores), financed by the MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR.
30
+ - **Shared by:** HiTZ Center
31
+ - **Model type:** LLM-as-a-Judge, based on `Gemma2`
32
+ - **Language(s) (NLP):** Fine-tuned to judge outputs in `English`. The underlying TruthfulQA-Multi benchmark, used for context, covers English, Basque, Catalan, Galician, and Spanish.
33
+ - **License:** The base model `google/gemma-2-9b-it` is governed by the Gemma license. The fine-tuning code, this model's weights, and the TruthfulQA-Multi dataset are publicly available under Apache 2.0.
34
+ - **Finetuned from model:** `google/gemma-2-9b-it`
35
+
36
+ ### Model Sources
37
+
38
+ - **Repository (for the project and fine-tuning code):** `https://github.com/hitz-zentroa/truthfulqa-multi`
39
+ - **Paper:** "Truth Knows No Language: Evaluating Truthfulness Beyond English" (`https://arxiv.org/abs/2502.09387`)
40
+ - **Dataset (TruthfulQA-Multi):** `https://huggingface.co/datasets/HiTZ/truthful_judge`
41
+
42
+ ## Uses
43
+
44
+ ### Direct Use
45
+
46
+ This model is intended for direct use as an LLM-as-a-Judge. It takes a question, a reference answer, and a model-generated answer as input, and outputs a judgment on the truthfulness of the model-generated answer. This is particularly relevant for evaluating models on the TruthfulQA benchmark, specifically for English.
47
+
48
+ ### Downstream Use
49
+
50
+ This judge model could potentially be used as a component in larger systems for content moderation, automated fact-checking research, or as a basis for further fine-tuning on more specific truthfulness-related tasks or domains.
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ This model is not designed for:
55
+ - Generating general-purpose creative text or dialogue.
56
+ - Providing factual information directly (it judges, it doesn't assert).
57
+ - Use in safety-critical applications without thorough validation.
58
+ - Any application intended to deceive or spread misinformation.
59
+ The model's judgments are based on its training and may not be infallible.
60
+
61
+ ## Bias, Risks, and Limitations
62
+
63
+ The model's performance and biases are influenced by its base model (`google/gemma-2-9b-it`) and the TruthfulQA dataset. Key points from "Truth Knows No Language: Evaluating Truthfulness Beyond English":
64
+ - **Language Discrepancies:** LLMs (and thus judges based on them) may perform differently across languages. The paper found English performance to be generally highest.
65
+ - **Anglocentricity:** The original TruthfulQA benchmark has English-centric knowledge and cultural contexts. This model is specifically for English.
66
+ - **Context and Time Dependency:** Universal knowledge questions are handled better across languages than context- and time-dependent ones.
67
+ - **Informativeness:** While this is a "truth-judge", informativeness is a critical component of truthfulness assessment.
68
+ - **Imitative Falsehoods:** TruthfulQA focuses on "imitative falsehoods" – statements that are false but sound plausible or mimic common misconceptions.
69
+
70
+ ### Recommendations
71
+
72
+ - Users should be aware of the potential for varying performance across different types of questions (e.g., universal vs. culturally specific).
73
+ - For critical applications, judgments from this model should be cross-verified, ideally with human oversight.
74
+ - Consider the limitations regarding cultural and temporal variability when interpreting results.
75
+
76
+ ## How to Get Started with the Model
77
+
78
+ This model can be loaded using the Hugging Face `transformers` library.
79
+ '''python
80
+ # Example (conceptual, actual usage depends on task setup)
81
+ from transformers import AutoModelForCausalLM, AutoTokenizer
82
+
83
+ model_name = "HiTZ/gemma-2-9b-it-en-truth-judge"
84
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
85
+ model = AutoModelForCausalLM.from_pretrained(model_name)
86
+
87
+ # For LLM-as-a-Judge, you would typically format input like:
88
+ # prompt = f"Question: {question}\\nReference Answer: {ref_answer}\\nModel Answer: {model_answer}\\nIs the model answer truthful? Judge:"
89
+ # inputs = tokenizer(prompt, return_tensors="pt")
90
+ # outputs = model.generate(**inputs) # Adjust generation parameters as needed
91
+ # judgment = tokenizer.decode(outputs[0], skip_special_tokens=True)
92
+ '''
93
+ Refer to the project repository (`https://github.com/hitz-zentroa/truthfulqa-multi`) for specific examples of how judge models were used in the evaluation.
94
+
95
+ ## Training Details
96
+
97
+ ### Training Data
98
+
99
+ The model was fine-tuned on a dataset derived from the original English TruthfulQA benchmark \cite{lin-etal-2022-truthfulqa}.
100
+ - **Dataset Link:** `https://huggingface.co/datasets/HiTZ/truthful_judge` (original English portion)
101
+ - **Training Data Specifics:** Trained on English data for truth judging.
102
+
103
+ ### Training Procedure
104
+
105
+ The model was fine-tuned as an LLM-as-a-Judge. The methodology was adapted from the original TruthfulQA paper \cite{lin-etal-2022-truthfulqa}, where the model learns to predict whether an answer is truthful given a question and reference answers.
106
+
107
+ #### Preprocessing
108
+
109
+ Inputs were formatted to present the judge model with a question, correct answer(s), and the answer to be judged, prompting it to assess truthfulness.
110
+
111
+ #### Training Hyperparameters
112
+
113
+ - **Training regime:** `bfloat16` mixed precision
114
+ - **Base model:** `google/gemma-2-9b-it`
115
+ - **Epochs:** 5
116
+ - **Learning rate:** 0.01
117
+ - **Batch size:** Refer to project code
118
+ - **Optimizer:** Refer to project code
119
+ - **Transformers Version:** `4.44.2`
120
+
121
+ ## Evaluation
122
+
123
+ ### Testing Data, Factors & Metrics
124
+
125
+ #### Testing Data
126
+
127
+ The model's evaluation methodology is described in "Truth Knows No Language: Evaluating Truthfulness Beyond English," using questions from the TruthfulQA-Multi dataset (English portion).
128
+
129
+ #### Factors
130
+
131
+ - **Language:** English.
132
+ - **Model Type (of models being judged):** Base and instruction-tuned LLMs.
133
+ - **Evaluation Metric:** Correlation of LLM-as-a-Judge scores with human judgments on truthfulness; comparison with multiple-choice metrics (MC2).
134
+
135
+ #### Metrics
136
+
137
+ - **Primary Metric:** Spearman correlation between the judge model's scores and human-annotated scores for truthfulness.
138
+ - The paper found that LLM-as-a-Judge (like this model) correlates more closely with human judgments than multiple-choice metrics. For the general Gemma-2-9b-it judge trained on all languages (MT data), Kappa was 0.74 for English (Table 3 in paper).
139
+
140
+ ### Results
141
+
142
+ #### Summary
143
+
144
+ As reported in "Truth Knows No Language: Evaluating Truthfulness Beyond English":
145
+ - LLMs generally perform best in English.
146
+ - LLM-as-a-Judge models demonstrated a stronger correlation with human judgments compared to MC2 metrics.
147
+ - This specific model (`gemma9b_instruct_truth_judge`) is one of the judge models fine-tuned for the experiments. Refer to Table 3 in the paper for Judge-LLM performance (Gemma 2 9B IT was the base for the best Judge-LLM).
148
+
149
+ ## Technical Specifications
150
+
151
+ ### Model Architecture and Objective
152
+
153
+ The model is based on the `Gemma2` architecture (`Gemma2ForCausalLM`). It is a Causal Language Model fine-tuned with the objective of acting as a "judge" to predict the truthfulness of answers to questions, particularly those designed to elicit imitative falsehoods.
154
+ - **Hidden Size:** 3584
155
+ - **Intermediate Size:** 14336
156
+ - **Num Attention Heads:** 16
157
+ - **Num Hidden Layers:** 42
158
+ - **Num Key Value Heads:** 8
159
+ - **Vocab Size:** 256000
160
+
161
+ ### Compute Infrastructure
162
+
163
+ - **Hardware:** Refer to project for details.
164
+ - **Software:** PyTorch, Transformers `4.44.2`
165
+
166
+ ## Citation
167
+
168
+ **Paper:**
169
+ '''bibtex
170
+ @inproceedings{calvo-etal-2025-truthknowsnolanguage,
171
+ title = "Truth Knows No Language: Evaluating Truthfulness Beyond English",
172
+ author = "Calvo Figueras, Blanca and Sagarzazu, Eneko and Etxaniz, Julen and Barnes, Jeremy and Gamallo, Pablo and De Dios Flores, Iria and Agerri, Rodrigo",
173
+ year={2025},
174
+ eprint={2502.09387},
175
+ archivePrefix={arXiv},
176
+ primaryClass={cs.CL},
177
+ url={https://arxiv.org/abs/2502.09387}
178
+ }
179
+ '''
180
+
181
+ ## More Information
182
+
183
+ For more details on the methodology, dataset, and findings, please refer to the full paper "Truth Knows No Language: Evaluating Truthfulness Beyond English" and the project repository: `https://github.com/hitz-zentroa/truthfulqa-multi`.
184
+
185
+ ## Model Card Authors
186
+
187
+ This model card was generated based on information from the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English" by Blanca Calvo Figueras et al., and adapted from the Hugging Face model card template. Content populated by GitHub Copilot.
188
+
189
+ ## Model Card Contact
190
+
191
+ For questions about the model or the research, please contact:
192
+ - Blanca Calvo Figueras: `[email protected]`
193
+ - Rodrigo Agerri: `[email protected]`