---
license: apache-2.0
datasets:
- stanfordnlp/SHP
language:
- en
metrics:
- accuracy
tags:
- human feedback
- rlhf
- preferences
- reddit
- preference model
- RL
- NLG
- evaluation
---
# 💨🚢 SteamSHP-XL
<!-- Provide a quick summary of what the model is/does. -->
SteamSHP-XL is a preference model trained to predict -- given some context and two possible responses -- which response humans will find more helpful.
It can be used for NLG evaluation, question-answering evaluation, or to train a smaller reward model for RLHF.
It is a FLAN-T5-xl model (3B parameters) finetuned on:
1. The [Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP), which contains collective human preferences sourced from 18 different communities on Reddit (e.g., `askculinary`, `legaladvice`, etc.).
2. The helpfulness data in [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset.
There is a smaller variant called [SteamSHP-Large](https://huggingface.co/kawine/SteamSHP-flan-t5-large) that was made by finetuning FLAN-T5-large (780M parameters).
Despite being 1/4 of the size, it is on average only 0.75 points less accurate on the SHP + Anthropic test data (across all domains).
## Usage
The input text should be of the format:
```
POST: { the context, such as the 'history' column in SHP }
RESPONSE A: { first possible continuation }
RESPONSE B: { second possible continuation }
Which response is better? RESPONSE
```
The output generated by SteamSHP-XL will either be `A` or `B`.
Here's how to use the model:
```python
>>> from transformers import T5ForConditionalGeneration, T5Tokenizer
>>> device = 'cuda' # if you have a GPU
>>> tokenizer = T5Tokenizer.from_pretrained('stanfordnlp/SteamSHP-flan-t5-xl')
>>> model = T5ForConditionalGeneration.from_pretrained('stanfordnlp/SteamSHP-flan-t5-xl').to(device)
>>> input_text = "POST: Instacart gave me 50 pounds of limes instead of 5 pounds... what the hell do I do with 50 pounds of limes? I've already donated a bunch and gave a bunch away. I'm planning on making a bunch of lime-themed cocktails, but... jeez. Ceviche? \n\n RESPONSE A: Lime juice, and zest, then freeze in small quantities.\n\n RESPONSE B: Lime marmalade lol\n\n Which response is better? RESPONSE"
>>> x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
>>> y = model.generate(x, max_new_tokens=1)
>>> tokenizer.batch_decode(y, skip_special_tokens=True)
['A']
```
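Since the answer is a single `A` or `B` token, you can also turn the prediction into a preference probability by comparing the logits of those two tokens at the first decoding step. The snippet below is a sketch of this idea (it reuses `model`, `tokenizer`, `x`, and `device` from the example above and assumes `A` and `B` each map to a single token in the T5 vocabulary):

```python
import torch

# Compare the logits for 'A' and 'B' at the first decoding step
# (assumes each letter corresponds to a single token).
outputs = model.generate(x, return_dict_in_generate=True, output_scores=True, max_new_tokens=1)
logits = outputs.scores[0][0]  # logits over the vocabulary for the first generated token
token_a = tokenizer('A', add_special_tokens=False).input_ids[0]
token_b = tokenizer('B', add_special_tokens=False).input_ids[0]
prob_a = torch.softmax(logits[[token_a, token_b]], dim=0)[0].item()
print(f"P(RESPONSE A is the more helpful one) = {prob_a:.3f}")
```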
If the input exceeds the 512-token limit, you can use [pySBD](https://github.com/nipunsadvilkar/pySBD) to break the input up into sentences and include only what fits into 512 tokens.
When trying to cram an example into 512 tokens, we recommend truncating the context as much as possible and leaving the responses untouched where possible.
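Below is a minimal sketch of that strategy (the `truncate_to_fit` helper is hypothetical, not part of this repository): it keeps the two responses intact and drops trailing sentences from the context until the full prompt fits.

```python
import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)

def truncate_to_fit(post, response_a, response_b, tokenizer, max_tokens=512):
    """Hypothetical helper: shorten the POST sentence by sentence until the prompt fits."""
    sentences = segmenter.segment(post)
    while sentences:
        prompt = (f"POST: {' '.join(sentences)}\n\n"
                  f"RESPONSE A: {response_a}\n\n"
                  f"RESPONSE B: {response_b}\n\n"
                  f"Which response is better? RESPONSE")
        if len(tokenizer(prompt).input_ids) <= max_tokens:
            return prompt
        sentences = sentences[:-1]  # truncate the context, leave the responses untouched
    return None  # the responses alone do not fit within the limit
```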
## Training and Evaluation
SteamSHP-XL was finetuned on only 125K of the 392K available training examples, since we found that:
1. When the total input length exceeded the limit (512 tokens), the loss would not converge.
When possible, we crammed an example to fit under 500 tokens by truncating the context as much as possible, though some examples would still not fit despite this.
We used 500 as the limit instead of 512 to allow for slight modifications to the structure of the input without any examples exceeding the actual 512 limit.
2. Training on fewer preferences with a stronger signal led to better performance than training on all of the preferences.
From the SHP dataset, we only used preferences where the more preferred comment was at least twice as preferred as the other (i.e., `score_ratio` >= 2) and used no more than 5 preferences from each context (i.e., 5 examples per unique `post_id`) to prevent overfitting, as sketched below.
We did no such subsampling for the HH-RLHF training data.
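As a rough illustration of that filtering (an assumed reconstruction, not the actual preprocessing script), SHP could be subsampled like this:

```python
# Keep SHP preferences with score_ratio >= 2 and at most 5 examples per post_id.
# The second filter is stateful, so it relies on datasets processing examples
# sequentially (the default, single-process mode).
from collections import Counter
from datasets import load_dataset

shp = load_dataset("stanfordnlp/SHP", split="train")
shp = shp.filter(lambda ex: ex["score_ratio"] >= 2)

seen = Counter()
def at_most_five_per_post(ex):
    seen[ex["post_id"]] += 1
    return seen[ex["post_id"]] <= 5

shp = shp.filter(at_most_five_per_post)
print(f"{len(shp)} training preferences after subsampling")
```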
We evaluated the model on the SHP and HH-RLHF test data using accuracy, but only on the data that could be truncated to fit within 500 tokens (a total of 18,621 out of the 20,753 available test examples).
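As a rough illustration (an assumed sketch, not the actual evaluation script), the SHP portion of this evaluation could be reproduced as follows, reusing the hypothetical `truncate_to_fit` helper from above; in SHP, `labels == 1` means `human_ref_A` is the preferred response:

```python
from datasets import load_dataset

test = load_dataset("stanfordnlp/SHP", split="test")
correct = total = 0
for ex in test:
    prompt = truncate_to_fit(ex["history"], ex["human_ref_A"], ex["human_ref_B"],
                             tokenizer, max_tokens=500)
    if prompt is None:
        continue  # skip examples that do not fit even after truncating the context
    x = tokenizer([prompt], return_tensors='pt').input_ids.to(device)
    y = model.generate(x, max_new_tokens=1)
    pred = tokenizer.batch_decode(y, skip_special_tokens=True)[0]
    gold = 'A' if ex["labels"] == 1 else 'B'
    correct += int(pred == gold)
    total += 1
print(f"accuracy = {correct / total:.4f}")
```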
SteamSHP-XL gets an average 72.8% accuracy across all domains:
| Domain | Accuracy |
| ------ | -------- |
| askculinary | 0.7199 |
| askhr | 0.7743 |
| askdocs | 0.7210 |
| askanthropology | 0.7594 |
| asksciencefiction | 0.7283 |
| askacademia | 0.7442 |
| askengineers | 0.7183 |
| legaladvice | 0.8068 |
| explainlikeimfive | 0.7392 |
| askbaking | 0.6741 |
| askphysics | 0.8000 |
| askscience | 0.7114 |
| askphilosophy | 0.6907 |
| askvet | 0.7742 |
| changemyview | 0.7043 |
| askcarguys | 0.7568 |
| askhistorians | 0.7476 |
| asksocialscience | 0.7308 |
| anthropic (helpfulness) | 0.7310 |
| ALL | 0.7278 |
## Biases and Limitations
SteamSHP is trained to predict which of two responses humans will find *more helpful*, not which response is *less harmful*.
It should not be used to detect toxicity, make ethical judgments, or for a similar purpose.
Biases and misinformation in the datasets used to train SteamSHP may also be propagated downstream to the model predictions.
Although SHP filtered out posts with NSFW (over 18) content and chose subreddits that were well-moderated and had policies against harassment and bigotry, some of the data may still contain discriminatory or harmful language.
The responses that humans collectively found more helpful are also not guaranteed to be more factual.
The people whose preferences are captured in SHP and HH-RLHF are not representative of the broader population.
Although specific demographic information is not available, overall, the Reddit users whose preferences are captured in SHP are disproportionately male and from developed, Western, and English-speaking countries (Pew Research).
[Past work](https://www.anthropic.com/model-written-evals.pdf) by Anthropic has found that models optimized for human preference can be obsequious, at the expense of the truth.
## Contact
Please contact [email protected] if you have any questions about the model.
This model was created by Kawin Ethayarajh, Heidi (Chenyu) Zhang, Yizhong Wang, and Dan Jurafsky.
## Citation
We will have a paper out soon, but until then, please cite:
```
@online{SHP,
author = {Ethayarajh, Kawin and Zhang, Heidi and Wang, Yizhong and Jurafsky, Dan},
title = {Stanford Human Preferences Dataset},
year = 2023,
url = {https://huggingface.co/datasets/stanfordnlp/SHP},
}
``` |