Commit fafa8b7 by shenzhi-wang · verified · 1 Parent(s): 51f420f

Update README.md

Files changed (1)
  1. README.md +17 -33
README.md CHANGED
@@ -89,45 +89,29 @@ print(response)
 
  ### 3.1 Arena-Hard-Auto
 
- All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
+ All results below, except those for `Xwen-7B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
 
  #### 3.1.1 No Style Control
 
- | | Score | 95% CIs |
- | --------------------------------- | ------------------------ | ----------- |
- | **Xwen-72B-Chat** 🔑 | **86.1** (Top-1 among 🔑) | (-1.5, 1.7) |
- | Qwen2.5-72B-Instruct 🔑 | 78.0 | (-1.8, 1.8) |
- | Athene-v2-Chat 🔑 | 85.0 | (-1.4, 1.7) |
- | Llama-3.1-Nemotron-70B-Instruct 🔑 | 84.9 | (-1.7, 1.8) |
- | Llama-3.1-405B-Instruct-FP8 🔑 | 69.3 | (-2.4, 2.2) |
- | Claude-3-5-Sonnet-20241022 🔒 | 85.2 | (-1.4, 1.6) |
- | O1-Preview-2024-09-12 🔒 | **92.0** (Top-1 among 🔒) | (-1.2, 1.0) |
- | O1-Mini-2024-09-12 🔒 | 90.4 | (-1.1, 1.3) |
- | GPT-4-Turbo-2024-04-09 🔒 | 82.6 | (-1.8, 1.5) |
- | GPT-4-0125-Preview 🔒 | 78.0 | (-2.1, 2.4) |
- | GPT-4o-2024-08-06 🔒 | 77.9 | (-2.0, 2.1) |
- | Yi-Lightning 🔒 | 81.5 | (-1.6, 1.6) |
- | Yi-Large🔒 | 63.7 | (-2.6, 2.4) |
- | GLM-4-0520 🔒 | 63.8 | (-2.9, 2.8) |
+ | | Score | 95% CIs |
+ | ----------------------- | -------- | ----------- |
+ | **Xwen-7B-Chat** 🔑 | **59.4** | (-2.4, 2.1) |
+ | Qwen2.5-7B-Instruct 🔑 | 50.4 | (-2.9, 2.5) |
+ | Gemma-2-27B-IT 🔑 | 57.5 | (-2.1, 2.4) |
+ | Llama-3.1-8B-Instruct 🔑 | 21.3 | (-1.9, 2.2) |
+ | Llama-3-8B-Instruct 🔑 | 20.6 | (-2.0, 1.9) |
+ | Starling-LM-7B-beta 🔑 | 23.0 | (-1.8, 1.8) |
 
  #### 3.1.2 Style Control
 
- | | Score | 95% CIs |
- | --------------------------------- | ------------------------ | ----------- |
- | **Xwen-72B-Chat** 🔑 | **72.4** (Top-1 Among 🔑) | (-4.3, 4.1) |
- | Qwen2.5-72B-Instruct 🔑 | 63.3 | (-2.5, 2.3) |
- | Athene-v2-Chat 🔑 | 72.1 | (-2.5, 2.5) |
- | Llama-3.1-Nemotron-70B-Instruct 🔑 | 71.0 | (-2.8, 3.1) |
- | Llama-3.1-405B-Instruct-FP8 🔑 | 67.1 | (-2.2, 2.8) |
- | Claude-3-5-Sonnet-20241022 🔒 | **86.4** (Top-1 Among 🔒) | (-1.3, 1.3) |
- | O1-Preview-2024-09-12 🔒 | 81.7 | (-2.2, 2.1) |
- | O1-Mini-2024-09-12 🔒 | 79.3 | (-2.8, 2.3) |
- | GPT-4-Turbo-2024-04-09 🔒 | 74.3 | (-2.4, 2.4) |
- | GPT-4-0125-Preview 🔒 | 73.6 | (-2.0, 2.0) |
- | GPT-4o-2024-08-06 🔒 | 71.1 | (-2.5, 2.0) |
- | Yi-Lightning 🔒 | 66.9 | (-3.3, 2.7) |
- | Yi-Large-Preview 🔒 | 65.1 | (-2.5, 2.5) |
- | GLM-4-0520 🔒 | 61.4 | (-2.6, 2.4) |
+ | | Score | 95% CIs |
+ | ----------------------- | -------- | ----------- |
+ | **Xwen-7B-Chat** 🔑 | **50.3** | (-3.8, 2.8) |
+ | Qwen2.5-7B-Instruct 🔑 | 46.9 | (-3.1, 2.7) |
+ | Gemma-2-27B-IT 🔑 | 47.5 | (-2.5, 2.7) |
+ | Llama-3.1-8B-Instruct 🔑 | 18.3 | (-1.6, 1.6) |
+ | Llama-3-8B-Instruct 🔑 | 19.8 | (-1.6, 1.9) |
+ | Starling-LM-7B-beta 🔑 | 26.1 | (-2.6, 2.0) |
 
 
 
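
Reading note for the added tables: the `95% CIs` column appears to list offsets from the score rather than absolute bounds, so `59.4` with `(-2.4, 2.1)` would span roughly 57.0 to 61.5. That reading is an assumption based on how Arena-Hard-Auto usually prints its results, since the README does not state it. A minimal sketch under that assumption, using the added "No Style Control" rows, checks which models' intervals overlap with `Xwen-7B-Chat`. The helper names are hypothetical and not part of the repository.

```python
# Scores and CI offsets copied from the added "No Style Control" rows.
# Assumption: the two CI numbers are offsets relative to the score.
NO_STYLE_CONTROL = {
    "Xwen-7B-Chat": (59.4, -2.4, 2.1),
    "Qwen2.5-7B-Instruct": (50.4, -2.9, 2.5),
    "Gemma-2-27B-IT": (57.5, -2.1, 2.4),
    "Llama-3.1-8B-Instruct": (21.3, -1.9, 2.2),
    "Llama-3-8B-Instruct": (20.6, -2.0, 1.9),
    "Starling-LM-7B-beta": (23.0, -1.8, 1.8),
}


def absolute_interval(score, lower_offset, upper_offset):
    """Turn (score, offsets) into absolute (low, high) bounds, rounded to 0.1."""
    return (round(score + lower_offset, 1), round(score + upper_offset, 1))


def intervals_overlap(a, b):
    """Return True if two closed intervals (low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlapping_with(reference, table):
    """List models whose interval overlaps the reference model's interval."""
    ref = absolute_interval(*table[reference])
    return [
        name
        for name, row in table.items()
        if name != reference and intervals_overlap(ref, absolute_interval(*row))
    ]


if __name__ == "__main__":
    for name, row in NO_STYLE_CONTROL.items():
        print(name, absolute_interval(*row))
    print(overlapping_with("Xwen-7B-Chat", NO_STYLE_CONTROL))
```

Overlapping intervals are only a rough visual check, not a formal significance test. They do suggest reading small score gaps between neighboring rows with some caution.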