PracticeLLM/Twice-KoSOLAR-16.1B-test

@@ -20,7 +20,7 @@ license: cc-by-nc-sa-4.0
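 A simple curiosity arose here. **The Depth-Up-Scaling (DUS) method published by Upstage is a passthrough merge of two mistral-7B models.**
 Surprisingly, the `upstage/SOLAR-10.7B-v1.0` model built with DUS scored higher on the leaderboard than the original mistral-7B model. (See the table below.)
 So I was very curious whether applying DUS, without restriction, to other models would produce the same result. 🙃
-For now, my hypothesis is that performance will be similar or better. I want to settle this curiosity through experiments. 😋😋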
 | Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
 | --- | --- | --- | --- | --- | --- | --- | --- |
@@ -74,7 +74,31 @@ dtype: float16
 ## lm-evaluation-harness(zero-shot)
 - Following [beomi/LM-Harness](https://github.com/Beomi/ko-lm-evaluation-harness)
 ```
-(will update)
 ```
 - Following [Eleuther/LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness)

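 A simple curiosity arose here. **The Depth-Up-Scaling (DUS) method published by Upstage is a passthrough merge of two mistral-7B models.**
 Surprisingly, the `upstage/SOLAR-10.7B-v1.0` model built with DUS scored higher on the leaderboard than the original mistral-7B model. (See the table below.)
 So I was very curious whether applying DUS, without restriction, to other models would produce the same result. 🙃
+I want to settle this curiosity through experiments. 😋😋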
 | Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
 | --- | --- | --- | --- | --- | --- | --- | --- |
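
The passthrough merge behind DUS can be sketched as a layer plan. This is a minimal illustration and not the configuration used for this model: the numbers (a 32-layer base with 8 layers dropped from each copy, giving 48 layers) are an assumption taken from the published SOLAR description, and `dus_layer_plan` and `stacked_layers` are hypothetical helper names.

```python
from __future__ import annotations


def dus_layer_plan(n_layers: int, m_removed: int) -> list[tuple[str, int, int]]:
    """Return passthrough slices as (source, start, end), with end exclusive.

    Copy A keeps its first n - m layers, copy B keeps its last n - m layers.
    """
    if not 0 <= m_removed < n_layers:
        raise ValueError("m_removed must satisfy 0 <= m_removed < n_layers")
    return [
        ("copy_a", 0, n_layers - m_removed),
        ("copy_b", m_removed, n_layers),
    ]


def stacked_layers(plan: list[tuple[str, int, int]]) -> list[tuple[str, int]]:
    """Expand the slices into the ordered list of layers in the merged model."""
    return [(src, i) for src, start, end in plan for i in range(start, end)]


# Assumed values: 32-layer base, 8 layers removed from each copy.
plan = dus_layer_plan(n_layers=32, m_removed=8)
layers = stacked_layers(plan)
print(plan)
print(len(layers))
```

The merged depth is `2 * (n_layers - m_removed)`, so the middle layers of the base model appear twice in the stack.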
 ## lm-evaluation-harness(zero-shot)
 - Following [beomi/LM-Harness](https://github.com/Beomi/ko-lm-evaluation-harness)
 ```
+gpt2 (pretrained=PracticeLLM/Twice-KoSOLAR-16.1B-test), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
+|      Task      |Version| Metric |Value |   |Stderr|
+|----------------|------:|--------|-----:|---|-----:|
+|kobest_boolq    |      0|acc     |0.7201|±  |0.0120|
+|                |       |macro_f1|0.7073|±  |0.0124|
+|kobest_copa     |      0|acc     |0.6510|±  |0.0151|
+|                |       |macro_f1|0.6506|±  |0.0151|
+|kobest_hellaswag|      0|acc     |0.4520|±  |0.0223|
+|                |       |acc_norm|0.5820|±  |0.0221|
+|                |       |macro_f1|0.4475|±  |0.0222|
+|kobest_sentineg |      0|acc     |0.7078|±  |0.0229|
+|                |       |macro_f1|0.7071|±  |0.0229|
+gpt2 (pretrained=yanolja/KoSOLAR-10.7B-v0.1), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
+|      Task      |Version| Metric |Value |   |Stderr|
+|----------------|------:|--------|-----:|---|-----:|
+|kobest_boolq    |      0|acc     |0.8725|±  |0.0089|
+|                |       |macro_f1|0.8722|±  |0.0089|
+|kobest_copa     |      0|acc     |0.6850|±  |0.0147|
+|                |       |macro_f1|0.6844|±  |0.0147|
+|kobest_hellaswag|      0|acc     |0.4340|±  |0.0222|
+|                |       |acc_norm|0.5840|±  |0.0221|
+|                |       |macro_f1|0.4296|±  |0.0221|
+|kobest_sentineg |      0|acc     |0.7506|±  |0.0217|
+|                |       |macro_f1|0.7505|±  |0.0217|
 ```
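
To read the two result tables side by side, the zero-shot `acc` values can be subtracted per task. The sketch below only copies the numbers reported above; the variable names are illustrative, and the differences should be read against the reported standard errors.

```python
# Zero-shot accuracy copied from the two result tables above.
merged = {  # PracticeLLM/Twice-KoSOLAR-16.1B-test
    "kobest_boolq": 0.7201,
    "kobest_copa": 0.6510,
    "kobest_hellaswag": 0.4520,
    "kobest_sentineg": 0.7078,
}
base = {  # yanolja/KoSOLAR-10.7B-v0.1
    "kobest_boolq": 0.8725,
    "kobest_copa": 0.6850,
    "kobest_hellaswag": 0.4340,
    "kobest_sentineg": 0.7506,
}

# Positive delta: the merged model scores higher than the base model.
deltas = {task: round(merged[task] - base[task], 4) for task in merged}

print(f"{'task':<18}{'merged':>8}{'base':>8}{'delta':>9}")
for task, delta in deltas.items():
    print(f"{task:<18}{merged[task]:>8.4f}{base[task]:>8.4f}{delta:>+9.4f}")
```

On these numbers the merged model is lower on `kobest_boolq`, `kobest_copa`, and `kobest_sentineg`, and slightly higher on `kobest_hellaswag` accuracy.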
 - Following [Eleuther/LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness)