---
language:
- en
license: apache-2.0
datasets:
- HuggingFaceTB/cosmopedia
- EleutherAI/proof-pile-2
- bigcode/the-stack-dedup
- math-ai/AutoMathText
metrics:
- accuracy
- code_eval
model-index:
- name: Mistral_Pro_8B_v0.1
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 62.2
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TencentARC/Mistral_Pro_8B_v0.1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 82.13
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TencentARC/Mistral_Pro_8B_v0.1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 61.74
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TencentARC/Mistral_Pro_8B_v0.1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 49.32
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TencentARC/Mistral_Pro_8B_v0.1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 76.8
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TencentARC/Mistral_Pro_8B_v0.1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 34.19
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TencentARC/Mistral_Pro_8B_v0.1
      name: Open LLM Leaderboard
---
# Mistral-Pro-8B Model Card
## Model Description
Mistral-Pro is a progressively expanded version of the original [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) model, built by adding Transformer blocks to the base network. It combines general language understanding with domain-specific capability, particularly in programming and mathematics.
## Development and Training
Developed by Tencent's ARC Lab, Mistral-Pro is an 8-billion-parameter model: an expansion of Mistral-7B that was further trained on code and math corpora.
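The card does not publish the expansion recipe, so the sketch below only illustrates the general block-expansion idea of adding Transformer blocks to a pretrained network. The `expand_blocks` helper, the insertion interval, and the zero-initialization scheme are assumptions for illustration, not TencentARC's actual configuration:
```python
import copy

import torch
from transformers import AutoModelForCausalLM


def expand_blocks(model, interval=4):
    """Interleave zero-initialized copies of existing decoder layers.

    Zeroing each copy's attention output projection and MLP down-projection
    makes the new block a residual no-op at initialization, so the expanded
    network starts out computing the same function as the base model.
    """
    expanded = []
    for i, layer in enumerate(model.model.layers):
        expanded.append(layer)
        if (i + 1) % interval == 0:
            new_layer = copy.deepcopy(layer)
            torch.nn.init.zeros_(new_layer.self_attn.o_proj.weight)
            torch.nn.init.zeros_(new_layer.mlp.down_proj.weight)
            expanded.append(new_layer)
    # Renumber layer indices so KV-cache bookkeeping stays consistent.
    for idx, layer in enumerate(expanded):
        if hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = idx
    model.model.layers = torch.nn.ModuleList(expanded)
    model.config.num_hidden_layers = len(expanded)
    return model


# Illustrative only: interval=4 turns the 32-layer Mistral-7B into 40 layers.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = expand_blocks(base, interval=4)
```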
## Intended Use
This model is designed for a wide range of NLP tasks, with a focus on programming, mathematics, and general language tasks. It suits scenarios that require combining natural language with programming languages.
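The card ships no usage snippet; here is a minimal sketch of loading the released checkpoint with Hugging Face `transformers`, assuming standard `AutoModelForCausalLM` support (the dtype and generation settings are illustrative):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TencentARC/Mistral_Pro_8B_v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # illustrative; pick a dtype your hardware supports
    device_map="auto",
)

# A code-completion prompt, matching the model's programming focus.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```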
## Performance
Mistral_Pro_8B_v0.1 performs strongly across a range of benchmarks: it improves on Mistral's code and math results while matching the overall performance of the recently released [Gemma](https://huggingface.co/google/gemma-7b) model.
### Overall performance on language, math, and code tasks
| Model | ARC | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | HumanEval |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Gemma-7B | 61.9 | 82.2 | 64.6 | 44.8 | 79.0 | 50.9 | 32.3 |
| Mistral-7B | 60.8 | 83.3 | 62.7 | 42.6 | 78.0 | 39.2 | 28.7 |
| Mistral_Pro_8B_v0.1 | 63.2 | 82.6 | 60.6 | 48.3 | 78.9 | 50.6 | 32.9 |
## Limitations
While Mistral-Pro addresses some limitations of previous models in the series, it may still encounter challenges specific to highly specialized domains or tasks.
## Ethical Considerations
Users should be aware of potential biases in the model and use it responsibly, considering its impact on various applications.
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_TencentARC__Mistral_Pro_8B_v0.1).
| Metric |Value|
|---------------------------------|----:|
|Avg. |61.06|
|AI2 Reasoning Challenge (25-Shot)|62.20|
|HellaSwag (10-Shot) |82.13|
|MMLU (5-Shot) |61.74|
|TruthfulQA (0-shot) |49.32|
|Winogrande (5-shot) |76.80|
|GSM8k (5-shot) |34.19|
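To reproduce a single row locally, a hedged sketch using EleutherAI's lm-evaluation-harness (the backend behind the leaderboard), assuming its v0.4+ Python API; the leaderboard pins a specific harness revision and settings, so locally obtained scores may differ slightly:
```python
import lm_eval  # pip install lm-eval (EleutherAI's lm-evaluation-harness, v0.4+)

# ARC-Challenge with 25-shot prompting, mirroring the first row above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TencentARC/Mistral_Pro_8B_v0.1,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])
```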