# MiniMax-Text-01
## 1. Introduction
MiniMax-Text-01 is a language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the model's long-context capabilities, it adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). Using parallelism strategies and compute-communication overlap methods such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), the training context length is extended to 1 million tokens, and the model can handle a context of up to 4 million tokens during inference. On various academic benchmarks, MiniMax-Text-01 also performs at the level of a top-tier model.
   
## 2. Model Architecture
The architecture of MiniMax-Text-01 is briefly described as follows:
- Total Parameters: 456B
- Activated Parameters per Token: 45.9B
- Number of Layers: 80
- Hybrid Attention: one softmax attention layer follows every 7 lightning attention layers.
  - Number of attention heads: 64
  - Attention head dimension: 128
- Mixture of Experts:
  - Number of experts: 32
  - Expert hidden dimension: 9216
  - Top-2 routing strategy
- Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
- Hidden Size: 6144
- Vocab Size: 200,064
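
The hybrid layout and the partial RoPE setting listed above can be sketched in a few lines. This is a minimal illustration, not the model's actual implementation: it assumes zero-based layer indices and that each group of 8 layers starts with the 7 lightning attention layers, and the names `layer_types` and `rotary_dim` are hypothetical.

```python
# Values taken from the architecture list above.
NUM_LAYERS = 80
LIGHTNING_PER_SOFTMAX = 7
HEAD_DIM = 128


def layer_types(num_layers=NUM_LAYERS, lightning_run=LIGHTNING_PER_SOFTMAX):
    """Return the attention type of each layer.

    Assumption: every group of (lightning_run + 1) layers consists of
    `lightning_run` lightning layers followed by one softmax layer.
    """
    period = lightning_run + 1
    return [
        "softmax" if (i + 1) % period == 0 else "lightning"
        for i in range(num_layers)
    ]


layout = layer_types()
print(layout[:8])  # seven "lightning" entries followed by one "softmax"
print(layout.count("lightning"), layout.count("softmax"))  # 70 10

# RoPE is applied to half of each attention head's dimension.
rotary_dim = HEAD_DIM // 2
print(rotary_dim)  # 64
```

Under these assumptions, 10 of the 80 layers use softmax attention, and 64 of the 128 dimensions in each head are rotated.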
## 3. Evaluation
### Core Academic Benchmarks
| **Tasks**                     | **GPT-4o (11-20)** | **Claude-3.5-Sonnet (10-22)** | **Gemini-1.5-Pro (002)** | **Gemini-2.0-Flash (exp)** | **Qwen2.5-72B-Inst.** | **DeepSeek-V3** | **Llama-3.1-405B-Inst.** | **MiniMax-Text-01** |
|-------------------------------|--------------------|-------------------------------|--------------------------|----------------------------|-----------------------|-----------------|--------------------------|---------------------|
| **General**                   |                    |                               |                          |                            |                       |                 |                          |                     |
| MMLU*                      | 85.7               | 88.3                          | 86.8                     | 86.5                       | 86.1                  | 88.5        | **88.6**                 | 88.5                |
| MMLU-Pro*                  | 74.4               | **78.0**                      | 75.8                     | 76.4                       | 71.1                  | 75.9            | 73.3                     | 75.7                |
| SimpleQA                      | **39.0**           | 28.1                          | 23.4                     | 26.6                       | 10.3                  | 24.9            | 23.2                     | 23.7                |
| C-SimpleQA                    | 64.6               | 56.8                          | 59.4                     | 63.3                       | 52.2                  | 64.8            | 54.7                     | **67.4**            |
| IFEval _(avg)_                | 84.1               | **90.1**                      | 89.4                     | 88.4                       | 87.2                  | 87.3            | 86.4                     | 89.1                |
| Arena-Hard                    | **92.4**           | 87.6                          | 85.3                     | 72.7                       | 81.2                  | 91.4            | 63.5                     | 89.1                |
| **Reasoning**                 |                    |                               |                          |                            |                       |                 |                          |                     |
| GPQA* _(diamond)_          | 46.0               | **65.0**                      | 59.1                     | 62.1                       | 49.0                  | 59.1            | 50.7                     | 54.4                |
| DROP* _(F1)_               | 89.2               | 88.8                          | 89.2                     | 89.3                       | 85.0                  | 91.0        | **92.5**                 | 87.8                |
| **Mathematics**               |                    |                               |                          |                            |                       |                 |                          |                     |
| GSM8k*                     | 95.6               | **96.9**                      | 95.2                     | 95.4                       | 95.8                  | 96.7            | 96.7                     | 94.8                |
| MATH*                      | 76.6               | 74.1                          | **84.6**                 | 83.9                       | 81.8                  | **84.6**        | 73.8                     | 77.4                |
| **Coding**                    |                    |                               |                          |                            |                       |                 |                          |                     |
| MBPP+                         | 76.2               | 75.1                          | 75.4                     | 75.9                       | 77.0                  | **78.8**        | 73.0                     | 71.7                |
| HumanEval                     | 90.2               | **93.7**                      | 86.6                     | 89.6                       | 86.6                  | 92.1            | 89.0                     | 86.9                |

\* Evaluated following a _0-shot CoT_ setting.
### Long-Context Benchmarks
#### 4M Needle In A Haystack Test
   
#### RULER
| Model | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M |
|-------|----|----|-----|-----|-----|------|------|------|----|
| **GPT-4o (11-20)** | **0.970** | 0.921 | 0.890 | 0.888 | 0.884 | - | - | - | - |
| **Claude-3.5-Sonnet (10-22)** | 0.965 | 0.960 | 0.957 | 0.950 | **0.952** | 0.938 | - | - | - |
| **Gemini-1.5-Pro (002)** | 0.962 | 0.960 | **0.960** | **0.958** | 0.938 | 0.917 | 0.916 | 0.861 | 0.850 |
| **Gemini-2.0-Flash (exp)** | 0.960 | 0.960 | 0.951 | 0.957 | 0.937 | 0.860 | 0.797 | 0.709 | - |
| **MiniMax-Text-01** | 0.963 | **0.961** | 0.953 | 0.954 | 0.943 | **0.947** | **0.945** | **0.928** | **0.910** |
#### LongBench v2
| **Model**                  | **overall** | **easy** | **hard** | **short** | **medium** | **long** |
|----------------------------|-------------|----------|----------|------------|------------|----------|
| Human                      | 53.7        | 100.0    | 25.1     | 47.2       | 59.1       | 53.7     |
| **w/ CoT**                 |             |          |          |            |            |          |
| GPT-4o (11-20)             | 51.4        | 54.2     | 49.7     | 59.6       | 48.6       | 43.5     |
| Claude-3.5-Sonnet (10-22)  | 46.7        | 55.2     | 41.5     | 53.9       | 41.9       | 44.4     |
| DeepSeek-V3                | -           | -        | -        | -          | -          | -        |
| Qwen2.5-72B-Inst.          | 43.5        | 47.9     | 40.8     | 48.9       | 40.9       | 39.8     |
| **MiniMax-Text-01**        | **56.5**    | **66.1** | **50.5** | **61.7**   | **56.7**   | **47.2** |
| **w/o CoT**                |             |          |          |            |            |          |
| GPT-4o (11-20)             | 50.1        | 57.4     | 45.6     | 53.3       | 52.4       | 40.2     |
| Claude-3.5-Sonnet (10-22)  | 41.0        | 46.9     | 37.3     | 46.1       | 38.6       | 37.0     |
| DeepSeek-V3                | 48.7        | -        | -        | -          | -          | -        |
| Qwen2.5-72B-Inst.          | 42.1        | 42.7     | 41.8     | 45.6       | 38.1       | **44.4** |
| **MiniMax-Text-01**        | **52.9**    | **60.9** | **47.9** | **58.9**   | **52.6**   | 43.5     |
#### MTOB
| **Context Type** | **no context** | **half book** | **full book** | **Δ half book** | **Δ full book** |
|------------------|----------------|---------------|---------------|------------------|-----------------|
| **eng → kalam (ChrF)** | | | | | |
| GPT-4o (11-20) | 9.90 | **54.30** | - | 44.40 | - |
| Claude-3.5-Sonnet (10-22) | 20.22 | 53.62 | 55.65 | 33.39 | 35.42 |
| Gemini-1.5-Pro (002) | 16.79 | 53.68 | **57.90** | 36.89 | 41.11 |
| Gemini-2.0-Flash (exp) | 12.20 | 49.50 | 53.30 | 37.30 | 41.10 |
| Qwen-Long | 16.55 | 48.48 | 45.94 | 31.92 | 29.39 |
| **MiniMax-Text-01** | 6.0 | 51.74 | 51.60 | **45.7** | **45.6** |
| **kalam → eng (BLEURT)** | | | | | |
| GPT-4o (11-20) | 33.20 | 58.30 | - | 25.10 | - |
| Claude-3.5-Sonnet (10-22) | 31.42 | 59.70 | 62.30 | 28.28 | 30.88 |
| Gemini-1.5-Pro (002) | 32.02 | **61.52** | **63.09** | **29.50** | **31.07** |
| Gemini-2.0-Flash (exp) | 33.80 | 57.50 | 57.00 | 23.70 | 23.20 |
| Qwen-Long | 30.13 | 53.14 | 32.15 | 23.01 | 2.02 |
| **MiniMax-Text-01** | 33.65 | 57.10 | 58.00 | 23.45 | 24.35 |
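
The Δ columns are not defined in the table, but the numbers are consistent with reading them as the gain over the no-context score. The sketch below checks that reading for the MiniMax-Text-01 row of the eng → kalam block; the helper name `context_gain` is hypothetical.

```python
# Scores copied from the MiniMax-Text-01 row (eng -> kalam, ChrF).
no_context, half_book, full_book = 6.0, 51.74, 51.60


def context_gain(with_context: float, without_context: float) -> float:
    """Score improvement from adding in-context material (assumed meaning of Δ)."""
    return round(with_context - without_context, 2)


delta_half = context_gain(half_book, no_context)
delta_full = context_gain(full_book, no_context)
print(delta_half, delta_full)
```

The results agree with the reported 45.7 and 45.6 to one decimal place.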
## 4. Quickstart
Here we provide a simple example of loading the tokenizer and model to generate content.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, QuantoConfig, GenerationConfig
# load hf config
hf_config = AutoConfig.from_pretrained("MiniMax-Text-01", trust_remote_code=True)
# quantization config, int8 is recommended
quantization_config =  QuantoConfig(
            weights="int8",
            modules_to_not_convert=[
                "lm_head",
                "embed_tokens",
            ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
            + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
        )
# assume 8 GPUs
world_size = 8
# set device map (world_size must be defined before it is used here)
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}'
}
layers_per_device = hf_config.num_hidden_layers // world_size
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("MiniMax-Text-01")
prompt = "Hello!"
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# tokenize and move to device
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")
# load bfloat16 model, move to device, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMax-Text-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)
# generate response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"response: {response}")
```
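
The device map in the example places the embedding on the first GPU, the final norm and `lm_head` on the last, and splits the decoder layers into equal contiguous blocks. The sketch below wraps that logic in a standalone function so it can be inspected without loading the model. The function name `build_device_map` and its handling of layer counts that do not divide evenly are assumptions added for illustration; the original example relies on 80 layers dividing evenly over 8 GPUs.

```python
def build_device_map(num_layers: int, world_size: int) -> dict[str, str]:
    """Assign decoder layers to GPUs in contiguous blocks.

    Hypothetical helper: mirrors the module names used in the example above
    and gives any remainder layers to the lowest-numbered devices.
    """
    if world_size < 1 or num_layers < world_size:
        raise ValueError("need at least one layer per device")

    device_map = {
        "model.embed_tokens": "cuda:0",
        "model.norm": f"cuda:{world_size - 1}",
        "lm_head": f"cuda:{world_size - 1}",
    }

    base, extra = divmod(num_layers, world_size)
    layer = 0
    for device in range(world_size):
        # The first `extra` devices each take one additional layer.
        count = base + (1 if device < extra else 0)
        for _ in range(count):
            device_map[f"model.layers.{layer}"] = f"cuda:{device}"
            layer += 1
    return device_map


device_map = build_device_map(num_layers=80, world_size=8)
print(device_map["model.layers.0"], device_map["model.layers.79"])  # cuda:0 cuda:7
```

With 80 layers and 8 GPUs this yields the same mapping as the loop in the example: 10 layers per device.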
## 5. Chatbot & API
For general use and evaluation, we provide a [Chatbot](https://www.hailuo.ai/) with online search capabilities and the [online API](https://intl.minimaxi.com) for developers.
Contact us at [model@minimaxi.com](mailto:model@minimaxi.com).