---
library_name: transformers
pipeline_tag: text-generation
tags:
- glm4_moe
- GPTQ
- Int4-Int8Mix
- 量化修复
- vLLM
base_model:
- zai-org/GLM-4.5
base_model_relation: quantized
---
# GLM-4.5-GPTQ-Int4-Int8Mix
Base model: [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5)
### 【vLLM Launch Command for 8-GPU Single Node】
<i>Note: When serving this model on 8 GPUs, you must add --enable-expert-parallel; otherwise the expert tensors cannot be partitioned evenly across the GPUs and loading will fail. This flag is not required for 4-GPU setups.</i>
```bash
CONTEXT_LENGTH=32768
vllm serve \
QuantTrio/GLM-4.5-GPTQ-Int4-Int8Mix \
--served-model-name GLM-4.5-GPTQ-Int4-Int8Mix \
--enable-expert-parallel \
--swap-space 16 \
--max-num-seqs 512 \
--max-model-len $CONTEXT_LENGTH \
--max-seq-len-to-capture $CONTEXT_LENGTH \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 8 \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000
```
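Once the server is up, it exposes an OpenAI-compatible API on the host and port configured above. A minimal client sketch (the base URL and model name follow the launch command; the prompt is illustrative):
```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; it does not validate the
# API key, so any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GLM-4.5-GPTQ-Int4-Int8Mix",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Briefly explain mixture-of-experts models."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```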
### 【Dependencies】
```
vllm==0.10.0
```
### 【Model Update】
```
2025-07-30
1. Initial commit
```
### 【Model Files】
| File Size | Last Updated |
|---------|--------------|
| `192GB` | `2025-07-30` |
### 【Model Download】
```python
from huggingface_hub import snapshot_download
# Downloads all model files (~192 GB); the returned path is the local snapshot
# directory and can be passed to `vllm serve` in place of the repo id.
snapshot_download('QuantTrio/GLM-4.5-GPTQ-Int4-Int8Mix', cache_dir="your_local_path")
```
### 【Overview】
# GLM-4.5
<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg" width="15%"/>
</div>
<p align="center">
👋 Join our <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community.
<br>
📖 Check out the GLM-4.5 <a href="https://z.ai/blog/glm-4.5" target="_blank">technical blog</a>.
<br>
📍 Use GLM-4.5 API services on <a href="https://docs.z.ai/guides/llm/glm-4.5">Z.ai API Platform (Global)</a> or <br> <a href="https://docs.bigmodel.cn/cn/guide/models/text/glm-4.5">Zhipu AI Open Platform (Mainland China)</a>.
<br>
👉 One click to <a href="https://chat.z.ai">GLM-4.5</a>.
</p>
## Model Introduction
The **GLM-4.5** series models are foundation models designed for intelligent agents. GLM-4.5 has **355** billion total parameters with **32** billion active parameters, while GLM-4.5-Air adopts a more compact design with **106** billion total parameters and **12** billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.
Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses.
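A sketch of selecting between the two modes at request time against the vLLM server launched above, assuming the chat template accepts an `enable_thinking` switch passed through vLLM's `chat_template_kwargs` (the flag name follows common hybrid-reasoning conventions and should be verified against the model's chat template):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Assumed toggle: many hybrid-reasoning chat templates accept an
# `enable_thinking` flag via vLLM's chat_template_kwargs; setting it to
# False requests an immediate (non-thinking) response.
response = client.chat.completions.create(
    model="GLM-4.5-GPTQ-Int4-Int8Mix",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```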
We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development.
As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of **63.2**, ranking **3rd** among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at **59.8** while maintaining superior efficiency.

For more evaluation results, showcases, and technical details, please visit
our [technical blog](https://z.ai/blog/glm-4.5). The technical report will be released soon.
The model code, tool parser and reasoning parser can be found in the implementation of [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4_moe), [vLLM](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/glm4_moe_mtp.py) and [SGLang](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/glm4_moe.py).
## Quick Start
Please refer to our [GitHub page](https://github.com/zai-org/GLM-4.5) for more details.