---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- quantized
- int4
- bitsandbytes
- qwen2.5
- chinese
- conversational
- instruction-following
language:
- zh
- en
library_name: transformers
pipeline_tag: text-generation
---

# 🚀 Qwen2.5-7B-Instruct INT4 Quantized Model

This is an INT4-quantized version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), quantized with the `bitsandbytes` library.

## 📊 Model Information

- Base model: Qwen/Qwen2.5-7B-Instruct
- Quantization type: INT4 (4-bit)
- Quantization method: BitsAndBytesConfig with NF4
- Compression ratio: ~3.5x
- GPU memory savings: ~75%

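## ⚙️ Quantization Configuration

```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matrix multiplications
    bnb_4bit_quant_storage=torch.uint8,     # dtype used to pack the 4-bit values
)
```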
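
To make the effect of `bnb_4bit_use_double_quant=True` more concrete, the sketch below estimates how many extra bits per weight are spent on quantization constants, with and without double quantization. The block sizes (64 weights per block, 256 first-level constants per second-level block) follow the description in the QLoRA paper and are assumptions here; this model card does not state them.

```python
# Back-of-the-envelope estimate of the storage overhead of quantization
# constants, in bits per weight. Block sizes are ASSUMED (QLoRA paper
# description), not taken from this model card.
WEIGHTS_PER_BLOCK = 64      # weights that share one absmax constant
CONSTANTS_PER_BLOCK = 256   # first-level constants that share one second-level constant


def overhead_bits(double_quant: bool) -> float:
    """Extra bits per weight needed to store the quantization constants."""
    if not double_quant:
        # One FP32 constant per block of weights.
        return 32 / WEIGHTS_PER_BLOCK
    # 8-bit first-level constants plus one FP32 constant per group of them.
    return 8 / WEIGHTS_PER_BLOCK + 32 / (WEIGHTS_PER_BLOCK * CONSTANTS_PER_BLOCK)


plain = overhead_bits(double_quant=False)
nested = overhead_bits(double_quant=True)
print(f"without double quantization: {4 + plain:.3f} bits per weight")
print(f"with double quantization:    {4 + nested:.3f} bits per weight")
```

Under these assumptions the overhead drops from 0.5 to roughly 0.127 bits per weight, which is a small but measurable saving on a 7B-parameter model.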

## 🚀 Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.uint8,
)

# Load the model and tokenizer
model_name = "nikodoz/qwen2.5-7b-instruct-int4"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# Inference example
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Please give an introduction to machine learning."},
]

# Render the chat messages into the prompt format the model expects
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
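
To check the memory figures on your own hardware, a small helper like the one below may be useful. It is a minimal sketch that relies on the `get_memory_footprint()` method of Transformers models; the helper name `footprint_gib` is hypothetical, and the result covers parameters and buffers only, not activations or the KV cache.

```python
def footprint_gib(model) -> float:
    """Return the memory taken by the model's parameters and buffers, in GiB.

    `model` is expected to expose `get_memory_footprint()`, as Transformers
    models do. Activations and the KV cache are not included.
    """
    return model.get_memory_footprint() / 1024**3


# Usage, with `model` loaded as in the example above:
# print(f"{footprint_gib(model):.2f} GiB")
```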

## 📈 Performance Comparison

| Metric | Original model (FP16) | Quantized model (INT4) | Change |
|------|----------------|----------------|------|
| Model size | ~14 GB | ~4 GB | ~3.5x smaller |
| GPU memory usage | ~14 GB | ~4 GB | ~75% reduction |
| Inference speed | Baseline | Slightly faster | ~10% |
| Generation quality | 100% | ~95% | Slight loss |
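
As a sanity check on the size figures, the short calculation below derives the compression ratio and memory reduction from the approximate 14 GB and 4 GB values in the table. Those values are rounded, so the results are only indicative.

```python
# Approximate sizes taken from the comparison table (rounded values).
fp16_gb = 14.0
int4_gb = 4.0

ratio = fp16_gb / int4_gb                     # how many times smaller
savings_pct = (1 - int4_gb / fp16_gb) * 100   # reduction implied by the table

# Ideal reduction if every 16-bit weight became exactly 4 bits.
ideal_savings_pct = (1 - 4 / 16) * 100

print(f"compression ratio: {ratio:.1f}x")
print(f"reduction from table values: {savings_pct:.1f}%")
print(f"ideal 16-bit to 4-bit reduction: {ideal_savings_pct:.0f}%")
```

With these rounded inputs the reduction works out to roughly 71%, while the ~75% figure matches the ideal 16-bit to 4-bit ratio. One plausible reason for the gap is that quantization constants and any layers kept in higher precision add some overhead.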

## 🔧 Requirements

- Python >= 3.8
- PyTorch >= 2.0.0
- transformers >= 4.40.0
- bitsandbytes >= 0.43.0
- CUDA >= 11.0
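
The requirements above can be checked with a short script. The following is a minimal sketch using only the standard library; the helper names `parse` and `check_requirements` are hypothetical, and the version parsing is deliberately simple (it compares only the first three numeric components and ignores pre-release tags).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements list.
MINIMUMS = {"torch": "2.0.0", "transformers": "4.40.0", "bitsandbytes": "0.43.0"}


def parse(v: str) -> tuple:
    """Turn a version such as '2.1.0+cu118' into (2, 1, 0)."""
    nums = []
    for piece in v.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        nums.append(int(match.group()) if match else 0)
    while len(nums) < 3:  # pad '4.40' to (4, 40, 0)
        nums.append(0)
    return tuple(nums)


def check_requirements(minimums=MINIMUMS) -> dict:
    """Map each package name to a short status string."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = parse(installed) >= parse(minimum)
        report[name] = f"{installed} ({'ok' if ok else 'needs >= ' + minimum})"
    return report


for name, status in check_requirements().items():
    print(f"{name}: {status}")
```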

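## 💡 Notes

1. The model is quantized the first time it is loaded, which may take a few minutes.
2. A CUDA environment supported by bitsandbytes is required.
3. Quantization causes a slight loss of accuracy, but GPU memory usage drops significantly.
4. The model is well suited to deploying a large language model in resource-constrained environments.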
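
Before loading the model, it can help to confirm that the environment matches the second note. The sketch below is one hedged way to do that; the function name `cuda_preflight` is hypothetical, and it only checks that PyTorch can see a CUDA device and that `bitsandbytes` is importable, not that the two are compatible with each other.

```python
import importlib.util


def cuda_preflight() -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch

        if not torch.cuda.is_available():
            problems.append("PyTorch cannot see a CUDA device")
    if importlib.util.find_spec("bitsandbytes") is None:
        problems.append("bitsandbytes is not installed")
    return problems


issues = cuda_preflight()
print("environment looks ready" if not issues else "\n".join(issues))
```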

## 📄 License

This model is based on the original Qwen2.5 model and is released under the Apache-2.0 license.

## 🙏 Acknowledgments

- The [Qwen team](https://github.com/QwenLM/Qwen2.5) for the excellent base model
- [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for the quantization technology
- [Hugging Face](https://huggingface.co) for the model hosting platform