|
--- |
|
language: |
|
- en |
|
- zh |
|
tags: |
|
- fp8 |
|
- quantization |
|
- dynamic |
|
- vision-language |
|
- multimodal |
|
- vllm |
|
- llm-compressor |
|
- internvl3.5 |
|
base_model: OpenGVLab/InternVL3_5-1B |
|
base_model_relation: quantized |
|
pipeline_tag: image-text-to-text |
|
inference: false |
|
license: mit |
|
--- |
|
# 🔥 InternVL3_5-1B-FP8-Dynamic 🔥 |
|
This is an **FP8 dynamic (W8A8)** quantization of [OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B), optimized for high-performance inference with vLLM.

FP8 dynamic quantization requires no calibration data: weights are pre-quantized to FP8 and activations are quantized on the fly at inference time, roughly halving memory use relative to the FP16 original.
|
|
|
## Just Run It (vLLM serve) |
|
|
|
You can serve the model using vLLM's OpenAI-compatible API server. |
|
|
|
```bash
vllm serve brandonbeiler/InternVL3_5-1B-FP8-Dynamic \
  --quantization compressed-tensors \
  --served-model-name internvl3_5-1b \
  --reasoning-parser qwen3 \
  --trust-remote-code \
  --max-model-len 32768 \
  --tensor-parallel-size 1  # Adjust based on your GPU setup
```
|
**Notes**

- 32k max context length

- Reasoning parser is ready to go; running in thinking mode requires a system prompt

- Tool calling is still under investigation
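
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using the official `openai` Python client, assuming the server runs on `localhost:8000` with the `--served-model-name` above; the image URL and prompt are placeholders:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the API key is required by the client but unused
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-1b",  # matches --served-model-name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```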
|
|
|
|
|
## 🚀 Key Features |
|
- **FP8 Dynamic Quantization**: No calibration required, ready to use immediately

- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding

- **vLLM Ready**: Seamless integration with vLLM for production deployment (offline example below)

- **Memory Efficient**: ~50% memory reduction compared to the FP16 original

- **Performance Boost**: Significantly faster inference on H100/L40S GPUs
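
For offline (in-process) inference rather than a server, vLLM's Python API works as well. A minimal sketch; the image URL is a placeholder:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="brandonbeiler/InternVL3_5-1B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=32768,
)

# llm.chat() accepts OpenAI-style multimodal messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
outputs = llm.chat(messages, SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```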
|
## 📊 Model Details |
|
- **Original Model**: [OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B)

- **Quantized Model**: InternVL3_5-1B-FP8-Dynamic
|
- **Quantization Method**: FP8 Dynamic (W8A8) |
|
- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 |
|
- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler) |
|
|
|
## 🏗️ Technical Specifications |
|
### Hardware Requirements |
|
- **Inference**: ? VRAM (+ VRAM for context) |
|
- **Supported GPUs**: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism) |
|
- **GPU Architecture**: Recent NVIDIA GPUs (Ada Lovelace and later) and recent AMD GPUs; NVIDIA compute capability >= 9.0 (Hopper, Blackwell) is recommended for native FP8 support
|
### Quantization Details |
|
- **Weights**: FP8 E4M3 with static per-channel scales

- **Activations**: FP8 E4M3 with dynamic per-token scales computed at runtime

- **Preserved Components**: Vision tower, embeddings, and the `mlp1` multimodal projector are kept in their original precision
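
For reference, a quantization run of this kind with LLM Compressor looks roughly like the sketch below. This is a hedged reconstruction, not the exact script used: the `ignore` patterns for the vision tower and `mlp1` projector are illustrative and may not match this repository's module names verbatim.

```python
from transformers import AutoModel
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the original model (InternVL requires trust_remote_code)
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL3_5-1B", torch_dtype="auto", trust_remote_code=True
)

# FP8_DYNAMIC: static per-channel FP8 weights, dynamic per-token FP8 activations.
# No calibration dataset is needed, so oneshot runs without data.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    # Assumed patterns: keep vision tower, projector, and lm_head in high precision
    ignore=["re:.*vision_model.*", "re:.*mlp1.*", "lm_head"],
)

oneshot(model=model, recipe=recipe)
model.save_pretrained("InternVL3_5-1B-FP8-Dynamic", save_compressed=True)
```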
|
## 🔬 Package Versions |
|
This model was created using: |
|
```
llmcompressor==0.7.1
compressed-tensors==0.10.2
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1
```
|
|
|
*Quantized with ❤️ using LLM Compressor for the open-source community* |
|
|