|
--- |
|
language: |
|
- en |
|
- zh |
|
tags: |
|
- fp8 |
|
- quantization |
|
- dynamic |
|
- vision-language |
|
- multimodal |
|
- vllm |
|
- llm-compressor |
|
- internvl3.5 |
|
base_model: OpenGVLab/InternVL3_5-1B |
|
base_model_relation: quantized |
|
pipeline_tag: image-text-to-text |
|
inference: false |
|
license: mit |
|
--- |
|
# 🔥 InternVL3_5-1B-FP8-Dynamic 🔥 |
|
This is an **FP8 dynamic (W8A8)** quantization of [OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B), optimized for high-performance inference with vLLM.

FP8 dynamic quantization requires no calibration data: weights are pre-quantized to FP8 and activations are quantized on the fly at inference time, roughly halving memory use relative to the FP16 original.
|
|
|
## Just Run It (vLLM serve) |
|
|
|
You can serve the model using vLLM's OpenAI-compatible API server. |
|
|
|
```bash
vllm serve brandonbeiler/InternVL3_5-1B-FP8-Dynamic \
  --quantization compressed-tensors \
  --served-model-name internvl3_5-1b \
  --reasoning-parser qwen3 \
  --trust-remote-code \
  --max-model-len 32768 \
  --tensor-parallel-size 1  # Adjust based on your GPU setup
```
|
**Notes**

- 32k max context length

- Reasoning parser is ready to go; running in thinking mode requires a system prompt

- Tool calling is still under investigation
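
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using the official `openai` Python client, assuming the server runs on `localhost:8000` with the `--served-model-name` above; the image URL and prompt are placeholders:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the API key is required by the client but unused
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-1b",  # matches --served-model-name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```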
|
|
|
|
|
## 🚀 Key Features |
|
- **FP8 Dynamic Quantization**: No calibration required, ready to use immediately

- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding

- **vLLM Ready**: Seamless integration with vLLM for production deployment (offline example below)

- **Memory Efficient**: ~50% memory reduction compared to the FP16 original

- **Performance Boost**: Significantly faster inference on H100/L40S GPUs
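
For offline (in-process) inference rather than a server, vLLM's Python API works as well. A minimal sketch; the image URL is a placeholder:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="brandonbeiler/InternVL3_5-1B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=32768,
)

# llm.chat() accepts OpenAI-style multimodal messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
outputs = llm.chat(messages, SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```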
|
## 📊 Model Details |
|
- **Original Model**: [OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B)

- **Quantized Model**: InternVL3_5-1B-FP8-Dynamic
|
- **Quantization Method**: FP8 Dynamic (W8A8) |
|
- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 |
|
- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler) |
|
|
|
## 🏗️ Technical Specifications |
|
### Hardware Requirements |
|
- **Inference**: ? VRAM (+ VRAM for context) |
|
- **Supported GPUs**: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism) |
|
- **GPU Architecture**: Recent NVIDIA GPUs (Ada Lovelace and later) and recent AMD GPUs; NVIDIA compute capability >= 9.0 (Hopper, Blackwell) is recommended for native FP8 support
|
### Quantization Details |
|
- **Weights**: FP8 E4M3 with static per-channel scales

- **Activations**: FP8 E4M3 with dynamic per-token scales computed at runtime

- **Preserved Components**: Vision tower, embeddings, and the `mlp1` multimodal projector are kept in their original precision
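
For reference, a quantization run of this kind with LLM Compressor looks roughly like the sketch below. This is a hedged reconstruction, not the exact script used: the `ignore` patterns for the vision tower and `mlp1` projector are illustrative and may not match this repository's module names verbatim.

```python
from transformers import AutoModel
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the original model (InternVL requires trust_remote_code)
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL3_5-1B", torch_dtype="auto", trust_remote_code=True
)

# FP8_DYNAMIC: static per-channel FP8 weights, dynamic per-token FP8 activations.
# No calibration dataset is needed, so oneshot runs without data.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    # Assumed patterns: keep vision tower, projector, and lm_head in high precision
    ignore=["re:.*vision_model.*", "re:.*mlp1.*", "lm_head"],
)

oneshot(model=model, recipe=recipe)
model.save_pretrained("InternVL3_5-1B-FP8-Dynamic", save_compressed=True)
```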
|
## 🔬 Package Versions |
|
This model was created using: |
|
```
llmcompressor==0.7.1
compressed-tensors==0.10.2
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1
```
|
|
|
*Quantized with ❤️ using LLM Compressor for the open-source community* |
|
|