brandonbeiler committed
Commit cc18790 · verified · 1 Parent(s): 3f30ec0

Update README.md

Files changed (1)
  1. README.md +21 -18
README.md CHANGED
@@ -19,6 +19,26 @@ license: mit
  # 🔥 InternVL3_5-1B-FP8-Dynamic 🔥
  This is an **fp8 dynamic (w8a8)** version of [OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B), optimized for high-performance inference with vLLM.
  The model utilizes **fp8 dynamic (w8a8)** for optimal performance and deployment.
+
+ ## Just Run It (vLLM serve)
+
+ You can serve the model using vLLM's OpenAI-compatible API server.
+
+ ```bash
+ vllm serve brandonbeiler/InternVL3_5-1B-FP8-Dynamic \
+     --quantization compressed-tensors \
+     --served-model-name internvl3_5-1b \
+     --reasoning-parser qwen3 \
+     --trust-remote-code \
+     --max-model-len 32768 \
+     --tensor-parallel-size 1  # Adjust based on your GPU setup
+ ```
+ **Notes**
+ - 32k max context length
+ - reasoning parser is ready to go; a system prompt is required to run in thinking mode
+ - tool calling is still under investigation
+
+
  ## 🚀 Key Features
  - **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
  - **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
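
As a quick smoke test of the `vllm serve` command added above, here is a minimal sketch of querying the endpoint through the OpenAI-compatible API. It assumes vLLM's default address (`http://localhost:8000/v1`) and the `openai` Python client; the image URL is a placeholder.

```python
# Minimal sketch: query the server started above via its OpenAI-compatible API.
# Assumes vLLM's default endpoint (http://localhost:8000/v1); the image URL
# is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-1b",  # matches --served-model-name above
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    temperature=0.6,  # InternVL 3.5's recommended sampling temperature
    max_tokens=512,
)
print(response.choices[0].message.content)
```

With `--reasoning-parser qwen3`, thinking-mode output (when enabled via a system prompt) should come back separated into the message's `reasoning_content` field rather than mixed into `content`.
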
@@ -32,23 +52,6 @@ The model utilizes **fp8 dynamic (w8a8)** for optimal performance and deployment
  - **Quantization Method**: FP8 Dynamic (W8A8)
  - **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1
  - **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)
- ## 🔧 Usage
- ### With vLLM (Recommended)
- ```python
- from vllm import LLM, SamplingParams
-
- # Load the quantized model
- model = LLM(
-     model="brandonbeiler/InternVL3_5-1B-FP8-Dynamic",
-     trust_remote_code=True,
-     max_model_len=32768,  # internvl 3.5 is 32k max context
-     tensor_parallel_size=1,  # Adjust based on your GPU setup
- )
- # Generate response
- sampling_params = SamplingParams(temperature=0.6, max_tokens=512)  # internvl 3.5 recommends temp 0.6, especially for thinking mode
- response = model.generate("Describe this image: <image>", sampling_params)
- print(response[0].outputs[0].text)
- ```

  ## 🏗️ Technical Specifications
  ### Hardware Requirements
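
For comparison with the usage section removed above: that snippet passed a bare `<image>` placeholder without attaching any image data, so it would not yield a grounded description. Here is a minimal sketch of offline multimodal inference with current vLLM, attaching the image via `multi_modal_data`; the plain-text prompt format is an assumption, and in practice the model's chat template (e.g., via `llm.chat`) is the safer route.

```python
# Minimal sketch: offline multimodal inference with vLLM, attaching the
# image via multi_modal_data. Paths and prompt format are illustrative.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="brandonbeiler/InternVL3_5-1B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=32768,    # InternVL 3.5's 32k max context
    tensor_parallel_size=1,
)

sampling_params = SamplingParams(temperature=0.6, max_tokens=512)

outputs = llm.generate(
    {
        "prompt": "<image>\nDescribe this image.",
        "multi_modal_data": {"image": Image.open("example.jpg")},  # placeholder path
    },
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
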
@@ -63,7 +66,7 @@ print(response[0].outputs[0].text)
  This model was created using:
  ```
  llmcompressor==0.7.1
- compressed-tensors==latest
+ compressed-tensors==0.10.2
  transformers==4.55.0
  torch==2.7.1
  vllm==0.10.1.1
 
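
The pinned environment above shows the checkpoint was produced with llmcompressor 0.7.1. As a rough illustration of what an FP8-dynamic (W8A8) pass looks like with that library, here is a minimal sketch; the ignore list and save directory are assumptions, not the author's exact vision-language recipe.

```python
# Minimal sketch of an FP8-dynamic (W8A8) pass with llm-compressor.
# The ignore list and save directory are illustrative assumptions, not
# the exact recipe used for this checkpoint.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModel, AutoTokenizer

model_id = "OpenGVLab/InternVL3_5-1B"
model = AutoModel.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# FP8_DYNAMIC: FP8 weights, with activation scales computed at runtime.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*vision_model.*"],  # assumption: keep these in higher precision
)

oneshot(model=model, recipe=recipe)

save_dir = "InternVL3_5-1B-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```

Because activation scales are computed on the fly, no calibration data is passed to `oneshot`; that is what the "no calibration required" bullet above refers to.
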