Update README.md
README.md CHANGED
@@ -19,6 +19,26 @@ license: mit
# 🔥 InternVL3_5-1B-FP8-Dynamic 🔥
This is an **fp8 dynamic (w8a8)** version of [OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B), optimized for high-performance inference with vLLM.
The model utilizes **fp8 dynamic (w8a8)** for optimal performance and deployment.
+
+## Just Run It (vLLM serve)
+
+You can serve the model using vLLM's OpenAI-compatible API server.
+
+```bash
+vllm serve brandonbeiler/InternVL3_5-1B-FP8-Dynamic \
+    --quantization compressed-tensors \
+    --served-model-name internvl3_5-1b \
+    --reasoning-parser qwen3 \
+    --trust-remote-code \
+    --max-model-len 32768 \
+    --tensor-parallel-size 1  # Adjust based on your GPU setup
+```
+**Notes**
+- 32k max context length
+- Reasoning parser is ready to go; running in thinking mode requires a system prompt (see the client sketch after this diff)
+- Tool calling is still being investigated
+
+
## 🚀 Key Features
- **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
@@ -32,23 +52,6 @@ The model utilizes **fp8 dynamic (w8a8)** for optimal performance and deployment
- **Quantization Method**: FP8 Dynamic (W8A8)
- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1
- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)
-## 🔧 Usage
-### With vLLM (Recommended)
-```python
-from vllm import LLM, SamplingParams
-
-# Load the quantized model
-model = LLM(
-    model="brandonbeiler/InternVL3_5-1B-FP8-Dynamic",
-    trust_remote_code=True,
-    max_model_len=32768,  # internvl 3.5 is 32k max context
-    tensor_parallel_size=1,  # Adjust based on your GPU setup
-)
-# Generate response
-sampling_params = SamplingParams(temperature=0.6, max_tokens=512)  # internvl 3.5 recommends temp 0.6, especially for thinking mode
-response = model.generate("Describe this image: <image>", sampling_params)
-print(response[0].outputs[0].text)
-```
## 🏗️ Technical Specifications
### Hardware Requirements
@@ -63,7 +66,7 @@ print(response[0].outputs[0].text)
This model was created using:
```
llmcompressor==0.7.1
-compressed-tensors==
+compressed-tensors==0.10.2
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1
```
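The note above says thinking mode needs a system prompt. Here is a minimal sketch of querying the server started by the `vllm serve` command through its OpenAI-compatible API. The endpoint URL, the image URL, and the exact wording of the thinking-mode system prompt are assumptions (check the InternVL3.5 model card for the prompt the model actually expects); temperature 0.6 follows the recommendation in the usage example removed above.

```python
# Sketch: chat with the vLLM server started above via its OpenAI-compatible API.
# Assumes vLLM's default endpoint http://localhost:8000/v1 and the model name
# set with --served-model-name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical thinking-mode system prompt -- take the exact wording the model
# was trained with from the InternVL3.5 model card.
THINKING_SYSTEM_PROMPT = (
    "You are a helpful assistant. Think through the problem step by step "
    "before giving your final answer."
)

response = client.chat.completions.create(
    model="internvl3_5-1b",  # matches --served-model-name above
    messages=[
        {"role": "system", "content": THINKING_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},  # placeholder image
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
    temperature=0.6,  # InternVL 3.5 recommends 0.6, especially for thinking mode
    max_tokens=512,
)

message = response.choices[0].message
# With --reasoning-parser qwen3, vLLM separates the thinking trace from the
# final answer and returns it as its own field on the message.
print(getattr(message, "reasoning_content", None))
print(message.content)
```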
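For reference, here is a sketch of how an FP8-dynamic (W8A8) checkpoint like this one is typically produced with the LLM Compressor version listed above. This is not the author's exact script: the ignore patterns (keeping `lm_head` and the vision tower unquantized, in line with the vision-preserving recipe mentioned in the key features) and the save path are assumptions, while `FP8_DYNAMIC` is the library's documented data-free scheme, which is why no calibration set is needed.

```python
# Sketch (assumed, not the author's exact script): FP8 dynamic (W8A8)
# quantization with llmcompressor. FP8_DYNAMIC computes activation scales at
# runtime, so oneshot runs without a calibration dataset.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVL3_5-1B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Quantize all Linear layers to FP8 weights with dynamic FP8 activations,
# leaving the output head and (assumed module name) vision tower in full precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*vision_model.*"],  # assumed ignore patterns
)

oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format, which
# `vllm serve --quantization compressed-tensors` loads directly.
SAVE_DIR = "InternVL3_5-1B-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```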
|