---
license: other
license_name: hyperclovax-seed
license_link: LICENSE
library_name: transformers
---

## **Overview**
HyperCLOVAX-SEED-Vision-Instruct-3B is a model developed by NAVER, built upon its proprietary backbone model and fine-tuned through post-training. It can understand both text and images, and it generates text.
The model is designed with a focus on a lightweight architecture that optimizes computational efficiency. In terms of visual understanding, it handles visual question answering (VQA), chart and diagram interpretation, and general visual content comprehension. HyperCLOVAX-SEED-Vision-Instruct-3B targets a Pareto-optimal balance tuned specifically for the Korean language, and at inference time it delivers competitive performance while using fewer visual tokens than other models of similar size.
In particular, the model shows relative strengths on Korean-language inputs and outperforms similarly sized open-source models on related benchmarks. As the first open-source vision-language model from Korea capable of visual understanding, it is expected to contribute significantly to strengthening Korea's sovereign AI capabilities.
## **Updates**
- **(2025.07.25)**: The vLLM engine is available via [our repository](https://github.com/NAVER-Cloud-HyperCLOVA-X/vllm/tree/v0.9.2rc2_hyperclovax_vision_seed).
- **(2025.07.08)**: Major code update to support the vLLM engine ([related discussion](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B/discussions/27)).
- **(2025.04.22)**: Initial release of the repository.
## **Basic Information**
- **Model Architecture**: LLaVA-based Vision-Language Model
- **LLM Module**: Transformer-based architecture (Dense Model)
- **Vision Encoder**: SigLIP-based architecture with 378x378px input resolution per grid.
- **Vision-Language Connector**: C-Abstractor-based architecture with the AnyRes mechanism, supporting up to 1.29M total pixels across 9 grids.
- **Parameter Count**: 3.2B (LLM Module) + 0.43B (Vision Module)
- **Input/Output Format**: Text + Image + Video / Text
- **Context Length**: 16k
- **Knowledge Cutoff Date**: The model was trained on data collected before August 2024.
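As a quick way to verify the LLM/vision module split described above, the configuration and parameter count can be inspected directly. This is a minimal sketch that assumes only the standard `transformers` auto classes; the exact sub-config field names come from the model's remote code and may differ.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Minimal sketch: inspect the published config and count parameters.
# (Sub-config field names come from the model's remote code and may differ.)
model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"

config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
print(config)  # shows the LLM, vision-encoder, and connector sub-configs

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
total_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {total_params / 1e9:.2f}B")  # expected around 3.6B (3.2B LLM + 0.43B vision)
```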
## **Training**
#### **Text**
Securing high-quality data is essential even during post-training, but having humans manually create or revise large-scale datasets poses significant limitations in terms of both cost and resources. Tasks requiring domain expertise are also difficult to handle this way, and the risk of human error is high. To overcome these challenges, we used an automated validation system powered by HyperCLOVA X, which improved data quality and streamlined the training process, ultimately leading to better overall model performance. As a result, the model showed significant improvements in areas with definitive answers, such as mathematics and coding.
While reducing the cost of data collection is important, finding efficient training strategies is equally critical. HyperCLOVAX-SEED-Vision-Instruct-3B was developed from HyperCLOVAX-SEED-Text-Base-3B by applying both Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), the latter based on GRPO, an online reinforcement learning algorithm.
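For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage at the heart of the algorithm. It is a conceptual illustration only, not the actual HyperCLOVA X training code, and the rewards are made-up numbers.

```python
# Conceptual sketch of GRPO's group-relative advantage (illustrative only,
# not the actual training code). For one prompt, a group of responses is
# sampled and scored; each response's advantage is its reward normalized
# by the group mean and standard deviation, avoiding a learned value model.
rewards = [0.2, 0.9, 0.5, 0.7]  # hypothetical rewards for G sampled responses

mean_r = sum(rewards) / len(rewards)
std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5
advantages = [(r - mean_r) / (std_r + 1e-8) for r in rewards]

print(advantages)  # responses above the group mean get positive advantages
```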
#### **Vision**
The vision understanding capability, where the model receives images and questions as input and generates text answers, was not part of the initial design of HyperCLOVA X. The model architecture was therefore carefully extended to handle vision-related tasks, such as image-based question answering (VQA) and chart/diagram interpretation, without compromising the existing performance of the HCX LLM. Special attention was given to how auxiliary information in the input is handled, especially with respect to context length.
Although HyperCLOVAX-SEED-Vision-Instruct-3B is a lightweight model, it can perform basic image VQA and supports OCR-free processing. A key focus for this 3B model was optimizing the efficiency of video input tokens: since input token length directly affects computational cost, the number of tokens extracted per frame was carefully tuned to enable efficient video understanding with as few tokens as possible (see the sketch below). Additionally, during the RLHF phase, vision-specific V-RLHF data was used to enhance the model's learning, just as in the text domain.
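As a rough illustration of why the per-frame token count matters, the arithmetic below relates the per-video token budget reported in the benchmark table to an approximate per-frame token count. This is illustrative only, not an official formula.

```python
# Illustrative arithmetic only (not an official formula): relate the
# per-video token budget reported in the benchmark table below
# (1856 tokens over 108 frames) to an approximate per-frame token count.
max_video_tokens = 1856
num_frames = 108

tokens_per_frame = max_video_tokens / num_frames
print(f"~{tokens_per_frame:.1f} visual tokens per frame")  # roughly 17 tokens per frame
```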
## Benchmark
#### Text
| **Model** | **KMMLU (5-shot, acc)** | **HAE-RAE (5-shot, acc)** | **CLiCK (5-shot, acc)** | **KoBEST (5-shot, acc)** |
|----------------------------|--------|---------|---------|-------|
| HyperCLOVAX-SEED-Text-Base-3B | 0.4847 | 0.7635 | 0.6386 | 0.7792 |
| HyperCLOVAX-SEED-Vision-Instruct-3B| 0.4422 | 0.6499 | 0.5599 | 0.7180 |
| Qwen2.5-3B-instruct | 0.4451 | 0.6031 | 0.5649 | 0.7053 |
| gemma-3-4b-it | 0.3895 | 0.6059 | 0.5303 | 0.7262 |
#### Vision
| Model Name | Max Token Count per Video | VideoMME (Ko) | NAVER-TV-CLIP (Ko) | VideoChatGPT (Ko) | PerceptionTest (En) | ActivityNet-QA (En) | KoNet (Ko) | MMBench-Val (En) | TextVQA-Val (En) | Korean VisIT-Bench (Ko) | Image (4 benchmarks) | Video (5 benchmarks) | All (9 benchmarks) |
|-----------------------------------|--------------------------------|----------------|---------------------|--------------------|-----------------------|----------------------|------------|-------------------|-------------------|--------------------------|------------------------|------------------------|----------------------|
| HyperCLOVAX-SEED-Vision-Instruct-3B | 1856 tokens, 108 frames | 48.2 | 61.0 | 53.6 | 55.2 | 50.6 | 69.2 | 81.8 | 79.2 | 37.0 | 46.68 | 53.70 | 59.54 |
| HyperCLOVAX-SEED-Vision-Instruct-3B (without OCR)| 1856 tokens, 108 frames | 48.2 | 61.0 | 53.6 | 55.2 | 50.6 | 36.6 | 80.7 | 76.0 | 43.5 | 56.74 | 53.70 | 55.05 |
| Qwen-2.5-VL-3B | 24576 tokens, 768 frames | 55.1 | 48.3 | 45.6 | 66.9 | 55.7 | 58.3 | 84.3 | 79.6 | 81.5 | 59.35 | 54.31 | 56.55 |
| Qwen-2.5-VL-3B (w/ 2000 tokens) | 2000 tokens, 128 frames | 50.3 | 43.9 | 44.3 | 58.3 | 54.2 | 58.5 | 84.3 | 79.3 | 15.7 | 59.50 | 50.18 | 54.33 |
| Qwen-2.5-VL-7B | 24576 tokens, 768 frames | 60.6 | 66.7 | 51.8 | 70.5 | 56.6 | 68.4 | 88.3 | 84.9 | 85.6 | 69.34 | 61.23 | 64.84 |
| Gemma-3-4B | 4096 tokens, 16 frames | 45.4 | 36.8 | 57.1 | 50.6 | 46.3 | 25.0 | 79.2 | 58.9 | 32.3 | 48.91 | 47.24 | 47.98 |
| GPT4V (gpt-4-turbo-2024-04-09) | Unknown, original image, 8 frames | 49.1 | 75.0 | 55.5 | 57.4 | 45.7 | 38.7 | 84.2 | 60.4 | 52.0 | 58.88 | 51.59 | 54.83 |
| GPT4o (gpt-4o-2024-08-06) | Unknown, 512 resize, 128 frames| 61.6 | 66.6 | 61.8 | 50.2 | 41.7 | 60.6 | 84.2 | 73.2 | 50.5 | 67.15 | 56.42 | 61.19 |
| InternV-2-2B | 4096 tokens, 16 frames | 28.9 | 21.1 | 40.2 | 50.5 | 50.3 | 3.3 | 79.3 | 75.1 | 51.1 | 39.74 | 38.19 | 38.88 |
| InternV-2-4B | 4096 tokens, 16 frames | 33.8 | 36.0 | 22.8 | 54.2 | 52.0 | 22.7 | 83.0 | 76.9 | 51.6 | 46.11 | 39.75 | 42.58 |
| InternV-2-8B | 4096 tokens, 16 frames | 43.7 | 41.2 | 32.4 | 58.5 | 53.2 | 28.5 | 86.6 | 79.0 | 97.0 | 50.32 | 45.79 | 47.81 |
## Dependencies
- [einops](https://einops.rocks/)
- [timm](https://github.com/huggingface/pytorch-image-models)
- [av](https://github.com/PyAV-Org/PyAV)
- [decord](https://github.com/dmlc/decord)
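A quick import check can confirm these extras are present before running the examples below. This is a small sketch that assumes the PyPI package names match the project names listed above.

```python
import importlib

# Small sanity check (assumes the PyPI package names match the projects above).
for pkg in ("einops", "timm", "av", "decord"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError:
        print(f"{pkg}: missing - try `pip install {pkg}`")
```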
## Example
**Code and benchmark scores were checked with transformers 4.52.4.**
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device="cuda")
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# LLM Example
# It is recommended to use the chat template with HyperCLOVAX models.
# Using the chat template allows you to easily format your input in ChatML style.
llm_chat = [
    {"role": "system", "content": [{"type": "text", "text": "you are helpful assistant!"}]},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Hello, how are you?"},
            {"type": "text", "text": "I said. Hello, how are you today?"},
        ]
    },
    {"role": "assistant", "content": [{"type": "text", "text": "I'm doing great. How can I help you today?"}]},
    {"role": "user", "content": [{"type": "text", "text": "I'd like to show off how chat templating works!"}]},
]
model_inputs = processor.apply_chat_template(
    llm_chat, tokenize=True, return_dict=True, return_tensors="pt", add_generation_prompt=True
)
model_inputs = model_inputs.to(device="cuda")
# Please adjust parameters like top_p appropriately for your use case.
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.6,
    temperature=0.5,
    repetition_penalty=1.0,
)
print("=" * 80)
print("LLM EXAMPLE")
print(processor.batch_decode(output_ids)[0])
print("=" * 80)
# VLM Example
# For images and videos, you can use url, local_path, base64, or bytes as input sources.
vlm_chat = [
    {"role": "system", "content": [{"text": "System Prompt", "type": "text"}]},
    {"role": "user", "content": [{"text": "User Text Prompt 1", "type": "text"}]},
    {
        "role": "user",
        "content": [{
            "filename": "tradeoff_sota.png",
            "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff_sota.png?raw=true",
            "lens_keywords": "Gucci Ophidia, cross bag, Ophidia small, GG, Supreme shoulder bag",
            "lens_local_keywords": "[0.07, 0.21, 0.92, 0.90] Gucci Ophidia",
            "ocr": "List the words in the image in raster order. Even if the word order feels unnatural for reading, the model will handle it as long as it follows raster order.",
            "type": "image",
        }],
    },
    {
        "role": "user",
        "content": [{
            "filename": "tradeoff.png",
            "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true",
            "type": "image",
        }],
    },
    {"role": "assistant", "content": [{"text": "Assistant Text Prompt 1", "type": "text"}]},
    {"role": "user", "content": [{"text": "User Text Prompt 2", "type": "text"}]},
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "freenaturestock-rolling-mist-clouds.mp4",
                "lens_keywords": "Prada re-edition, nylon bag, mini cross bag, logo strap, essential shoulder bag",
                "lens_local_keywords": "[0.12, 0.34, 0.85, 0.76] Prada re-edition",
                "speech_to_text": "Please enter the dialogue, voice, sound, lines, and words in the video in text format.",
            },
            {"text": "User Text Prompt 3", "type": "text"},
        ]
    },
]
model_inputs = processor.apply_chat_template(
    vlm_chat, tokenize=True, return_dict=True, return_tensors="pt", add_generation_prompt=True,
)
model_inputs = model_inputs.to(device="cuda")
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.6,
    temperature=0.5,
    repetition_penalty=1.0,
)
print("=" * 80)
print("VLM EXAMPLE")
print(processor.batch_decode(output_ids)[0])
print("=" * 80)
```
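If GPU memory is tight, the model can also be loaded in half precision with automatic device placement. This is an optional variation that is not part of the original example; `torch_dtype` and `device_map` are standard `transformers` loading arguments.

```python
import torch
from transformers import AutoModelForCausalLM

# Optional variation (not from the original example): load in bfloat16 with
# automatic device placement to reduce GPU memory usage, if supported.
model = AutoModelForCausalLM.from_pretrained(
    "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```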
## Example for v0.1.0
**Code and benchmark scores were checked with transformers 4.45.0.**
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"
revision = "v0.1.0"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, revision=revision).to(device="cuda")
preprocessor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True, revision=revision)
tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
# LLM Example
# It is recommended to use the chat template with HyperCLOVAX models.
# Using the chat template allows you to easily format your input in ChatML style.
chat = [
    {"role": "system", "content": "you are helpful assistant!"},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt", tokenize=True)
input_ids = input_ids.to(device="cuda")
# Please adjust parameters like top_p appropriately for your use case.
output_ids = model.generate(
    input_ids,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.6,
    temperature=0.5,
    repetition_penalty=1.0,
)
print("=" * 80)
print("LLM EXAMPLE")
print(tokenizer.batch_decode(output_ids)[0])
print("=" * 80)
# VLM Example
# For image and video inputs, you can use url, local_path, base64, or bytes.
vlm_chat = [
    {"role": "system", "content": {"type": "text", "text": "System Prompt"}},
    {"role": "user", "content": {"type": "text", "text": "User Text 1"}},
    {
        "role": "user",
        "content": {
            "type": "image",
            "filename": "tradeoff_sota.png",
            "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff_sota.png?raw=true",
            "ocr": "List the words in the image in raster order. Even if the word order feels unnatural for reading, the model will handle it as long as it follows raster order.",
            "lens_keywords": "Gucci Ophidia, cross bag, Ophidia small, GG, Supreme shoulder bag",
            "lens_local_keywords": "[0.07, 0.21, 0.92, 0.90] Gucci Ophidia",
        }
    },
    {
        "role": "user",
        "content": {
            "type": "image",
            "filename": "tradeoff.png",
            "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true",
        }
    },
    {"role": "assistant", "content": {"type": "text", "text": "Assistant Text 1"}},
    {"role": "user", "content": {"type": "text", "text": "User Text 2"}},
    {
        "role": "user",
        "content": {
            "type": "video",
            "filename": "rolling-mist-clouds.mp4",
            "video": "freenaturestock-rolling-mist-clouds.mp4",
        }
    },
    {"role": "user", "content": {"type": "text", "text": "User Text 3"}},
]
new_vlm_chat, all_images, is_video_list = preprocessor.load_images_videos(vlm_chat)
preprocessed = preprocessor(all_images, is_video_list=is_video_list)
input_ids = tokenizer.apply_chat_template(
    new_vlm_chat, return_tensors="pt", tokenize=True, add_generation_prompt=True,
)
output_ids = model.generate(
    input_ids=input_ids.to(device="cuda"),
    max_new_tokens=8192,
    do_sample=True,
    top_p=0.6,
    temperature=0.5,
    repetition_penalty=1.0,
    **preprocessed,
)
print("=" * 80)
print("VLM EXAMPLE")
print(tokenizer.batch_decode(output_ids)[0])
print("=" * 80)
```
- To achieve the best image understanding performance, it is recommended to include auxiliary information such as Optical Character Recognition (OCR) results and entity recognition (Lens) results. The usage examples above assume that OCR and Lens results are available; providing inputs in this format can significantly improve output quality (see the sketch below).
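When OCR or Lens results are not available, the auxiliary keys can simply be omitted from the image entry. The sketch below shows such a minimal image turn; per the note above, image-understanding quality is expected to be lower without these fields.

```python
# Minimal image turn without the auxiliary "ocr" / "lens_*" fields
# (a sketch; quality is expected to be lower than with OCR/Lens inputs).
image_turn = {
    "role": "user",
    "content": [{
        "type": "image",
        "filename": "tradeoff.png",
        "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true",
    }],
}
```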
## vLLM
To speed up your inference, you can use the vLLM engine from [our repository](https://github.com/NAVER-Cloud-HyperCLOVA-X/vllm/tree/v0.9.2rc2_hyperclovax_vision_seed).
Make sure to switch to the `v0.9.2rc2_hyperclovax_vision_seed` branch.
**Launch API server**:
- https://oss.navercorp.com/HYPERSCALE-AI-VISION/vllm/blob/main/README.md
**Request Example**:
- https://github.com/vllm-project/vllm/pull/20931#issue-3229161410
**Offline Inference Examples**:
- https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/vision_language.py
- https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/vision_language_multi_image.py |
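Once the server is running, requests can be sent through vLLM's OpenAI-compatible chat completions endpoint. The snippet below is a minimal sketch that assumes a local server on port 8000; see the linked request example for the exact payload format this model expects.

```python
import requests

# Minimal sketch of a request to a locally running vLLM OpenAI-compatible
# server (assumes the server listens on localhost:8000; see the linked
# request example above for the exact payload this model expects).
payload = {
    "model": "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true"}},
            ],
        },
    ],
    "max_tokens": 64,
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```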