---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

## Overview
OpenVLThinker-7B-v1.2 is a vision-language reasoning model built on Qwen2.5-VL-7B-Instruct. It is tuned for multimodal reasoning, with a particular focus on visual mathematical problem-solving.

For more details: [Paper](https://arxiv.org/abs/2503.17352), [GitHub](https://github.com/yihedeng9/OpenVLThinker)

## How to use
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
import torch
from qwen_vl_utils import process_vision_info

# 1. Define model and processor names
model_name = "ydeng9/OpenVLThinker-7B-v1.2"
processor_name = "Qwen/Qwen2.5-VL-7B-Instruct"

# 2. Load the OpenVLThinker-7B model and processor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package; drop this line to use the default attention
    device_map=device,
)
processor = AutoProcessor.from_pretrained(processor_name)

# 3. Define a sample image URL and an instruction
image_url = "https://example.com/sample_image.jpg"  # replace with your image URL
instruction = "Example question"

# 4. Create a multimodal prompt using a chat message structure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": instruction},
        ],
    }
]

# 5. Generate a text prompt from the chat messages
text_prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# 6. Process image (and video) inputs from the messages
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text_prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(device)

# 7. Generate the model's response (near-greedy decoding parameters)
generated_ids = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=2048,
    top_p=0.001,
    top_k=1,
    temperature=0.01,
    repetition_penalty=1.0,
)

# 8. Drop the prompt tokens, then decode only the newly generated tokens
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
generated_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# 9. Print the generated response
print("Generated Response:")
print(generated_text)
```
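
The same `model` and `processor` can also run batched inference. The sketch below is a minimal adaptation of the snippet above, following the batching pattern from the Qwen2.5-VL usage examples; the image URLs and questions are placeholders, and switching the tokenizer to left padding is a choice made here for batched generation.

```python
# Minimal batched-inference sketch; reuses `model`, `processor`, `process_vision_info`,
# and `device` from the snippet above. URLs and questions are placeholders.
batch_messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": url},
                {"type": "text", "text": question},
            ],
        }
    ]
    for url, question in [
        ("https://example.com/image_1.jpg", "Question 1"),
        ("https://example.com/image_2.jpg", "Question 2"),
    ]
]

# Left padding keeps every prompt aligned at the end of its row for generation.
processor.tokenizer.padding_side = "left"

texts = [
    processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    for msgs in batch_messages
]
image_inputs, video_inputs = process_vision_info(batch_messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for text in output_texts:
    print(text)
```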
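As a reasoning model, OpenVLThinker typically writes out its reasoning before stating a final answer. If you only need the answer, you can post-process the decoded text; the sketch below assumes the answer appears in a LaTeX `\boxed{...}` span, which is an illustrative assumption about the output format and may need adjusting for your prompts.

```python
import re

def extract_boxed_answer(text):
    # Return the contents of the last \boxed{...} span, or None if absent.
    # Assumes the final answer is boxed (an assumption, not a guaranteed format)
    # and does not handle nested braces such as \boxed{\frac{1}{2}}.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

answer = extract_boxed_answer(generated_text)
print("Extracted answer:", answer if answer is not None else "<no boxed answer found>")
```
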
### Citation
```text
@misc{deng2025openvlthinker,
      title={OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles},
      author={Yihe Deng and Hritik Bansal and Fan Yin and Nanyun Peng and Wei Wang and Kai-Wei Chang},
      year={2025},
      eprint={2503.17352},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.17352},
}
```