---
license: apache-2.0
inference: false
pipeline_tag: image-text-to-text
datasets:
- liuhaotian/LLaVA-Pretrain
- lmms-lab/LLaVA-ReCap-CC12M
- lmms-lab/LLaVA-NeXT-Data
---

<br>
<br>

# ViCToR Model Card

## Model details

**Paper or resources for more information:**
https://github.com/deepglint/Victor


**Where to send questions or comments about the model:**
https://github.com/deepglint/Victor/issues


## Results
| Benchmark        | ViCToR-7B | LLaVA-1.5-13B | LLaVA-NeXT-8B | Ross |
| ---------------- | --------- | ------------- | ------------- | ---- |
| MMStar           | **54.3**  | 34.3          | 43.9          | 53.9 |
| RealWorldQA      | **65.6**  | 55.3          | 58.4          | 58.7 |
| MMBench^(cn,val) | **79.0**  | 67.8          | –             | –    |
| OCRBench         | 556       | 337           | 531           | 553  |
| POPE             | 88.4      | 88.4          | 87.1          | 88.1 |
| MMMU             | 48.9      | 37.0          | 43.1          | 49.0 |
| AI2D             | 79.5      | 61.1          | 72.8          | 79.5 |
| MME              | 2071      | 1781          | 1908          | 1854 |
| SEED^(f)         | **75.7**  | 68.2          | 72.5          | 73.6 |
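
The comparison in the table can be tallied with a short script. The sketch below is illustrative only and is not part of the ViCToR codebase: the names `MODELS`, `RESULTS`, and `leaders` are hypothetical, the scores are copied from the table above, and `None` stands for a score the table does not report. It assumes that higher is better on every benchmark.

```python
# Illustrative sketch: find the top-scoring model(s) per benchmark.
# MODELS, RESULTS and leaders() are hypothetical names, not a ViCToR API.
# Assumption: higher is better on every benchmark listed here.

MODELS = ("ViCToR-7B", "LLaVA-1.5-13B", "LLaVA-NeXT-8B", "Ross")

# Scores copied from the Results table; None marks an unreported score.
RESULTS = {
    "MMStar": (54.3, 34.3, 43.9, 53.9),
    "RealWorldQA": (65.6, 55.3, 58.4, 58.7),
    "MMBench (cn, val)": (79.0, 67.8, None, None),
    "OCRBench": (556, 337, 531, 553),
    "POPE": (88.4, 88.4, 87.1, 88.1),
    "MMMU": (48.9, 37.0, 43.1, 49.0),
    "AI2D": (79.5, 61.1, 72.8, 79.5),
    "MME": (2071, 1781, 1908, 1854),
    "SEED (f)": (75.7, 68.2, 72.5, 73.6),
}


def leaders(results):
    """Map each benchmark to the list of models sharing the best reported score."""
    out = {}
    for bench, scores in results.items():
        best = max(s for s in scores if s is not None)
        out[bench] = [m for m, s in zip(MODELS, scores) if s == best]
    return out


if __name__ == "__main__":
    for bench, top in leaders(RESULTS).items():
        print(f"{bench}: {', '.join(top)}")
```

Ties are reported as shared leads, so a benchmark such as POPE lists every model that reaches the top score.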

## Citation
```
@inproceedings{Xie2024ViCToRIV,
  title={ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs},
  author={Yin Xie and Kaicheng Yang and Peirou Liang and Xiang An and Yongle Zhao and Yumeng Wang and Ziyong Feng and Roy Miles and Ismail Elezi and Jiankang Deng},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:273482504}
}
```