---
license: apache-2.0
inference: false
pipeline_tag: image-text-to-text
datasets:
- liuhaotian/LLaVA-Pretrain
- lmms-lab/LLaVA-ReCap-CC12M
- lmms-lab/LLaVA-NeXT-Data
---
<br>
<br>
# ViCToR Model Card
## Model details
**Paper or resources for more information:**
https://github.com/deepglint/Victor
**Where to send questions or comments about the model:**
https://github.com/deepglint/Victor/issues
## Results
| Benchmark | ViCToR-7B | LLaVA-1.5-13B | LLaVA-NeXT-8B | Ross |
| ---------------- | --------- | ------------- | ------------- | ---- |
| MMStar | **54.3** | 34.3 | 43.9 | 53.9 |
| RealWorldQA | **65.6** | 55.3 | 58.4 | 58.7 |
| MMBench^(cn,val) | **79.0** | 67.8 | – | – |
| OCRBench | 556 | 337 | 531 | 553 |
| POPE | 88.4 | 88.4 | 87.1 | 88.1 |
| MMMU | 48.9 | 37.0 | 43.1 | 49.0 |
| AI2D | 79.5 | 61.1 | 72.8 | 79.5 |
| MME | 2071 | 1781 | 1908 | 1854 |
| SEED^(f) | **75.7** | 68.2 | 72.5 | 73.6 |
## Citation
```
@inproceedings{Xie2024ViCToRIV,
  title={ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs},
  author={Yin Xie and Kaicheng Yang and Peirou Liang and Xiang An and Yongle Zhao and Yumeng Wang and Ziyong Feng and Roy Miles and Ismail Elezi and Jiankang Deng},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:273482504}
}
```