---
license: apache-2.0
inference: false
pipeline_tag: image-text-to-text
datasets:
- liuhaotian/LLaVA-Pretrain
- lmms-lab/LLaVA-ReCap-CC12M
- lmms-lab/LLaVA-NeXT-Data
---

# ViCToR Model Card

## Model details

**Paper or resources for more information:**
https://github.com/deepglint/Victor

**Where to send questions or comments about the model:**
https://github.com/deepglint/Victor/issues

## Results

| Benchmark        | ViCToR-7B | LLaVA-1.5-13B | LLaVA-NeXT-8B | Ross |
| ---------------- | --------- | ------------- | ------------- | ---- |
| MMStar           | **54.3**  | 34.3          | 43.9          | 53.9 |
| RealWorldQA      | **65.6**  | 55.3          | 58.4          | 58.7 |
| MMBench^(cn,val) | **79.0**  | 67.8          | –             | –    |
| OCRBench         | 556       | 337           | 531           | 553  |
| POPE             | 88.4      | 88.4          | 87.1          | 88.1 |
| MMMU             | 48.9      | 37.0          | 43.1          | 49.0 |
| AI2D             | 79.5      | 61.1          | 72.8          | 79.5 |
| MME              | 2071      | 1781          | 1908          | 1854 |
| SEED^(f)         | **75.7**  | 68.2          | 72.5          | 73.6 |

## Citation

```
@inproceedings{Xie2024ViCToRIV,
  title={ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs},
  author={Yin Xie and Kaicheng Yang and Peirou Liang and Xiang An and Yongle Zhao and Yumeng Wang and Ziyong Feng and Roy Miles and Ismail Elezi and Jiankang Deng},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:273482504}
}
```