---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- llava
- multimodal
- vision-language-model
- instruction-following
- visual-question-answering
- llama
- llama-3.1
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
---
# LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
This model is `LLaVA_MORE-llama_3_1-8B-finetuning`, part of the LLaVA-MORE family of Multimodal Large Language Models (MLLMs) presented in the paper *LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning*.
LLaVA-MORE integrates recent language models with diverse visual backbones. The project employs a unified training protocol applied consistently across all architectures to ensure fair comparisons and systematically explores the trade-offs between model size, architecture, and performance.
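For orientation, here is a minimal conceptual sketch of the LLaVA-style design that LLaVA-MORE builds on: a vision encoder extracts patch features, a small MLP projector maps them into the LLM embedding space, and the resulting visual tokens are consumed by the language model together with the text. The class and attribute names below are illustrative only, not the actual LLaVA-MORE modules.

```python
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    """Conceptual sketch of a LLaVA-style MLLM (illustrative, not the LLaVA-MORE code)."""

    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. CLIP ViT-L/14-336 or SigLIP
        self.language_model = language_model   # e.g. LLaMA 3.1 8B Instruct
        # Two-layer MLP projector mapping visual features into the LLM embedding space
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, text_embeds):
        # Patch-level visual features: (batch, num_patches, vision_dim)
        visual_feats = self.vision_encoder(pixel_values)
        # Project into the LLM token-embedding space: (batch, num_patches, llm_dim)
        visual_tokens = self.projector(visual_feats)
        # For simplicity the visual tokens are prepended here; the real implementation
        # splices them in at the position of the <image> placeholder token.
        fused = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```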
## Model Details
This model is a fine-tuned variant of LLaVA-MORE. It uses `Meta-Llama-3.1-8B-Instruct` as its Large Language Model backbone and `openai/clip-vit-large-patch14-336` as its visual backbone, and has been fine-tuned on the LLaVA-Instruct-665K dataset.
- Developed by: Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara
- Shared by: AImageLab
- Model type: Multimodal Large Language Model (MLLM)
- Language(s): English
- License: Apache 2.0
- Finetuned from model: This model is fine-tuned from a LLaVA-MORE pre-trained model (which itself uses `meta-llama/Meta-Llama-3.1-8B-Instruct` and is pre-trained on `liuhaotian/LLaVA-Pretrain`).
### Model Sources
- Repository: https://github.com/FedericoCocchi/LLaVA-MORE
- Paper: https://huggingface.co/papers/2503.15621
- Project Website: https://aimagelab.ing.unimore.it/imagelab
## Uses

### Direct Use
This model is intended for various multimodal tasks (see the illustrative prompt sketch after this list), including:
- Visual Question Answering (VQA)
- Multimodal reasoning
- Image captioning
- General instruction following related to visual content
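As an illustration of how these tasks map onto the model's chat format, the snippet below sketches two example user turns. The prompts and the image path are placeholders; the full loading and generation code is shown in the quick-start section further down.

```python
from PIL import Image

# Load any local image (the path is a placeholder).
image = Image.open("./path/to/your/image.jpg").convert("RGB")

# Example user turns for visual question answering and image captioning,
# in the chat format used in the quick-start example below.
vqa_messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "How many people are in this picture, and what are they doing?"},
    ]}
]

captioning_messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Write a detailed caption for this image."},
    ]}
]
```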
### Out-of-Scope Use
The model should not be used for:
- Generating harmful, biased, or inappropriate content.
- High-stakes applications without thorough domain-specific evaluation and human oversight.
## Bias, Risks, and Limitations
Like all large language models, this model may exhibit biases present in its training data. Performance may vary across different visual domains or languages not adequately represented in the training corpus. Users should be aware of potential hallucinations or incorrect information generation, especially in complex or ambiguous visual contexts.
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. It is recommended to implement appropriate safeguards and thorough evaluation for specific high-stakes applications.
## How to Get Started with the Model
You can use the Hugging Face `transformers` library to load and use this model:
```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO

# Load model and processor.
# NOTE: LlavaForConditionalGeneration is the standard transformers class for
# LLaVA-style checkpoints; if this checkpoint requires the original LLaVA-MORE
# codebase instead, use the inference scripts from the GitHub repository above.
model_path = "aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning"
model = LlavaForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# Example image (replace with your image path or a URL)
# For a local image:
# image = Image.open("./path/to/your/image.jpg").convert("RGB")
# For an image from a URL:
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")

# Prepare messages for multimodal input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail and tell me what make and model the car is."},
        ],
    }
]

# Apply chat template and process inputs
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device, torch.float16)

# Generate a response with the parameters from generation_config.json
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95)

# Decode only the newly generated tokens (i.e. skip the prompt)
generated_text = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(generated_text)
```
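Since the card's pipeline tag is `image-text-to-text`, recent versions of `transformers` should also expose the model through the high-level `pipeline` API. The following is a minimal sketch under the assumption that the checkpoint loads with the standard transformers LLaVA classes:

```python
import torch
from transformers import pipeline

# High-level alternative via the image-text-to-text pipeline
# (assumes the checkpoint is compatible with the standard transformers LLaVA classes).
pipe = pipeline(
    "image-text-to-text",
    model="aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning",
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"},
            {"type": "text", "text": "What make and model is this car?"},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=256, return_full_text=False)
print(outputs[0]["generated_text"])
```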
## Training Details
LLaVA-MORE models are trained with a unified protocol to allow fair comparisons. Training proceeds in two stages (a conceptual sketch of what is trainable in each stage follows this list):
- Pretraining: models are first pretrained on the LCS-558K dataset.
- Finetuning: models are then fine-tuned on the LLaVA-Instruct-665K dataset.
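As a rough illustration of the two stages, following the standard LLaVA v1.5 recipe (the exact settings are in the paper and the training scripts), the first stage typically updates only the vision-to-language projector while the vision encoder and LLM stay frozen, and the second stage additionally unfreezes the LLM. The attribute names below reuse the hypothetical ones from the architecture sketch above, not the actual LLaVA-MORE module names.

```python
def set_training_stage(model, stage: int):
    """Toggle trainable parameters for the two LLaVA-style training stages.

    Uses the hypothetical vision_encoder / projector / language_model attributes
    from the sketch above; the real LLaVA-MORE code is organized differently.
    """
    # The vision encoder stays frozen in both stages (standard LLaVA v1.5 recipe).
    for p in model.vision_encoder.parameters():
        p.requires_grad = False

    # The projector is trained in both stages.
    for p in model.projector.parameters():
        p.requires_grad = True

    # The LLM is frozen during pretraining (stage 1) and unfrozen for
    # instruction finetuning (stage 2).
    for p in model.language_model.parameters():
        p.requires_grad = (stage == 2)

# Stage 1: pretraining on LCS-558K (projector only)
# set_training_stage(model, stage=1)
# Stage 2: finetuning on LLaVA-Instruct-665K (projector + LLM)
# set_training_stage(model, stage=2)
```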
### Training Procedure
Training scripts and detailed instructions for distributed training on HPC facilities with a SLURM scheduler are publicly available in the GitHub repository. The training runs were logged using Weights & Biases (WandB).
#### Training Hyperparameters
- Training regime: Mixed precision (specifics on fp16/bf16 can be found in the paper).
## Evaluation
The following table presents the performance of LLaVA-MORE models compared to other versions of LLaVA across various multimodal datasets. The models in bold represent LLaVA-MORE variants.
| Model Name | Text-VQA* | Science-QA | AI2D | SEED-vid | SEED-all | SEED-img | MMMU | MMBench-Cn | MMBench-En | POPE | GQA | MME-P | MME-C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5-7B | 58.2 | 69.0 | 56.4 | 42.0 | 61.6 | 66.8 | 34.2 | 56.5 | 65.3 | 85.6 | 62.4 | 1474.3 | 314.6 |
| LLaVA-v1.5-LLaMA3-8B | 57.6 | 74.2 | 60.7 | 42.0 | 64.3 | 70.1 | 37.3 | 65.4 | 70.3 | 85.4 | 63.5 | 1544.4 | 330.3 |
| LLaVA-v1.5-LLaMA3_1-8B | 58.4 | 76.3 | 61.8 | 42.4 | 64.1 | 69.8 | 39.4 | 68.2 | 72.4 | 85.1 | 63.6 | 1531.5 | 353.3 |
| LLaVA-v1.5-LLaMA3_1-8B-S2 | 60.9 | 76.7 | 62.2 | 42.3 | 64.2 | 69.9 | 38.7 | 65.8 | 71.1 | 86.5 | 64.5 | 1563.8 | 293.2 |
| LLaVA-v1.5-LLaMA3_1-8B-siglip | 62.1 | 77.5 | 63.6 | 46.1 | 65.8 | 71.0 | 39.8 | 68.2 | 73.1 | 86.1 | 64.6 | 1531.0 | 315.4 |
| LLaVA-v1.5-LLaMA3_1-8B-S2-siglip | 63.5 | 77.1 | 62.7 | 44.7 | 65.5 | 71.0 | 40.0 | 68.0 | 71.8 | 86.0 | 64.9 | 1541.4 | 336.4 |
| LLaVA-v1.5-Phi_4-4B | 54.0 | 71.3 | 61.1 | 42.3 | 63.5 | 69.1 | 38.8 | 64.2 | 69.2 | 85.9 | 62.1 | 1372.2 | 281.1 |
| LLaVA-v1.5-gemma_2-9B | 60.7 | 75.4 | 64.8 | 44.1 | 64.5 | 69.9 | 37.9 | 65.9 | 71.9 | 86.8 | 64.2 | 1522.5 | 307.5 |
| LLaVA-v1.5-gemma_2-9B-siglip2 | 66.7 | 76.2 | 65.3 | 46.0 | 67.5 | 73.1 | 38.7 | 68.0 | 72.0 | 86.1 | 65.6 | 1510.9 | 308.2 |
| LLaVA-v1.5-Distill-LLaMA-8B | 56.3 | 74.5 | 58.8 | 43.5 | 63.5 | 68.6 | 38.1 | 66.8 | 61.3 | 85.1 | 63.0 | 1495.1 | 295.0 |
The evaluation protocol can be reproduced using the lmms-eval library. Detailed instructions and scripts are available in the GitHub repository.
## Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). Specifics for this model's training are available in the paper and project repository.
- Cloud Provider: CINECA
- Compute Region: Italy (assuming CINECA's primary location)
## Citation
If you make use of our work, please cite our paper:

```bibtex
@inproceedings{cocchi2025llava,
  title={{LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning}},
  author={Cocchi, Federico and Moratelli, Nicholas and Caffagni, Davide and Sarto, Sara and Baraldi, Lorenzo and Cornia, Marcella and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
  year={2025}
}
```
## Acknowledgments
We thank the LLaVA team for open-sourcing a modular codebase to extend and train different models within the LLaVA family. We are also happy users of the lmms-eval library, which has significantly reduced the evaluation time of our checkpoints across different datasets.
We also thank CINECA for the availability of high-performance computing resources used to train LLaVA-MORE. This work is supported by the PNRR-M4C2 project FAIR - Future Artificial Intelligence Research and by the PNRR project ITSERR - Italian Strengthening of ESFRI RI Resilience.
## Model Card Contact
[More Information Needed]