---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
  - llava
  - multimodal
  - vision-language-model
  - instruction-following
  - visual-question-answering
  - llama
  - llama-3.1
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
datasets:
  - liuhaotian/LLaVA-Pretrain
  - liuhaotian/LLaVA-Instruct-150K
---

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

This model is LLaVA_MORE-llama_3_1-8B-finetuning, a part of the LLaVA-MORE family of Multimodal Large Language Models (MLLMs) presented in the paper LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning.

LLaVA-MORE integrates recent language models with diverse visual backbones. The project employs a unified training protocol applied consistently across all architectures to ensure fair comparisons and systematically explores the trade-offs between model size, architecture, and performance.


Model Details

This model is a fine-tuned variant of LLaVA-MORE. It uses Meta-Llama-3.1-8B-Instruct as its Large Language Model backbone and openai/clip-vit-large-patch14-336 as its visual backbone, and it has been fine-tuned on the LLaVA-Instruct-665K dataset. A short sketch for confirming these backbones from the checkpoint configuration follows the list below.

  • Developed by: Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara
  • Shared by: AImageLab
  • Model type: Multimodal Large Language Model (MLLM)
  • Language(s): English
  • License: Apache 2.0
  • Finetuned from model: This model is fine-tuned from a LLaVA-MORE pre-trained model (which itself uses meta-llama/Meta-Llama-3.1-8B-Instruct and is pre-trained on liuhaotian/LLaVA-Pretrain).
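
As a quick sanity check, the backbones listed above can be read directly from the checkpoint's config.json. This is a minimal sketch, not part of the original card; the mm_vision_tower field name follows the LLaVA-style config convention and is an assumption about this repository's layout.

import json
from huggingface_hub import hf_hub_download

# Download config.json from the Hub and print the fields that identify the
# language and vision backbones. "mm_vision_tower" is the LLaVA-style field
# name and is an assumption about this checkpoint's config layout.
cfg_path = hf_hub_download(
    repo_id="aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning",
    filename="config.json",
)
with open(cfg_path) as f:
    cfg = json.load(f)
print(cfg.get("_name_or_path"))    # underlying language model, if recorded
print(cfg.get("mm_vision_tower"))  # expected: openai/clip-vit-large-patch14-336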

Model Sources

Uses

Direct Use

This model is intended for various multimodal tasks (prompt sketches follow this list), including:

  • Visual Question Answering (VQA)
  • Multimodal reasoning
  • Image captioning
  • General instruction following related to visual content
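
The sketches below illustrate how prompts for these tasks can be phrased with the chat-message format used in the How to Get Started section; the prompt wording and the placeholder image are only examples.

from PIL import Image

# Placeholder image; replace with a real photo for meaningful outputs.
image = Image.new("RGB", (336, 336), color="white")

# Visual Question Answering: ask a pointed question about the image.
vqa_messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "How many people appear in this image?"},
]}]

# Image captioning: ask for a free-form description instead.
caption_messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Write a one-sentence caption for this image."},
]}]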

Out-of-Scope Use

The model should not be used for:

  • Generating harmful, biased, or inappropriate content.
  • High-stakes applications without thorough domain-specific evaluation and human oversight.

Bias, Risks, and Limitations

Like all large language models, this model may exhibit biases present in its training data. Performance may vary across different visual domains or languages not adequately represented in the training corpus. Users should be aware of potential hallucinations or incorrect information generation, especially in complex or ambiguous visual contexts.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. It is recommended to implement appropriate safeguards and thorough evaluation for specific high-stakes applications.

How to Get Started with the Model

You can use the Hugging Face transformers library, together with the LLaVA-MORE codebase, to load and use this model.

import torch
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoProcessor

# Note: LlavaLlamaForCausalLM is provided by the LLaVA-MORE codebase (which
# extends the original LLaVA repository), not by vanilla transformers; install
# that codebase first. The import path follows the original LLaVA convention.
from llava.model import LlavaLlamaForCausalLM

# Load model and processor
model_path = "aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning"
model = LlavaLlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_path)

# Example image (replace with your image path or a URL)
# For a local image:
# image = Image.open("./path/to/your/image.jpg").convert("RGB")
# For an image from a URL:
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")

# Prepare messages for multi-modal input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail and tell me what make and model the car is."},
        ],
    }
]

# Apply chat template and process inputs
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(text=[text], images=[image], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate response with parameters from generation_config.json
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95)
generated_text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]

print(generated_text)
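
Depending on the processor, the decoded string may include the prompt itself. A small optional post-processing step (an assumption, not from the original card) keeps only the newly generated tokens:

# Keep only the tokens generated after the prompt; assumes `inputs` contains
# an "input_ids" tensor as produced by the processor call above.
prompt_len = inputs["input_ids"].shape[1]
answer = processor.batch_decode(output_ids[:, prompt_len:], skip_special_tokens=True)[0]
print(answer)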

Training Details

LLaVA-MORE models are trained using a unified protocol for fair comparisons. The training process involves two stages (restated as a short sketch after this list):

  1. Pretraining: Models are first pretrained on the LCS-558K dataset.
  2. Finetuning: Subsequently, models are fine-tuned on the LLaVA-Instruct-665K dataset.
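
The sketch below simply restates this two-stage recipe in code form; the dataset names come from this card, while the per-stage "trains" descriptions follow the standard LLaVA setup (stage 1 trains the vision-language projector, stage 2 performs full visual instruction tuning) and should be treated as an assumption about the exact LLaVA-MORE configuration.

# Schematic summary of the two training stages (descriptions are assumptions
# based on the standard LLaVA recipe, not the project's actual configs).
training_stages = [
    {"stage": "pretraining", "dataset": "LCS-558K (liuhaotian/LLaVA-Pretrain)",
     "trains": "vision-language projector"},
    {"stage": "finetuning", "dataset": "LLaVA-Instruct-665K",
     "trains": "projector + language model (visual instruction tuning)"},
]
for s in training_stages:
    print(f"{s['stage']}: {s['dataset']} -> trains {s['trains']}")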

Training Procedure

Training scripts and detailed instructions for distributed training on HPC facilities with a SLURM scheduler are publicly available in the GitHub repository. The training runs were logged using Weights & Biases (WandB).

Training Hyperparameters

  • Training regime: Mixed precision (specifics on fp16/bf16 can be found in the paper).

Evaluation

The following table presents the performance of LLaVA-MORE models compared to other versions of LLaVA across various multimodal datasets. The models in bold represent LLaVA-MORE variants.

| Model Name | Text-VQA* | Science-QA | AI2D | SEED-vid | SEED-all | SEED-img | MMMU | MMBench-Cn | MMBench-En | POPE | GQA | MME-P | MME-C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5-7B | 58.2 | 69.0 | 56.4 | 42.0 | 61.6 | 66.8 | 34.2 | 56.5 | 65.3 | 85.6 | 62.4 | 1474.3 | 314.6 |
| **LLaVA-v1.5-LLaMA3-8B** | 57.6 | 74.2 | 60.7 | 42.0 | 64.3 | 70.1 | 37.3 | 65.4 | 70.3 | 85.4 | 63.5 | 1544.4 | 330.3 |
| **LLaVA-v1.5-LLaMA3_1-8B** | 58.4 | 76.3 | 61.8 | 42.4 | 64.1 | 69.8 | 39.4 | 68.2 | 72.4 | 85.1 | 63.6 | 1531.5 | 353.3 |
| **LLaVA-v1.5-LLaMA3_1-8B-S2** | 60.9 | 76.7 | 62.2 | 42.3 | 64.2 | 69.9 | 38.7 | 65.8 | 71.1 | 86.5 | 64.5 | 1563.8 | 293.2 |
| **LLaVA-v1.5-LLaMA3_1-8B-siglip** | 62.1 | 77.5 | 63.6 | 46.1 | 65.8 | 71.0 | 39.8 | 68.2 | 73.1 | 86.1 | 64.6 | 1531.0 | 315.4 |
| **LLaVA-v1.5-LLaMA3_1-8B-S2-siglip** | 63.5 | 77.1 | 62.7 | 44.7 | 65.5 | 71.0 | 40.0 | 68.0 | 71.8 | 86.0 | 64.9 | 1541.4 | 336.4 |
| **LLaVA-v1.5-Phi_4-4B** | 54.0 | 71.3 | 61.1 | 42.3 | 63.5 | 69.1 | 38.8 | 64.2 | 69.2 | 85.9 | 62.1 | 1372.2 | 281.1 |
| **LLaVA-v1.5-gemma_2-9B** | 60.7 | 75.4 | 64.8 | 44.1 | 64.5 | 69.9 | 37.9 | 65.9 | 71.9 | 86.8 | 64.2 | 1522.5 | 307.5 |
| **LLaVA-v1.5-gemma_2-9B-siglip2** | 66.7 | 76.2 | 65.3 | 46.0 | 67.5 | 73.1 | 38.7 | 68.0 | 72.0 | 86.1 | 65.6 | 1510.9 | 308.2 |
| **LLaVA-v1.5-Distill-LLaMA-8B** | 56.3 | 74.5 | 58.8 | 43.5 | 63.5 | 68.6 | 38.1 | 66.8 | 61.3 | 85.1 | 63.0 | 1495.1 | 295.0 |
\* The Text-VQA results are computed with OCR tokens in the input prompt.

The evaluation protocol can be reproduced using the lmms-eval library. Detailed instructions and scripts are available in the GitHub repository.
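
As a rough illustration of how such a run might be launched, the snippet below wraps the lmms-eval command-line interface; the model adapter name ("llava"), the model_args string, and the task identifiers are assumptions and should be checked against the evaluation scripts in the GitHub repository.

import subprocess

# Hypothetical lmms-eval invocation; the flag names follow the
# lm-evaluation-harness style CLI, while the adapter, model_args, and task
# list are assumptions to be verified against the LLaVA-MORE scripts.
subprocess.run([
    "python", "-m", "lmms_eval",
    "--model", "llava",
    "--model_args", "pretrained=aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning",
    "--tasks", "gqa,pope",
    "--batch_size", "1",
], check=True)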

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). Specifics for this model's training are available in the paper and project repository.

  • Cloud Provider: CINECA
  • Compute Region: Italy (CINECA HPC facilities)

Citation

If you make use of our work, please cite our paper:

@inproceedings{cocchi2025llava,
      title={{LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning}},
      author={Cocchi, Federico and Moratelli, Nicholas and Caffagni, Davide and Sarto, Sara and Baraldi, Lorenzo and Cornia, Marcella and Cucchiara, Rita},
      booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
      year={2025}
}

Acknowledgments

We thank the LLaVA team for open-sourcing a modular codebase to extend and train different models within the LLaVA family. We are also happy users of the lmms-eval library, which has significantly reduced the evaluation time of our checkpoints across different datasets.

We also thank CINECA for the availability of high-performance computing resources used to train LLaVA-MORE. This work is supported by the PNRR-M4C2 project FAIR - Future Artificial Intelligence Research and by the PNRR project ITSERR - Italian Strengthening of Esfri RI Resilience.

Model Card Contact

[More Information Needed]