---
license: apache-2.0
language:
- en
base_model:
- ibm-granite/granite-vision-3.3-2b-preview
library_name: transformers
---

# granite-vision-embedding-3.3-2b
**Model Summary:** granite-vision-embedding-3.3-2b is an efficient embedding model based on the granite-vision Vision Language Model (VLM). It is specifically designed for multi-modal document retrieval, enabling queries over documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages and eliminates the need for OCR-based text extraction and related preprocessing steps.
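As a rough illustration of what ColBERT-style multi-vector retrieval means in practice (a sketch independent of this model's exact API; the tensor shapes and names below are illustrative assumptions), each query and each page is represented by many vectors, and relevance is scored with a late-interaction (MaxSim) sum:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one page.

    query_emb: (num_query_tokens, dim) -- one vector per query token
    page_emb:  (num_page_patches, dim) -- one vector per page patch/token
    Both are assumed to be L2-normalized, so dot products are cosine similarities.
    """
    # Cosine similarity between every query token and every page patch.
    sim = query_emb @ page_emb.T  # (num_query_tokens, num_page_patches)
    # For each query token, keep its best-matching page patch, then sum over query tokens.
    return sim.max(dim=1).values.sum()

# Illustrative shapes only: 16 query tokens, 729 page patches, 128-dim vectors.
q = torch.nn.functional.normalize(torch.randn(16, 128), dim=-1)
p = torch.nn.functional.normalize(torch.randn(729, 128), dim=-1)
print(maxsim_score(q, p))
```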
**Evaluations:** We evaluated granite-vision-embedding-3.3-2b alongside other leading ColBERT-style multi-modal embedding models in the 1B-3B parameter range using two benchmarks, ViDoRe V2 and Real-MM-RAG-Bench, which specifically target retrieval over complex multi-modal documents.
**NDCG@5 - ViDoRe V2**

| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
|---|---|---|---|---|
| ESG Restaurant Human | 51.10 | 68.40 | 65.80 | 60.00 |
| Economics Macro Multilingual | 49.90 | 56.50 | 55.40 | 50.13 |
| MIT Biomedical | 59.70 | 63.60 | 63.50 | 60.00 |
| ESG Restaurant Synthetic | 57.00 | 57.40 | 56.60 | 54.00 |
| ESG Restaurant Synthetic Multilingual | 55.70 | 57.40 | 57.20 | 52.00 |
| MIT Biomedical Multilingual | 56.50 | 61.10 | 62.50 | 54.00 |
| Economics Macro | 51.60 | 59.80 | 60.20 | 57.00 |
| **Avg (ViDoRe V2)** | 54.50 | 60.60 | 60.17 | 55.30 |
**NDCG@5 - REAL-MM-RAG**

| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
|---|---|---|---|---|
| FinReport | 0.55 | 0.66 | 0.78 | 0.60 |
| FinSlides | 0.68 | 0.79 | 0.81 | 0.72 |
| TechReport | 0.78 | 0.86 | 0.88 | 0.80 |
| TechSlides | 0.90 | 0.93 | 0.92 | 0.92 |
| **Avg (REAL-MM-RAG)** | 0.73 | 0.81 | 0.85 | 0.76 |
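For context, NDCG@5 measures how well a retriever places relevant pages within the top five results. A minimal sketch of the metric for a single query with binary relevance labels (illustrative only, not the benchmark harness):

```python
import math

def ndcg_at_k(relevances, k=5):
    """NDCG@k for a single query.

    relevances: relevance grades of the retrieved pages, in ranked order
    (e.g. 1 for a relevant page, 0 otherwise).
    """
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the only relevant page was retrieved at rank 2.
print(ndcg_at_k([0, 1, 0, 0, 0]))  # ≈ 0.63
```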
- **Release Date**: June 2025
- **License**: Apache 2.0

**Supported Input Format:** The model currently supports English queries and images (png, jpeg, etc.) as input.

**Intended Use:** The model is intended for enterprise applications that involve retrieval of visual and text data. In particular, it is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever or alongside a text-based retriever.
### Usage

First, make sure to install the latest version of transformers:

```shell
pip install -q torch torchvision torchaudio
pip install "transformers>=4.49"
```

Then run the code:
```python
import requests
import torch
from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-vision-embedding-3.3-2b"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# ─────────────────────────────────────────────
# Inputs: Image + Text
# ─────────────────────────────────────────────
image_url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
print("\nFetching image...")
image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")
text = "A photo of a tiger"
print("Image and text inputs ready.")

# Process both inputs
print("Processing inputs...")
image_inputs = processor.process_images([image])
text_inputs = processor.process_queries([text])

# Move to the correct device
image_inputs = {k: v.to(device) for k, v in image_inputs.items()}
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

# ─────────────────────────────────────────────
# Run Inference
# ─────────────────────────────────────────────
with torch.no_grad():
    print("🔍 Getting image embedding...")
    img_emb = model(**image_inputs)
    print("✍️ Getting text embedding...")
    txt_emb = model(**text_inputs)

# ─────────────────────────────────────────────
# Score the similarity
# ─────────────────────────────────────────────
print("Scoring similarity...")
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)

print("\n" + "=" * 50)
print(f"📊 Similarity between image and text: {similarity.item():.4f}")
print("=" * 50)
```
### Use granite-vision-embedding-3.3-2b for MM RAG

For an example of MM RAG using granite-vision-embedding-3.3-2b, refer to this notebook.
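As a rough sketch of how the Usage snippet above could be extended to rank a small set of page images for one query (assuming the same `process_images`, `process_queries`, and `score` processor methods shown above, and that `score` returns a queries-by-pages score matrix; the page file names and the query are placeholders):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-vision-embedding-3.3-2b"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Placeholder page images of the knowledge base (swap in your own document pages).
pages = [Image.open(p).convert("RGB") for p in ["page_001.png", "page_002.png", "page_003.png"]]
query = "What was the operating margin in Q3?"

with torch.no_grad():
    page_inputs = {k: v.to(device) for k, v in processor.process_images(pages).items()}
    query_inputs = {k: v.to(device) for k, v in processor.process_queries([query]).items()}
    page_embs = model(**page_inputs)
    query_embs = model(**query_inputs)

# Score the query against every page and keep the best-matching ones.
scores = processor.score(query_embs, page_embs, batch_size=4, device=device)  # assumed shape: (1, num_pages)
top = torch.topk(scores[0], k=min(2, len(pages)))
for rank, (idx, score) in enumerate(zip(top.indices.tolist(), top.values.tolist()), start=1):
    print(f"Rank {rank}: page {idx} (score {score:.2f})")
```

In a full MM RAG pipeline, the top-ranked page images would then be passed, together with the query, to a vision-language model for answer generation.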
**Model Architecture:** We built our model upon granite-vision-3.3-2b with an additional projection layer.

**Training Data:** The model was trained on a random subset of DOCFM. DOCFM is a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance) reports. For each image in the dataset, pseudo-questions were generated using the Pixtral-12B VLM.

**Infrastructure:** We trained granite-vision-embedding-3.3-2b on IBM's cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.

**Ethical Considerations and Limitations:** The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to bias and fairness, misinformation, and autonomous decision-making. granite-vision-embedding-3.3-2b is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-embedding-3.3-2b with ethical intentions and in a responsible way.

**Resources**
- :page_facing_up: Granite Vision technical report here
- :star:️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- :rocket: Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- :bulb: Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources