|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- ibm-granite/granite-vision-3.3-2b-preview |
|
library_name: transformers |
|
--- |
|
# granite-vision-embedding-3.3-2b |
|
**Model Summary:** |
|
granite-vision-embedding-3.3-2b is an efficient embedding model based on the granite-vision Vision Language Model (VLM). It is specifically designed for multi-modal document retrieval, enabling queries over documents containing tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
|
The model eliminates the need for OCR-based text extraction and related preprocessing steps. |
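
For background, ColBERT-style retrieval keeps one embedding vector per query token and per page patch, and scores a query against a page with a late-interaction (MaxSim) rule: each query vector is matched to its most similar page vector and the maxima are summed. The sketch below illustrates only that scoring rule, using random, L2-normalized vectors with illustrative shapes; it is not the model's actual output.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim), page_emb: (num_page_patches, dim),
    # both assumed L2-normalized per vector.
    sim = query_emb @ page_emb.T  # cosine similarities, (tokens, patches)
    # For each query token, keep its best-matching page patch, then sum.
    return sim.max(dim=1).values.sum()

# Illustrative shapes and random values only (placeholders, not real embeddings).
q = torch.nn.functional.normalize(torch.randn(16, 128), dim=-1)
p = torch.nn.functional.normalize(torch.randn(729, 128), dim=-1)
print(maxsim_score(q, p))
```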
|
|
|
|
|
**Evaluations:** |
|
We evaluated granite-vision-embedding-3.3-2b alongside other leading ColBERT-style multi-modal embedding models in the 1B-3B parameter range using two benchmarks, ViDoRe V2 and [Real-MM-RAG-Bench](https://arxiv.org/abs/2502.12342), which specifically target retrieval over complex multi-modal documents.
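
For reference, NDCG@5 (the metric reported in the tables below) discounts relevance by rank over the top 5 retrieved pages and normalizes by the ideal ranking; the exact gain function can vary slightly between benchmark implementations, but the standard form is:

$$
\mathrm{DCG@5}=\sum_{i=1}^{5}\frac{2^{rel_i}-1}{\log_2(i+1)},\qquad
\mathrm{NDCG@5}=\frac{\mathrm{DCG@5}}{\mathrm{IDCG@5}}
$$

where $rel_i$ is the relevance of the page retrieved at rank $i$ and IDCG@5 is the DCG@5 of the ideal ordering.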
|
|
|
## **NDCG@5 - ViDoRe V2** |
|
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b | |
|
|----------------------------------------|--------------|------------------|-------------|--------------------------| |
|
| ESG Restaurant Human | 51.10 | 68.40 | 65.80 | 60.00 | |
|
| Economics Macro Multilingual | 49.90 | 56.50 | 55.40 | 50.13 | |
|
| MIT Biomedical | 59.70 | 63.60 | 63.50 | 60.00 | |
|
| ESG Restaurant Synthetic | 57.00 | 57.40 | 56.60 | 54.00 | |
|
| ESG Restaurant Synthetic Multilingual | 55.70 | 57.40 | 57.20 | 52.00 | |
|
| MIT Biomedical Multilingual | 56.50 | 61.10 | 62.50 | 54.00 | |
|
| Economics Macro | 51.60 | 59.80 | 60.20 | 57.00 | |
|
| **Avg (ViDoRe2)** | **54.50** | **60.60** | **60.17** | **55.30** | |
|
|
|
## **NDCG@5 - REAL-MM-RAG** |
|
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b | |
|
|----------------------------------------|--------------|------------------|-------------|--------------------------| |
|
| FinReport | 0.55 | 0.66 | 0.78 | 0.60 | |
|
| FinSlides | 0.68 | 0.79 | 0.81 | 0.72 | |
|
| TechReport | 0.78 | 0.86 | 0.88 | 0.80 | |
|
| TechSlides | 0.90 | 0.93 | 0.92 | 0.92 | |
|
| **Avg (REAL-MM-RAG)** | **0.73** | **0.81** | **0.85** | **0.76** | |
|
|
|
- **Release Date**: June 2025 |
|
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
**Supported Input Format:** |
|
Currently, the model supports English queries and images (PNG, JPEG, etc.) as input formats.
|
**Intended Use:** |
|
The model is intended to be used in enterprise applications that involve retrieval of visual and text data. In particular, it is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever or alongside a text-based retriever.
|
### Usage |
|
First, make sure to install the required dependencies, including the latest version of transformers:
|
```shell |
|
pip install -q torch torchvision torchaudio |
|
pip install transformers>=4.49 |
|
``` |
|
Then run the code: |
|
```python |
|
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel
|
|
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
model_name = "ibm-granite/granite-vision-embedding-3.3-2b" |
|
model = AutoModel.from_pretrained(model_name, trust_remote_code=True,torch_dtype=torch.float16).to(device).eval() |
|
processor = AutoProcessor.from_pretrained(model_name,trust_remote_code=True) |
|
|
|
|
|
# ───────────────────────────────────────────── |
|
# Inputs: Image + Text |
|
# ───────────────────────────────────────────── |
|
image_url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg" |
|
print("\nFetching image...") |
|
image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB") |
|
|
|
text = "A photo of a tiger" |
|
print(f"Image and text inputs ready.") |
|
|
|
# Process both inputs |
|
print("Processing inputs...") |
|
image_inputs = processor.process_images([image]) |
|
text_inputs = processor.process_queries([text]) |
|
|
|
# Move to correct device |
|
image_inputs = {k: v.to(device) for k, v in image_inputs.items()} |
|
text_inputs = {k: v.to(device) for k, v in text_inputs.items()} |
|
|
|
# ───────────────────────────────────────────── |
|
# Run Inference |
|
# ───────────────────────────────────────────── |
|
with torch.no_grad(): |
|
print("🔍 Getting image embedding...") |
|
img_emb = model(**image_inputs) |
|
|
|
print("✍️ Getting text embedding...") |
|
txt_emb = model(**text_inputs) |
|
|
|
# ───────────────────────────────────────────── |
|
# Score the similarity |
|
# ───────────────────────────────────────────── |
|
print("Scoring similarity...") |
|
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device) |
|
|
|
print("\n" + "=" * 50) |
|
print(f"📊 Similarity between image and text: {similarity.item():.4f}") |
|
print("=" * 50) |
|
|
|
``` |
|
### Use granite-vision-embedding-3.3-2b for MM RAG |
|
For an example of MM RAG using granite-vision-embedding-3.3-2b, refer to [this notebook](......).
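
Until the notebook link is available, the sketch below shows the gist of using the model as a standalone page retriever: it reuses the `process_images` / `process_queries` / `score` calls from the snippet above to rank a small set of page images against one query. The page paths, query text, and top-k value are placeholders, and it assumes `score` returns a queries-by-pages score matrix, as the single-pair example above suggests.

```python
from pathlib import Path

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-vision-embedding-3.3-2b"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Placeholder corpus: one image per document page.
page_paths = sorted(Path("pages").glob("*.png"))
pages = [Image.open(p).convert("RGB") for p in page_paths]

query = "What was the revenue growth in 2024?"  # placeholder query

with torch.no_grad():
    page_inputs = {k: v.to(device) for k, v in processor.process_images(pages).items()}
    query_inputs = {k: v.to(device) for k, v in processor.process_queries([query]).items()}
    page_embs = model(**page_inputs)
    query_embs = model(**query_inputs)

# Late-interaction scores for the single query against every page.
scores = processor.score(query_embs, page_embs, batch_size=4, device=device)

# Report the top-3 pages (the top-k choice is arbitrary here).
top = scores[0].topk(min(3, len(pages)))
for rank, (s, idx) in enumerate(zip(top.values.tolist(), top.indices.tolist()), start=1):
    print(f"{rank}. {page_paths[idx].name} (score: {s:.2f})")
```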
|
|
|
**Model Architecture:** |
|
We built our model upon [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b) with an additional projection layer.
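
As a rough illustration of what such a projection head might look like (the dimensions and normalization here are assumptions for the sketch, not the model's actual configuration), a per-token linear layer maps the VLM's hidden states into the multi-vector embedding space:

```python
import torch
import torch.nn as nn

class MultiVectorProjection(nn.Module):
    # Hypothetical projection head: maps per-token hidden states from the VLM
    # backbone to a lower-dimensional ColBERT-style embedding space.
    def __init__(self, hidden_size: int = 2048, embed_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len, embed_dim)
        emb = self.proj(hidden_states)
        # L2-normalize each token vector, as is typical for late-interaction retrieval.
        return torch.nn.functional.normalize(emb, dim=-1)
```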
|
**Training Data:** |
|
The model was trained on a random subset of DOCFM. DOCFM is a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance) reports. For each image in the dataset, pseudo-questions were generated using the Pixtral-12B VLM.
|
**Infrastructure:** |
|
We trained granite-vision-embedding-3.3-2b on IBM's cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
|
**Ethical Considerations and Limitations:** |
|
The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. granite-vision-embedding-3.3-2b is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
|
Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-embedding-3.3-2b with ethical intentions and in a responsible way.
|
**Resources** |
|
- :page_facing_up: Granite Vision technical report [here](https://arxiv.org/abs/2502.09927) |
|
- :star:️ Learn about the latest updates with Granite: https://www.ibm.com/granite |
|
- :rocket: Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ |
|
- :bulb: Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources |
|
|