---
license: apache-2.0
language:
- en
base_model:
- ibm-granite/granite-vision-3.3-2b-preview
library_name: transformers
---
					
						
# granite-vision-embedding-3.3-2b

**Model Summary:**
					
						
granite-vision-embedding-3.3-2b is an efficient embedding model based on the granite-vision Vision Language Model (VLM). It is specifically designed for multi-modal document retrieval, enabling queries over documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.

The model eliminates the need for OCR-based text extraction and related preprocessing steps.
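To give an intuition for how ColBERT-style multi-vector retrieval works, the sketch below scores a query against a page by taking, for each query token vector, the maximum similarity to any page token vector and summing the results (MaxSim late interaction). It is a minimal illustration on random tensors with an assumed 128-dimensional embedding space; in practice the `processor.score` utility shown in the Usage section below handles scoring for you.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_vecs: torch.Tensor, page_vecs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: for each query token vector, take the
    maximum similarity over all page token vectors, then sum over query tokens."""
    # query_vecs: (num_query_tokens, dim), page_vecs: (num_page_tokens, dim)
    sim = query_vecs @ page_vecs.T       # (num_query_tokens, num_page_tokens)
    return sim.max(dim=1).values.sum()   # scalar relevance score

# Random, normalized multi-vector embeddings; 128 dims is an illustrative choice.
query = F.normalize(torch.randn(16, 128), dim=-1)
page = F.normalize(torch.randn(729, 128), dim=-1)
print(f"MaxSim score: {maxsim_score(query, page).item():.4f}")
```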
					
						
**Evaluations:**

We evaluated granite-vision-embedding-3.3-2b alongside other leading ColBERT-style multi-modal embedding models in the 1B-3B parameter range on two benchmarks, ViDoRe V2 and [Real-MM-RAG-Bench](https://arxiv.org/abs/2502.12342), which specifically target retrieval over complex multi-modal documents.
					
						
## **NDCG@5 - ViDoRe V2**

| Collection \ Model                    | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
|---------------------------------------|--------------|-----------------|-------------|-------------------------|
| ESG Restaurant Human                  | 51.10        | 68.40           | 65.80       | 60.00                   |
| Economics Macro Multilingual          | 49.90        | 56.50           | 55.40       | 50.13                   |
| MIT Biomedical                        | 59.70        | 63.60           | 63.50       | 60.00                   |
| ESG Restaurant Synthetic              | 57.00        | 57.40           | 56.60       | 54.00                   |
| ESG Restaurant Synthetic Multilingual | 55.70        | 57.40           | 57.20       | 52.00                   |
| MIT Biomedical Multilingual           | 56.50        | 61.10           | 62.50       | 54.00                   |
| Economics Macro                       | 51.60        | 59.80           | 60.20       | 57.00                   |
| **Avg (ViDoRe2)**                     | **54.50**    | **60.60**       | **60.17**   | **55.20**               |
					
						
## **NDCG@5 - REAL-MM-RAG**

| Collection \ Model    | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
|-----------------------|--------------|-----------------|-------------|-------------------------|
| FinReport             | 0.55         | 0.66            | 0.78        | 0.60                    |
| FinSlides             | 0.68         | 0.79            | 0.81        | 0.72                    |
| TechReport            | 0.78         | 0.86            | 0.88        | 0.80                    |
| TechSlides            | 0.90         | 0.93            | 0.92        | 0.92                    |
| **Avg (REAL-MM-RAG)** | **0.73**     | **0.81**        | **0.85**    | **0.79**                |
					
						
- **Release Date**: June 2025
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

**Supported Input Format:**
					
						
Currently, the model supports English queries and images (PNG, JPEG, etc.) as input.
					
						
**Intended Use:**

The model is intended for enterprise applications that involve retrieval of visual and text data. In particular, it is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever or alongside a text-based retriever.
					
						
### Usage

First, make sure to install the latest version of transformers:
					
						
```shell
pip install -q torch torchvision torchaudio
pip install "transformers>=4.49"
```

Then run the code:
					
						
```python
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-vision-embedding-3.3-2b"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)


# ─────────────────────────────────────────────
# Inputs: Image + Text
# ─────────────────────────────────────────────
image_url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
print("\nFetching image...")
image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")

text = "A photo of a tiger"
print("Image and text inputs ready.")

# Process both inputs
print("Processing inputs...")
image_inputs = processor.process_images([image])
text_inputs = processor.process_queries([text])

# Move to correct device
image_inputs = {k: v.to(device) for k, v in image_inputs.items()}
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

# ─────────────────────────────────────────────
# Run Inference
# ─────────────────────────────────────────────
with torch.no_grad():
    print("🔍 Getting image embedding...")
    img_emb = model(**image_inputs)

    print("✍️ Getting text embedding...")
    txt_emb = model(**text_inputs)

# ─────────────────────────────────────────────
# Score the similarity
# ─────────────────────────────────────────────
print("Scoring similarity...")
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)

print("\n" + "=" * 50)
print(f"📊 Similarity between image and text: {similarity.item():.4f}")
print("=" * 50)
```
					
						
### Use granite-vision-embedding-3.3-2b for MM RAG

For an example of MM RAG using granite-vision-embedding-3.3-2b, refer to [this notebook](......).
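Until that notebook link is filled in, here is a minimal sketch of using the model as a standalone retriever over a handful of page images, reusing the `model`, `processor`, and `device` objects from the Usage snippet above. The page file names, the query, and the assumption that `processor.score` returns a (num_queries × num_pages) score matrix are illustrative, not guaranteed by this model card.

```python
import torch
from PIL import Image

# Hypothetical page images from an enterprise document (paths are placeholders).
page_paths = ["report_page_1.png", "report_page_2.png", "report_page_3.png"]
pages = [Image.open(p).convert("RGB") for p in page_paths]

query = "What was the revenue growth in 2024?"  # illustrative query

with torch.no_grad():
    # Embed all pages once (ColBERT-style multi-vector embeddings).
    page_inputs = {k: v.to(device) for k, v in processor.process_images(pages).items()}
    page_embs = model(**page_inputs)

    # Embed the query the same way.
    query_inputs = {k: v.to(device) for k, v in processor.process_queries([query]).items()}
    query_embs = model(**query_inputs)

# Late-interaction scores of the query against every page; higher is better.
scores = processor.score(query_embs, page_embs, batch_size=1, device=device)

# Rank pages and keep the top 2 as context for a downstream RAG generator.
top = torch.topk(scores.squeeze(0), k=2)
for rank, (idx, score) in enumerate(zip(top.indices.tolist(), top.values.tolist()), start=1):
    print(f"Rank {rank}: {page_paths[idx]} (score={score:.2f})")
```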
					
						
**Model Architecture:**

We built our model upon [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b) with an additional projection layer.
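Conceptually, a ColBERT-style projection head of this kind maps the backbone's per-token hidden states into a compact embedding space and normalizes each vector. The sketch below only illustrates that idea; the hidden size, the 128-dimensional output, and the layer name are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiVectorProjectionHead(nn.Module):
    """Schematic ColBERT-style head: project per-token hidden states from the
    VLM backbone to a low-dimensional space and L2-normalize each token vector."""
    def __init__(self, hidden_size: int = 2048, embed_dim: int = 128):  # sizes are illustrative
        super().__init__()
        self.proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the language model
        vecs = self.proj(hidden_states)   # (batch, seq_len, embed_dim)
        return F.normalize(vecs, dim=-1)  # one unit-norm vector per token/patch

# Toy forward pass with random "backbone" outputs, just to show the shapes.
head = MultiVectorProjectionHead()
dummy_hidden = torch.randn(1, 729, 2048)
print(head(dummy_hidden).shape)  # torch.Size([1, 729, 128])
```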
					
						
**Training Data:**

The model was trained on a random subset of DOCFM. DOCFM is a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance) reports. For each image in the dataset, pseudo-questions were generated using the Pixtral-12B VLM.
					
						
**Infrastructure:**

We trained granite-vision-embedding-3.3-2b on IBM’s cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
					
						
**Ethical Considerations and Limitations:**

The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. granite-vision-embedding-3.3-2b is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.

Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-embedding-3.3-2b with ethical intentions and in a responsible way.
					
						
**Resources**
- :page_facing_up: Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
- :star: Learn about the latest updates with Granite: https://www.ibm.com/granite
- :rocket: Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- :bulb: Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources