|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- ibm-granite/granite-vision-3.3-2b-preview |
|
library_name: transformers |
|
--- |
|
# granite-vision-embedding-3.3-2b |
|
**Model Summary:** |
|
granite-vision-embedding-3.3-2b is an efficient embedding model based on the granite-vision Vision Language Model (VLM). It is specifically designed for multi-modal document retrieval, enabling queries over documents containing tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
|
The model eliminates the need for OCR-based text extraction and related preprocessing steps. |
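
For background, ColBERT-style retrieval keeps one embedding vector per query token and per page patch, and scores a query against a page with a late-interaction (MaxSim) rule: each query vector is matched to its most similar page vector and the maxima are summed. The sketch below illustrates only that scoring rule, using random, L2-normalized vectors with illustrative shapes; it is not the model's actual output.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim), page_emb: (num_page_patches, dim),
    # both assumed L2-normalized per vector.
    sim = query_emb @ page_emb.T  # cosine similarities, (tokens, patches)
    # For each query token, keep its best-matching page patch, then sum.
    return sim.max(dim=1).values.sum()

# Illustrative shapes and random values only (placeholders, not real embeddings).
q = torch.nn.functional.normalize(torch.randn(16, 128), dim=-1)
p = torch.nn.functional.normalize(torch.randn(729, 128), dim=-1)
print(maxsim_score(q, p))
```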
|
|
|
|
|
**Evaluations:** |
|
We evaluated granite-vision-embedding-3.3-2b alongside other leading ColBERT-style multi-modal embedding models in the 1B-3B parameter range using two benchmarks, ViDoRe V2 and [Real-MM-RAG-Bench](https://arxiv.org/abs/2502.12342), which specifically target retrieval over complex multi-modal documents.
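
For reference, NDCG@5 (the metric reported in the tables below) discounts relevance by rank over the top 5 retrieved pages and normalizes by the ideal ranking; the exact gain function can vary slightly between benchmark implementations, but the standard form is:

$$
\mathrm{DCG@5}=\sum_{i=1}^{5}\frac{2^{rel_i}-1}{\log_2(i+1)},\qquad
\mathrm{NDCG@5}=\frac{\mathrm{DCG@5}}{\mathrm{IDCG@5}}
$$

where $rel_i$ is the relevance of the page retrieved at rank $i$ and IDCG@5 is the DCG@5 of the ideal ordering.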
|
|
|
## **NDCG@5 - ViDoRe V2** |
|
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b | |
|
|----------------------------------------|--------------|------------------|-------------|--------------------------| |
|
| ESG Restaurant Human | 51.10 | 68.40 | 65.80 | 60.00 | |
|
| Economics Macro Multilingual | 49.90 | 56.50 | 55.40 | 50.13 | |
|
| MIT Biomedical | 59.70 | 63.60 | 63.50 | 60.00 | |
|
| ESG Restaurant Synthetic | 57.00 | 57.40 | 56.60 | 54.00 | |
|
| ESG Restaurant Synthetic Multilingual | 55.70 | 57.40 | 57.20 | 52.00 | |
|
| MIT Biomedical Multilingual | 56.50 | 61.10 | 62.50 | 54.00 | |
|
| Economics Macro | 51.60 | 59.80 | 60.20 | 57.00 | |
|
| **Avg (ViDoRe2)** | **54.50** | **60.60** | **60.17** | **55.30** | |
|
|
|
## **NDCG@5 - REAL-MM-RAG** |
|
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b | |
|
|----------------------------------------|--------------|------------------|-------------|--------------------------| |
|
| FinReport | 0.55 | 0.66 | 0.78 | 0.60 | |
|
| FinSlides | 0.68 | 0.79 | 0.81 | 0.72 | |
|
| TechReport | 0.78 | 0.86 | 0.88 | 0.80 | |
|
| TechSlides | 0.90 | 0.93 | 0.92 | 0.92 | |
|
| **Avg (REAL-MM-RAG)** | **0.73** | **0.81** | **0.85** | **0.76** | |
|
|
|
- **Release Date**: June 2025 |
|
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
**Supported Input Format:** |
|
Currently, the model supports English queries and images (PNG, JPEG, etc.) as input formats.
|
**Intended Use:** |
|
The model is intended to be used in enterprise applications that involve retrieval of visual and text data. In particular, it is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever or alongside a text-based retriever.
|
### Usage |
|
First, make sure to install the required dependencies, including the latest version of transformers:
|
```shell |
|
pip install -q torch torchvision torchaudio |
|
pip install transformers>=4.49 |
|
``` |
|
Then run the code: |
|
```python |
|
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel
|
|
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
model_name = "ibm-granite/granite-vision-embedding-3.3-2b" |
|
model = AutoModel.from_pretrained(model_name, trust_remote_code=True,torch_dtype=torch.float16).to(device).eval() |
|
processor = AutoProcessor.from_pretrained(model_name,trust_remote_code=True) |
|
|
|
|
|
# ───────────────────────────────────────────── |
|
# Inputs: Image + Text |
|
# ───────────────────────────────────────────── |
|
image_url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg" |
|
print("\nFetching image...") |
|
image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB") |
|
|
|
text = "A photo of a tiger" |
|
print(f"Image and text inputs ready.") |
|
|
|
# Process both inputs |
|
print("Processing inputs...") |
|
image_inputs = processor.process_images([image]) |
|
text_inputs = processor.process_queries([text]) |
|
|
|
# Move to correct device |
|
image_inputs = {k: v.to(device) for k, v in image_inputs.items()} |
|
text_inputs = {k: v.to(device) for k, v in text_inputs.items()} |
|
|
|
# ───────────────────────────────────────────── |
|
# Run Inference |
|
# ───────────────────────────────────────────── |
|
with torch.no_grad(): |
|
print("🔍 Getting image embedding...") |
|
img_emb = model(**image_inputs) |
|
|
|
print("✍️ Getting text embedding...") |
|
txt_emb = model(**text_inputs) |
|
|
|
# ───────────────────────────────────────────── |
|
# Score the similarity |
|
# ───────────────────────────────────────────── |
|
print("Scoring similarity...") |
|
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device) |
|
|
|
print("\n" + "=" * 50) |
|
print(f"📊 Similarity between image and text: {similarity.item():.4f}") |
|
print("=" * 50) |
|
|
|
``` |
|
### Use granite-vision-embedding-3.3-2b for MM RAG |
|
For an example of MM RAG using granite-vision-embedding-3.3-2b, refer to [this notebook](......).
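
Until the notebook link is available, the sketch below shows the gist of using the model as a standalone page retriever: it reuses the `process_images` / `process_queries` / `score` calls from the snippet above to rank a small set of page images against one query. The page paths, query text, and top-k value are placeholders, and it assumes `score` returns a queries-by-pages score matrix, as the single-pair example above suggests.

```python
from pathlib import Path

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-vision-embedding-3.3-2b"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Placeholder corpus: one image per document page.
page_paths = sorted(Path("pages").glob("*.png"))
pages = [Image.open(p).convert("RGB") for p in page_paths]

query = "What was the revenue growth in 2024?"  # placeholder query

with torch.no_grad():
    page_inputs = {k: v.to(device) for k, v in processor.process_images(pages).items()}
    query_inputs = {k: v.to(device) for k, v in processor.process_queries([query]).items()}
    page_embs = model(**page_inputs)
    query_embs = model(**query_inputs)

# Late-interaction scores for the single query against every page.
scores = processor.score(query_embs, page_embs, batch_size=4, device=device)

# Report the top-3 pages (the top-k choice is arbitrary here).
top = scores[0].topk(min(3, len(pages)))
for rank, (s, idx) in enumerate(zip(top.values.tolist(), top.indices.tolist()), start=1):
    print(f"{rank}. {page_paths[idx].name} (score: {s:.2f})")
```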
|
|
|
**Model Architecture:** |
|
We built our model upon [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b) with an additional projection layer.
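
As a rough illustration of what such a projection head might look like (the dimensions and normalization here are assumptions for the sketch, not the model's actual configuration), a per-token linear layer maps the VLM's hidden states into the multi-vector embedding space:

```python
import torch
import torch.nn as nn

class MultiVectorProjection(nn.Module):
    # Hypothetical projection head: maps per-token hidden states from the VLM
    # backbone to a lower-dimensional ColBERT-style embedding space.
    def __init__(self, hidden_size: int = 2048, embed_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len, embed_dim)
        emb = self.proj(hidden_states)
        # L2-normalize each token vector, as is typical for late-interaction retrieval.
        return torch.nn.functional.normalize(emb, dim=-1)
```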
|
**Training Data:** |
|
The model was trained on a random subset of DOCFM. DOCFM is a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance) reports. For each image in the dataset, pseudo-questions were generated using the Pixtral-12B VLM.
|
**Infrastructure:** |
|
We trained granite-vision-embedding-3.3-2b on IBM's cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
|
**Ethical Considerations and Limitations:** |
|
The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. granite-vision-embedding-3.3-2b is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
|
Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-embedding-3.3-2b with ethical intentions and in a responsible way.
|
**Resources** |
|
- :page_facing_up: Granite Vision technical report [here](https://arxiv.org/abs/2502.09927) |
|
- :star:️ Learn about the latest updates with Granite: https://www.ibm.com/granite |
|
- :rocket: Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ |
|
- :bulb: Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources |
|
|