# Ops-MM-embedding-v1-2B
Ops-MM-embedding-v1-2B is a dense, large-scale multimodal embedding model developed and open-sourced by the Alibaba Cloud OpenSearch-AI team, fine-tuned from Qwen2-VL.
## Key Features

**Unified Multimodal Embeddings**
- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval (see the video-frame sketch after this list).

**High Performance on MMEB**
- Achieves SOTA results among models of similar scale on the MMEB-V2 and MMEB-Image benchmarks (as of 2025-07-03).

**Multilingual Capabilities**
- The larger variant (Ops-MM-embedding-v1-7B) achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization.
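
The Key Features above note that videos are handled by treating sampled frames as multiple image inputs. The sketch below shows one way to wire that up, reusing the multi-image call demonstrated in the Usage section; the OpenCV/PIL frame sampling and the `example.mp4` path are assumptions for illustration, not part of this repository.

```python
# Hedged sketch: embed a video by uniformly sampling frames and passing them
# as one multi-image input (a list of images -> a single embedding vector).
import cv2
from PIL import Image

from ops_mm_embedding_v1 import OpsMMEmbeddingV1

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

model = OpsMMEmbeddingV1("OpenSearch-AI/Ops-MM-embedding-v1-2B", device="cuda")
frames = sample_frames("example.mp4")                    # hypothetical local file
video_embedding = model.get_image_embeddings([frames])   # one vector for the whole frame list
```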
## Training data

MMEB-train, CC-3M, and the ColPali training set.
## Performance

### MMEB-V2
| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
|---|---|---|---|---|---|
| seed-1.6-embedding | unknown | 71.27 | 77.78 | 55.34 | 73.44 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.85 | 65.36 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |
### MMEB-Image

The table below compares performance on the MMEB-Image benchmark among models of similar size.
| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
|---|---|---|---|---|---|---|
| Ops-MM-embedding-v1-2B | 2.21 | 69.03 | 68.07 | 65.11 | 69.17 | 80.85 |
| B3_Qwen2_2B | 2.21 | 68.1 | 67 | 61.19 | 70.85 | 79.88 |
| LLaVE-2B | 1.95 | 65.2 | 62.1 | 60.2 | 65.2 | 84.9 |
### ViDoRe-v2

| Model | Avg | ESG Restaurant Human | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
|---|---|---|---|---|---|
| gme-7B | 55.61 | 63.37 | 49.49 | 54.21 | 55.38 |
| seed-1.6-embedding | 56.57 | 63.3 | 57.14 | 53.85 | 51.99 |
| Ops-MM-embedding-v1-7B | 59.59 | 66.27 | 54.34 | 60.92 | 56.82 |
| Ops-MM-embedding-v1-2B | 53.18 | 58.57 | 52.87 | 47.89 | 53.39 |
## Usage
```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image

model = OpsMMEmbeddingV1(
    "OpenSearch-AI/Ops-MM-embedding-v1-2B",
    device="cuda",
    attn_implementation="flash_attention_2"
)

t2i_prompt = "Find an image that matches the given text."
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Alibaba office.",
    "Alibaba office.",
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg"
]
images = [fetch_image(image) for image in images]

# Text-only and image-only embeddings
text_embeddings = model.get_text_embeddings(texts)
image_embeddings = model.get_image_embeddings(images)
print('Text vs. image similarity', (text_embeddings @ image_embeddings.T).tolist())

# Fused text+image embeddings, guided by an instruction
text_with_image_embeddings = model.get_fused_embeddings(texts=texts, images=images, instruction=t2i_prompt)
print('Fused vs. image similarity', (text_with_image_embeddings @ image_embeddings.T).tolist())

# Multi-image embeddings: each inner list of images is encoded into a single vector
multi_images = [
    [images[0]],
    [images[1], images[2]],
]
multi_image_embeddings = model.get_image_embeddings(multi_images)
print('Multi-image similarity', (multi_image_embeddings @ multi_image_embeddings.T).tolist())
```
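
Building on the snippet above, a minimal text-to-image retrieval loop could rank the candidate images for each query text by embedding similarity. It reuses the `model`, `texts`, and `images` objects already defined; the ranking code is plain Python for illustration and not an API of this repository.

```python
# Hedged retrieval sketch: score every (text, image) pair and report the
# best-matching image index for each query text.
query_embeddings = model.get_text_embeddings(texts)
candidate_embeddings = model.get_image_embeddings(images)

# Similarity matrix: one row per text query, one column per candidate image.
scores = (query_embeddings @ candidate_embeddings.T).tolist()
for text, row in zip(texts, scores):
    ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    print(f"{text!r} -> top image index {ranked[0]} (score {row[ranked[0]]:.3f})")
```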