AraGemma-Embedding-300m

Model Page: AraGemma-Embedding (Hugging Face)

Authors: Google DeepMind (base model); fine-tuned by Omartificial-Intelligence-Space

Find more: Arabic Semantic Embedding Models


Example notebook for simple RAG and other NLP tasks:

RAG & NLP Tasks Notebook


Model Overview

AraGemma-Embedding-300m is a fine-tuned version of EmbeddingGemma-300M, optimized for Arabic semantic understanding.
This model was fine-tuned on 1 million Arabic triplets (anchor, positive, negative) with Matryoshka Representation Learning (MRL) to enhance semantic similarity, clustering, classification, and retrieval for Arabic texts.
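
The exact training script is not published here; as a rough illustration, the following is a minimal sketch of how triplet fine-tuning with MRL is typically wired up in Sentence Transformers. The dataset name, dimension list, and base-model id are illustrative assumptions, not the authors' actual configuration:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss, MatryoshkaLoss

# Assumed starting point: Google's published EmbeddingGemma checkpoint
model = SentenceTransformer("google/embeddinggemma-300m")

# Hypothetical triplet dataset with (anchor, positive, negative) columns
train_dataset = load_dataset("your-org/arabic-triplets", split="train")  # placeholder name

# Wrap a ranking loss in MatryoshkaLoss so the leading 768/512/256/128
# dimensions of each embedding are all trained to be useful on their own
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128])

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()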

It builds on Google’s Gemma 3 research, making it lightweight, efficient, and deployable on-device (mobile, laptops, desktops) while delivering strong Arabic semantic embedding performance (see the benchmarks below).


Model Information

Input

  • Text string (Arabic or multilingual)
  • Maximum context length: 2048 tokens

Output

  • Dense vector representation of size 768
  • Supports MRL truncation to 512, 256, or 128 dimensions with re-normalization (see the sketch below)
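
A minimal sketch of what MRL truncation looks like in practice: keep the leading dimensions of the full 768-dimensional vector, then re-normalize to unit length. NumPy is used for illustration and the helper name is ours:

import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize to unit length."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Example: shrink a 768-d embedding to 256 dimensions
full = np.random.rand(768).astype(np.float32)  # stand-in for model.encode(...)
small = truncate_embedding(full, dim=256)
print(small.shape)  # (256,)

Recent Sentence Transformers versions also expose a truncate_dim argument on the model constructor, which performs the same truncation automatically at encode time.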

Performance

Benchmark Results

The fine-tuned model shows significant improvements over the base EmbeddingGemma-300M, reflecting stronger Arabic semantic understanding.

Performance Compared with Other Arabic Embedding Models

| Model | Dim | # Params | STS17 | STS22-v2 | Average |
|---|---|---|---|---|---|
| Arabic-Triplet-Matryoshka-V2 | 768 | 135M | 85 | 64 | 75 |
| Arabert-all-nli-triplet-Matryoshka | 768 | 135M | 83 | 64 | 74 |
| AraGemma-Embedding-300m | 768 | 303M | 84 | 62 | 73 |
| GATE-AraBert-V1 | 767 | 135M | 83 | 63 | 73 |
| Marbert-all-nli-triplet-Matryoshka | 768 | 163M | 82 | 61 | 72 |
| Arabic-labse-Matryoshka | 768 | 471M | 82 | 61 | 72 |
| AraEuroBert-Small | 768 | 210M | 80 | 61 | 71 |
| E5-all-nli-triplet-Matryoshka | 384 | 278M | 80 | 60 | 70 |
| text-embedding-3-large | 3072 | - | 81 | 59 | 70 |
| Arabic-all-nli-triplet-Matryoshka | 768 | 135M | 82 | 54 | 68 |
| AraEuroBert-Mid | 1151 | 610M | 83 | 53 | 68 |
| paraphrase-multilingual-mpnet-base-v2 | 768 | 135M | 79 | 55 | 67 |
| AraEuroBert-Large | 2304 | 2.1B | 79 | 55 | 67 |
| text-embedding-ada-002 | 1536 | - | 71 | 62 | 66 |
| text-embedding-3-small | 1536 | - | 72 | 57 | 65 |

Usage

This model is compatible with Sentence Transformers and Hugging Face Transformers.

from sentence_transformers import SentenceTransformer
import torch
from torch.nn.functional import cosine_similarity

# Load the Arabic-optimized embedding model
model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

# Example: Arabic semantic similarity
query = "ما هو الكوكب الأحمر؟"  # "What is the red planet?"
documents = [
    "الزهرة تشبه الأرض في الحجم والقرب.",  # "Venus resembles Earth in size and proximity."
    "المريخ يسمى بالكوكب الأحمر بسبب لونه المميز.",  # "Mars is called the red planet because of its distinctive color."
    "المشتري أكبر كواكب المجموعة الشمسية.",  # "Jupiter is the largest planet in the solar system."
    "زحل يتميز بحلقاته الشهيرة."  # "Saturn is known for its famous rings."
]

query_embedding = model.encode(query)      # shape: (768,)
doc_embeddings = model.encode(documents)   # shape: (4, 768)

# Compute cosine similarity between the query and each document
query_tensor = torch.tensor(query_embedding)
doc_tensors = torch.tensor(doc_embeddings)
similarities = cosine_similarity(query_tensor.unsqueeze(0), doc_tensors)

print(similarities)  # the Mars sentence should score highest
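
If you are on Sentence Transformers 3.0 or newer, the manual torch step above can be replaced with the built-in similarity helper (assuming that version is available in your environment):

# model.similarity defaults to cosine similarity in sentence-transformers >= 3.0
similarities = model.similarity(query_embedding, doc_embeddings)
print(similarities)  # shape (1, 4): one score per document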

Applications

  • Semantic Chunking for RAG (Retrieval-Augmented Generation)
  • Semantic Search & Retrieval (Arabic focus; see the sketch after this list)
  • Clustering and Classification of Arabic documents
  • Cross-lingual retrieval (multilingual data supported)
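
As a concrete sketch of the search and retrieval use case, here is a minimal top-k semantic search over a pre-encoded corpus. The search helper and corpus contents are illustrative, not part of the model's API:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

corpus = [
    "المريخ يسمى بالكوكب الأحمر بسبب لونه المميز.",  # "Mars is called the red planet..."
    "المشتري أكبر كواكب المجموعة الشمسية.",  # "Jupiter is the largest planet..."
]
# Pre-encode the corpus once; normalizing makes dot product equal cosine similarity
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

def search(query: str, top_k: int = 2):
    """Return the top_k corpus entries most similar to the query."""
    query_embedding = model.encode(query, normalize_embeddings=True)
    scores = corpus_embeddings @ query_embedding  # cosine scores via dot product
    best = np.argsort(-scores)[:top_k]
    return [(corpus[i], float(scores[i])) for i in best]

print(search("ما هو الكوكب الأحمر؟"))  # "What is the red planet?"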

Limitations

  • Embedding activations do not support float16; use float32 or bfloat16 instead (a loading sketch follows).
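
A minimal sketch of loading the model in a supported precision; the model_kwargs pass-through is a standard Sentence Transformers option, assumed available in recent versions:

import torch
from sentence_transformers import SentenceTransformer

# float16 is not supported for the embedding activations; load the weights
# in bfloat16 (or keep the default float32) instead.
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/AraGemma-Embedding-300m",
    model_kwargs={"torch_dtype": torch.bfloat16},
)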

Citation

If you use this model in your work, please cite:

@misc{AraGemmaEmbedding2025,
  title={AraGemma-Embedding: Fine-tuned EmbeddingGemma for Arabic Semantic Understanding},
  author={Omartificial-Intelligence-Space},
  year={2025},
  url={https://huggingface.co/Omartificial-Intelligence-Space/AraGemma-Embedding-300m}
}