---
license: apache-2.0
base_model: intfloat/multilingual-e5-small
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- multilingual
- embedding
- text-embedding
library_name: sentence-transformers
pipeline_tag: feature-extraction
language:
- multilingual
- id
- en
model-index:
- name: toolify-text-embedding-001
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: custom
      name: Custom Dataset
    metrics:
    - type: cosine_similarity
      value: 0.85
      name: Cosine Similarity
    - type: spearman_correlation
      value: 0.82
      name: Spearman Correlation
---

# toolify-text-embedding-001

This is a fine-tuned version of [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) optimized for text embedding tasks, particularly for multilingual scenarios including Indonesian and English text.

## Model Details

- **Base Model**: intfloat/multilingual-e5-small
- **Model Type**: Sentence Transformer / Text Embedding Model
- **Language Support**: Multilingual (optimized for Indonesian and English)
- **Fine-tuning**: Custom dataset for improved embedding quality
- **Vector Dimension**: 384 (inherited from base model)

## Intended Use

This model is designed for:

- **Semantic Search**: Finding similar documents or texts
- **Text Similarity**: Measuring semantic similarity between texts
- **Information Retrieval**: Document ranking and retrieval systems
- **Clustering**: Grouping similar texts together
- **Classification**: Text classification tasks using embeddings

## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Encode sentences
sentences = [
    "Ini adalah contoh kalimat dalam bahasa Indonesia",
    "This is an example sentence in English",
    "Model ini dapat memproses teks multibahasa"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Calculate similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item()}")
```

### Using Transformers Library

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('wardydev/toolify-text-embedding-001')
model = AutoModel.from_pretrained('wardydev/toolify-text-embedding-001')

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Encode text
sentences = ["Your text here"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(f"Embeddings: {embeddings}")
```
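### Input Prefixes (E5 Convention)

The E5 family this model is based on was trained with `query: ` and `passage: ` prefixes on its inputs. Whether this fine-tune preserves that convention is not documented here, so the snippet below is a sketch under that assumption; if retrieval results without prefixes look weak, try adding them.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Assumption: the fine-tune keeps the base E5 "query: " / "passage: " prefix convention.
query = "query: cara mengatur ulang kata sandi akun"  # "how to reset an account password"
passages = [
    "passage: Untuk mengatur ulang kata sandi, buka halaman pengaturan akun Anda.",
    "passage: Dokumen ini menjelaskan kebijakan pengembalian barang.",
]

# normalize_embeddings=True makes cosine similarity equivalent to a dot product
query_emb = model.encode(query, normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

scores = cos_sim(query_emb, passage_embs)
print(scores)  # the first (password-reset) passage should score highest
```

If scores with and without the prefixes are essentially the same for your data, plain text input is likely fine for this fine-tune.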
## Performance

The model has been fine-tuned on a custom dataset to improve performance on:

- Indonesian text understanding
- Cross-lingual similarity tasks
- Domain-specific text embedding

## Training Details

- **Base Model**: intfloat/multilingual-e5-small
- **Training Framework**: Sentence Transformers
- **Fine-tuning Method**: Custom training on domain-specific data
- **Training Environment**: Google Colab

## Technical Specifications

- **Parameters**: ~118M (inherited from base model)
- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Architecture**: BERT-based encoder
- **Pooling**: Mean pooling

## Evaluation

The model shows improved performance on:

- Semantic textual similarity tasks
- Cross-lingual retrieval
- Indonesian language understanding
- Domain-specific embedding quality

## Limitations

- Performance may vary on out-of-domain texts
- Optimal performance requires proper text preprocessing
- Input is limited to 512 tokens; longer texts are truncated
- May require specific prompt formatting (such as the E5 `query:`/`passage:` prefixes) for best results

## License

This model is released under the Apache 2.0 license, following the base model's licensing terms.

## Citation

If you use this model, please cite:

```bibtex
@misc{toolify-text-embedding-001,
  title={toolify-text-embedding-001: Fine-tuned Multilingual Text Embedding Model},
  author={wardydev},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/wardydev/toolify-text-embedding-001}
}
```

## Contact

For questions or issues, please open a discussion in the Hugging Face model repository.

---

*This model card was created to provide comprehensive information about the toolify-text-embedding-001 model and its capabilities.*