---
license: mit
language:
- en
tags:
- text-embeddings
- telecom
- domain-adaptation
- triplet-loss
- transformer
- semantic-search
- sentence-transformers
- domain-specific
- contrastive-learning
- simcse
- bio-bert
- don’t-stop-pretraining
metrics:
- name: Telecom Triplet Score
  type: accuracy
  value: 0.9380
  verified: false
- name: Average MTEB Score
  type: accuracy
  value: 0.825
  verified: false
- name: Average STS Score
  type: spearman
  value: 82.19
  verified: false
- name: AllNLI Triplet Score
  type: accuracy
  value: 0.6150
  verified: false
base_model:
- Alibaba-NLP/gte-Qwen2-1.5B-instruct
model-index:
- name: T-VEC
  results:
  - task:
      type: text-embedding
      name: Telecom Triplet Benchmark
    dataset:
      type: custom
      name: Telecom Triplet Benchmark
    metrics:
    - name: Telecom Triplet Score
      type: accuracy
      value: 0.9380
      verified: false
  - task:
      type: text-embedding
      name: MTEB Benchmark
    dataset:
      type: openai_humaneval
      name: MTEB Benchmark
    metrics:
    - name: Average MTEB Score
      type: accuracy
      value: 0.825
      verified: false
  - task:
      type: text-embedding
      name: STS Benchmark
    dataset:
      type: openai_humaneval
      name: STS Benchmark
    metrics:
    - name: Average STS Score
      type: spearman
      value: 82.19
      verified: false
  - task:
      type: text-embedding
      name: AllNLI Triplet
    dataset:
      type: openai_humaneval
      name: AllNLI Triplet
    metrics:
    - name: Triplet Score
      type: accuracy
      value: 0.6150
      verified: false
extra_gated_prompt: "Please provide answers to the below questions to gain access to the model"
extra_gated_fields:
  Company: text
  Full Name: text
  Email: text
  I want to use this model for:
    type: select
    options:
    - Research
    - Education
    - Commercial
    - label: Other
      value: other
---

# T-VEC: A Telecom-Specific Text Embedding Model

## Overview

**T-VEC (Telecom Vectorization Model)** is a domain-adapted text embedding model developed by NetoAI and fine-tuned from [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct). Through deep triplet-loss fine-tuning, T-VEC learns rich semantic representations tailored to telecom use cases, achieving state-of-the-art results on a custom telecom benchmark while remaining competitive on standard ones.

## Model Details

- **Model Name**: T-VEC
- **Developer**: [NetoAI](https://www.netoai.ai)
- **Base Model**: Alibaba-NLP/gte-Qwen2-1.5B-instruct
- **Parameters**: 1.5 billion
- **Embedding Dimension**: 1536
- **Max Input Tokens**: 32,000
- **Languages**: Multilingual (optimized for English)
- **License**: MIT
- **Tokenizer**: Custom telecom-specific tokenizer (open source)

## Intended Uses

- Semantic search over telecom documents (3GPP standards, vendor manuals)
- Fault-log analysis for root-cause detection
- Telecom-specific chatbots and Q&A systems
- Regulatory compliance analysis and semantic auditing

## Training Details

- **Objective**: Triplet loss over cosine similarity (a minimal sketch follows this list)
- **Dataset**: 100k+ telecom triplets curated by domain experts over three months
- **Layer Modification**: 338 transformer layers fine-tuned
- **Avg. L2-Norm Weight Change**: 0.7735
- **Enhancements**: Telecom-specific tokenizer and query-aware anchor strategies
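To make the objective concrete, the PyTorch snippet below sketches a triplet loss computed over cosine similarity. It is illustrative only, not the published training code: the margin value, tensor shapes, and the `cosine_triplet_loss` name are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor: torch.Tensor,
                        positive: torch.Tensor,
                        negative: torch.Tensor,
                        margin: float = 0.3) -> torch.Tensor:
    """Push sim(anchor, positive) above sim(anchor, negative) by `margin`.

    The margin of 0.3 is an assumed placeholder, not the published value.
    """
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    # Hinge on the similarity gap; the loss is zero once the margin holds.
    return F.relu(margin - (sim_pos - sim_neg)).mean()

# Toy batch: in fine-tuning, these would be encoder outputs for a telecom
# query (anchor), a relevant passage (positive), and a distractor (negative).
anchor, positive, negative = torch.randn(3, 4, 1536).unbind(0)
print(cosine_triplet_loss(anchor, positive, negative))
```

In the curated dataset, each of the 100k+ triplets pairs a telecom anchor with a semantically matching positive and a contrasting negative, and the loss drives the embeddings apart accordingly.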
## Evaluation Results

| Benchmark                 | Metric               | Score  |
|---------------------------|----------------------|--------|
| Telecom Triplet Benchmark | Accuracy             | 0.9380 |
| MTEB Benchmark            | Accuracy             | 0.825  |
| STS Benchmark             | Spearman Correlation | 82.19  |
| AllNLI Triplet            | Accuracy             | 0.6150 |

T-VEC significantly outperforms both its base model and other strong general-purpose models on telecom-specific benchmarks while retaining competitive general performance. The table below breaks results down by task:

| Model                      | ArguAna | SciDocsRR | STS12   | STS13   | STS14   | STS15   | STS16   | STSBenchmark |
|----------------------------|---------|-----------|---------|---------|---------|---------|---------|--------------|
| gte-Qwen2-1.5B-instruct    | 0.62335 | 0.81558   | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379      |
| T-VEC                      | 0.61150 | 0.83970   | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050      |
| all-MiniLM-L6-v2           | 0.50167 | 0.87119   | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032      |
| all-mpnet-base-v2          | 0.46521 | 0.88654   | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422      |
| bge-base-en-v1.5           | 0.63616 | 0.87494   | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418      |
| e5-base-v2                 | 0.51604 | 0.82834   | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480      |
| jina-embeddings-v2-base-en | 0.44152 | 0.83106   | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842      |
| instructor-xl              | 0.54884 | 0.79538   | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048      |
| gte-base                   | 0.57151 | 0.87083   | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738      |
| multilingual-e5-base       | 0.47829 | 0.80392   | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201      |

![Benchmark comparison across models](https://cdn-uploads.huggingface.co/production/uploads/66fa4fb0ec6983f03c2b1ca2/oIX2bc76Er4TDd5eZCb_C.png)

## Limitations

- Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization
- Its 1.5B-parameter size may complicate deployment on edge devices
- May miss recent telecom developments that fall outside the training set

## Ethical Considerations

- Use in critical telecom systems should be validated by domain experts
- May reflect terminology biases from dominant vendors in the dataset
- Open licensing (MIT) supports transparency and community contributions

## Usage

### Installation

```bash
pip install transformers torch
```

### Load and Run

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")

texts = ["5G NR architecture", "LTE handover", "Core network functions"]
inputs = tokenizer(texts, return_tensors="pt", padding=True,
                   truncation=True, max_length=32000)

# Mean-pool the final hidden states into one embedding per input text.
with torch.no_grad():
    emb = model(**inputs).last_hidden_state.mean(dim=1)

# Cosine similarity of the first text against the remaining two.
cos_sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:], dim=1)
print(cos_sim)
```
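Building on the snippet above, the sketch below ranks a toy corpus of telecom passages against a query, in the spirit of the semantic-search use case listed earlier. The `embed` helper, the corpus strings, and the L2 normalization are illustrative assumptions; the mean pooling simply mirrors the example above rather than an officially documented pooling recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")

def embed(texts):
    # Hypothetical helper: mean-pooled embeddings, L2-normalized so that
    # a plain dot product equals cosine similarity.
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=32000)
    with torch.no_grad():
        emb = model(**inputs).last_hidden_state.mean(dim=1)
    return torch.nn.functional.normalize(emb, dim=1)

corpus = [
    "The UPF anchors user-plane sessions in the 5G core.",
    "X2 handover transfers UE context between neighbouring eNBs.",
    "The AMF terminates NAS signalling from the UE.",
]
scores = embed(["Which 5G core function carries user traffic?"]) @ embed(corpus).T
best = scores.argmax().item()
print(f"Best match: {corpus[best]} (cosine similarity {scores[0, best].item():.3f})")
```

At scale, the same normalized embeddings can be indexed in any vector store and queried with cosine similarity.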
## Citation

```bibtex
@article{ethiraj2025tvec,
  title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning},
  author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya},
  journal={arXiv preprint},
  year={2025},
  url={https://arxiv.org/abs/2504.16460}
}
```

## References

- Ethiraj, V., Menon, S., Vijay, D. "T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning." arXiv:2504.16460, 2025.
- Schroff, F., Kalenichenko, D., Philbin, J. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR, 2015.
- Hermans, A., Beyer, L., Leibe, B. "In Defense of the Triplet Loss for Person Re-Identification." arXiv:1703.07737, 2017.
- Reimers, N., Gurevych, I. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP, 2019.
- Gao, T., Yao, X., Chen, D. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv:2104.08821, 2021.
- Gururangan, S., et al. "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." ACL, 2020.
- Lee, J., Yoon, W., et al. "BioBERT: a pre-trained biomedical language representation model for biomedical text mining." Bioinformatics, 2020.
- Sahu, S. K., Maheshwari, A. "Automatic extraction of telecom network events from log messages." IEEE ICC, 2018.
- Wang, X., Li, Y., Han, J. "Log2Vec: A Deep Embedding Model for Network Log Analysis." IEEE/IFIP DSN, 2021.

## Contact

For questions or contributions, visit [https://www.netoai.ai](https://www.netoai.ai).