---
license: mit
language:
- en
tags:
- text-embeddings
- telecom
- domain-adaptation
- triplet-loss
- transformer
- semantic-search
- sentence-transformers
- domain-specific
- contrastive-learning
- simcse
- bio-bert
- don’t-stop-pretraining
metrics:
- name: Telecom Triplet Score
  type: accuracy
  value: 0.9380
  verified: false
- name: Average MTEB Score
  type: accuracy
  value: 0.825
  verified: false
- name: Average STS Score
  type: spearman
  value: 82.19
  verified: false
- name: AllNLI Triplet Score
  type: accuracy
  value: 0.6150
  verified: false
base_model:
- Alibaba-NLP/gte-Qwen2-1.5B-instruct
model-index:
- name: T-VEC
  results:
  - task:
      type: text-embedding
      name: Telecom Triplet Benchmark
    dataset:
      type: custom
      name: Telecom Triplet Benchmark
    metrics:
    - name: Telecom Triplet Score
      type: accuracy
      value: 0.9380
      verified: false
  - task:
      type: text-embedding
      name: MTEB Benchmark
    dataset:
      type: openai_humaneval
      name: MTEB Benchmark
    metrics:
    - name: Average MTEB Score
      type: accuracy
      value: 0.825
      verified: false
  - task:
      type: text-embedding
      name: STS Benchmark
    dataset:
      type: openai_humaneval
      name: STS Benchmark
    metrics:
    - name: Average STS Score
      type: spearman
      value: 82.19
      verified: false
  - task:
      type: text-embedding
      name: AllNLI Triplet
    dataset:
      type: openai_humaneval
      name: AllNLI Triplet
    metrics:
    - name: Triplet Score
      type: accuracy
      value: 0.6150
      verified: false
extra_gated_prompt: "Please provide answers to the below questions to gain access to the model"
extra_gated_fields:
  Company: text
  Full Name: text
  Email: text
  I want to use this model for:
    type: select
    options:
    - Research
    - Education
    - Commercial
    - label: Other
      value: other
---

# T-VEC: A Telecom-Specific Text Embedding Model

## Overview

**T-VEC (Telecom Vectorization Model)** is a domain-adapted text embedding model developed by NetoAI and fine-tuned from [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct). Through deep triplet-loss fine-tuning, T-VEC learns rich semantic representations tailored to telecom use cases, achieving state-of-the-art results on a custom telecom benchmark while remaining competitive on standard ones.

## Model Details

- **Model Name**: T-VEC
- **Developer**: [NetoAI](https://www.netoai.ai)
- **Base Model**: Alibaba-NLP/gte-Qwen2-1.5B-instruct
- **Parameters**: 1.5 billion
- **Embedding Dimension**: 1536
- **Max Input Tokens**: 32,000
- **Languages**: Multilingual (optimized for English)
- **License**: MIT
- **Tokenizer**: Custom telecom-specific tokenizer (open source)

## Intended Uses

- Semantic search over telecom documents (3GPP standards, vendor manuals)
- Fault-log analysis for root-cause detection
- Telecom-specific chatbots and Q&A systems
- Regulatory compliance analysis and semantic auditing

## Training Details

- **Objective**: Triplet loss over cosine similarity (a minimal sketch follows this list)
- **Dataset**: 100k+ telecom triplets curated by domain experts over three months
- **Layer Modification**: 338 transformer layers fine-tuned
- **Avg. L2-Norm Weight Change**: 0.7735
- **Enhancements**: Telecom-specific tokenizer and query-aware anchor strategies
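To make the objective concrete, the PyTorch snippet below sketches a triplet loss computed over cosine similarity. It is illustrative only, not the published training code: the margin value, tensor shapes, and the `cosine_triplet_loss` name are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor: torch.Tensor,
                        positive: torch.Tensor,
                        negative: torch.Tensor,
                        margin: float = 0.3) -> torch.Tensor:
    """Push sim(anchor, positive) above sim(anchor, negative) by `margin`.

    The margin of 0.3 is an assumed placeholder, not the published value.
    """
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    # Hinge on the similarity gap; the loss is zero once the margin holds.
    return F.relu(margin - (sim_pos - sim_neg)).mean()

# Toy batch: in fine-tuning, these would be encoder outputs for a telecom
# query (anchor), a relevant passage (positive), and a distractor (negative).
anchor, positive, negative = torch.randn(3, 4, 1536).unbind(0)
print(cosine_triplet_loss(anchor, positive, negative))
```

In the curated dataset, each of the 100k+ triplets pairs a telecom anchor with a semantically matching positive and a contrasting negative, and the loss drives the embeddings apart accordingly.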
## Evaluation Results

| Benchmark                 | Metric               | Score  |
|---------------------------|----------------------|--------|
| Telecom Triplet Benchmark | Accuracy             | 0.9380 |
| MTEB Benchmark            | Accuracy             | 0.825  |
| STS Benchmark             | Spearman Correlation | 82.19  |
| AllNLI Triplet            | Accuracy             | 0.6150 |

T-VEC significantly outperforms both its base model and other strong general-purpose models on telecom-specific benchmarks while retaining competitive general performance. The table below breaks results down by task:

| Model                      | ArguAna | SciDocsRR | STS12   | STS13   | STS14   | STS15   | STS16   | STSBenchmark |
|----------------------------|---------|-----------|---------|---------|---------|---------|---------|--------------|
| gte-Qwen2-1.5B-instruct    | 0.62335 | 0.81558   | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379      |
| T-VEC                      | 0.61150 | 0.83970   | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050      |
| all-MiniLM-L6-v2           | 0.50167 | 0.87119   | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032      |
| all-mpnet-base-v2          | 0.46521 | 0.88654   | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422      |
| bge-base-en-v1.5           | 0.63616 | 0.87494   | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418      |
| e5-base-v2                 | 0.51604 | 0.82834   | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480      |
| jina-embeddings-v2-base-en | 0.44152 | 0.83106   | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842      |
| instructor-xl              | 0.54884 | 0.79538   | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048      |
| gte-base                   | 0.57151 | 0.87083   | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738      |
| multilingual-e5-base       | 0.47829 | 0.80392   | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201      |

![Benchmark comparison across models](https://cdn-uploads.huggingface.co/production/uploads/66fa4fb0ec6983f03c2b1ca2/oIX2bc76Er4TDd5eZCb_C.png)

## Limitations

- Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization
- Its 1.5B-parameter size may complicate deployment on edge devices
- May miss recent telecom developments that fall outside the training set

## Ethical Considerations

- Use in critical telecom systems should be validated by domain experts
- May reflect terminology biases from dominant vendors in the dataset
- Open licensing (MIT) supports transparency and community contributions

## Usage

### Installation

```bash
pip install transformers torch
```

### Load and Run

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")

texts = ["5G NR architecture", "LTE handover", "Core network functions"]
inputs = tokenizer(texts, return_tensors="pt", padding=True,
                   truncation=True, max_length=32000)

# Mean-pool the final hidden states into one embedding per input text.
with torch.no_grad():
    emb = model(**inputs).last_hidden_state.mean(dim=1)

# Cosine similarity of the first text against the remaining two.
cos_sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:], dim=1)
print(cos_sim)
```
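Building on the snippet above, the sketch below ranks a toy corpus of telecom passages against a query, in the spirit of the semantic-search use case listed earlier. The `embed` helper, the corpus strings, and the L2 normalization are illustrative assumptions; the mean pooling simply mirrors the example above rather than an officially documented pooling recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")

def embed(texts):
    # Hypothetical helper: mean-pooled embeddings, L2-normalized so that
    # a plain dot product equals cosine similarity.
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=32000)
    with torch.no_grad():
        emb = model(**inputs).last_hidden_state.mean(dim=1)
    return torch.nn.functional.normalize(emb, dim=1)

corpus = [
    "The UPF anchors user-plane sessions in the 5G core.",
    "X2 handover transfers UE context between neighbouring eNBs.",
    "The AMF terminates NAS signalling from the UE.",
]
scores = embed(["Which 5G core function carries user traffic?"]) @ embed(corpus).T
best = scores.argmax().item()
print(f"Best match: {corpus[best]} (cosine similarity {scores[0, best].item():.3f})")
```

At scale, the same normalized embeddings can be indexed in any vector store and queried with cosine similarity.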
## Citation

```bibtex
@article{ethiraj2025tvec,
  title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning},
  author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya},
  journal={arXiv preprint},
  year={2025},
  url={https://arxiv.org/abs/2504.16460}
}
```

## References

- Ethiraj, V., Menon, S., Vijay, D. "T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning." arXiv:2504.16460, 2025.
- Schroff, F., Kalenichenko, D., Philbin, J. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR, 2015.
- Hermans, A., Beyer, L., Leibe, B. "In Defense of the Triplet Loss for Person Re-Identification." arXiv:1703.07737, 2017.
- Reimers, N., Gurevych, I. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP, 2019.
- Gao, T., Yao, X., Chen, D. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv:2104.08821, 2021.
- Gururangan, S., et al. "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." ACL, 2020.
- Lee, J., Yoon, W., et al. "BioBERT: a pre-trained biomedical language representation model for biomedical text mining." Bioinformatics, 2020.
- Sahu, S. K., Maheshwari, A. "Automatic extraction of telecom network events from log messages." IEEE ICC, 2018.
- Wang, X., Li, Y., Han, J. "Log2Vec: A Deep Embedding Model for Network Log Analysis." IEEE/IFIP DSN, 2021.

## Contact

For questions or contributions, visit [https://www.netoai.ai](https://www.netoai.ai).