JobBERT-v2
This is a sentence-transformers model trained specifically for job title matching and similarity. It is fine-tuned from sentence-transformers/all-mpnet-base-v2 on a large dataset of job titles and their associated skills/requirements. The model maps job titles and descriptions to a 1024-dimensional dense vector space and can be used for semantic job title matching, job similarity search, and related HR/recruitment tasks.
The full model architecture:

SentenceTransformer(
  (0): Transformer({'max_seq_length': 64, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Asym(
    (anchor-0): Dense({'in_features': 768, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
    (positive-0): Dense({'in_features': 768, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  )
)

Note the asymmetric (Asym) head: the mean-pooled 768-dimensional MPNet output is projected to 1024 dimensions by one of two Dense layers, one for "anchor" inputs (job titles) and one for the "positive" side used during training. At inference time, job titles are routed through the anchor branch, which is what the text_keys assignment in the usage code below does.
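To make the shapes concrete, here is a minimal, self-contained sketch of what the anchor branch computes. It uses randomly initialized weights and is purely illustrative of the data flow, not the trained model:

import torch

token_embs = torch.randn(1, 64, 768)   # MPNet token embeddings: (batch, seq_len, hidden)
mask = torch.ones(1, 64)               # attention mask (all positions real in this sketch)

# Mean pooling over the sequence dimension, weighted by the attention mask
pooled = (token_embs * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)

# Stand-in for the anchor-branch Dense layer (768 -> 1024, tanh activation)
dense = torch.nn.Linear(768, 1024)
embedding = torch.tanh(dense(pooled))

print(embedding.shape)  # torch.Size([1, 1024])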
First, install the required package (this should also pull in the torch, numpy, and tqdm dependencies used below):

pip install -U sentence-transformers

Then you can load and use the model with the following code:
import torch
import numpy as np
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import batch_to_device, cos_sim

# Load the model
model = SentenceTransformer("TechWolf/JobBERT-v2")

def encode_batch(jobbert_model, texts):
    features = jobbert_model.tokenize(texts)
    features = batch_to_device(features, jobbert_model.device)
    # Route all inputs through the "anchor" branch of the Asym head
    features["text_keys"] = ["anchor"]
    with torch.no_grad():
        out_features = jobbert_model.forward(features)
    return out_features["sentence_embedding"].cpu().numpy()

def encode(jobbert_model, texts, batch_size: int = 8):
    # Sort texts by length so each batch holds similarly sized inputs
    # (less padding), and keep track of the original indices
    sorted_indices = np.argsort([len(text) for text in texts])
    sorted_texts = [texts[i] for i in sorted_indices]
    embeddings = []
    # Encode in batches
    for i in tqdm(range(0, len(sorted_texts), batch_size)):
        batch = sorted_texts[i:i + batch_size]
        embeddings.append(encode_batch(jobbert_model, batch))
    # Concatenate embeddings and restore the original order
    sorted_embeddings = np.concatenate(embeddings)
    original_order = np.argsort(sorted_indices)
    return sorted_embeddings[original_order]
# Example usage
job_titles = [
    'Software Engineer',
    'Senior Software Developer',
    'Product Manager',
    'Data Scientist'
]

# Get embeddings
embeddings = encode(model, job_titles)

# Calculate cosine similarity matrix
similarities = cos_sim(embeddings, embeddings)
print(similarities)
The output will be a similarity matrix where each value represents the cosine similarity between two job titles:
tensor([[1.0000, 0.8723, 0.4821, 0.5447],
        [0.8723, 1.0000, 0.4822, 0.5019],
        [0.4821, 0.4822, 1.0000, 0.4328],
        [0.5447, 0.5019, 0.4328, 1.0000]])
In this example, 'Software Engineer' and 'Senior Software Developer' are by far the most similar pair (0.87), while unrelated roles such as 'Product Manager' vs. 'Data Scientist' score much lower (0.43). The diagonal is 1.0, since every title is identical to itself.
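The same embeddings support job title search. Below is a minimal sketch that ranks a pool of candidate titles against a query title by cosine similarity; it reuses the model and the encode helper defined above, and the query and candidate titles are made up for illustration:

# Rank candidate job titles against a query title (illustrative example)
query = ['Machine Learning Engineer']
candidates = ['Software Engineer', 'Data Scientist', 'HR Manager', 'Backend Developer']

query_emb = encode(model, query)           # shape: (1, 1024)
cand_embs = encode(model, candidates)      # shape: (4, 1024)

scores = cos_sim(query_emb, cand_embs)[0]  # cosine similarity to each candidate
ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
for title, score in ranked:
    print(f'{title}: {score:.4f}')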
Citation

If you use this model, please consider citing the following works:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{gao2021scaling,
    title = {Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author = {Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year = {2021},
    eprint = {2101.06983},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
Base model: sentence-transformers/all-mpnet-base-v2