1995 381 1117

Tom Aarsen

tomaarsen

https://linkedin.com/in/tomaarsen

AI & ML interests

NLP: text embeddings, information retrieval, named entity recognition, few-shot text classification

Recent Activity

upvoted an article about 2 hours ago

How to Choose the Best Open Source LLM for Your Project in 2025

posted an update about 5 hours ago

ModernBERT goes MULTILINGUAL! One of the most requested models I've seen, The Johns Hopkins University's CLSP has trained state-of-the-art massively multilingual encoders using the ModernBERT architecture: mmBERT. Model details: - 2 model sizes: - https://huggingface.co/jhu-clsp/mmBERT-small - https://huggingface.co/jhu-clsp/mmBERT-base - Uses the ModernBERT architecture, but with the Gemma2 multilingual tokenizer (so: flash attention, alternating global/local attention, unpadding/sequence packing, etc.) - Maximum sequence length of 8192 tokens, on the high end for encoders - Trained on 1833 languages using DCLM, FineWeb2, and many more sources - 3 training phases: 2.3T tokens pretraining on 60 languages, 600B tokens mid-training on 110 languages, and 100B tokens decay training on all 1833 languages. - Both models are MIT Licensed, and the full datasets and intermediary checkpoints are also publicly released Evaluation details: - Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning) - Consistently outperforms equivalently sized models on all Multilingual tasks (XTREME, classification, MTEB v2 Multilingual after finetuning) - In short: beats commonly used multilingual base models like mDistilBERT, XLM-R (multilingual RoBERTa), multilingual MiniLM, etc. - Additionally: the ModernBERT-based mmBERT is much faster than the alternatives due to its architectural benefits. Easily up to 2x throughput in common scenarios. Check out the full blogpost with more details. It's super dense & gets straight to the point: https://huggingface.co/blog/mmbert Based on these results, mmBERT should be the new go-to multilingual encoder base models at 300M and below. Do note that the mmBERT models are "base" models, i.e. they're currently only trained to perform Mask Filling. They'll need to be finetuned for downstream tasks like semantic search, classification, clustering, etc.

upvoted a paper about 6 hours ago

mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

View all activity

Organizations

upvoted an article about 2 hours ago

Article

How to Choose the Best Open Source LLM for Your Project in 2025

•

about 6 hours ago

• 10

posted an update about 5 hours ago

Post

ModernBERT goes MULTILINGUAL! One of the most requested models I've seen, The Johns Hopkins University's CLSP has trained state-of-the-art massively multilingual encoders using the ModernBERT architecture: mmBERT.

Model details:
- 2 model sizes:
- jhu-clsp/mmBERT-small
- jhu-clsp/mmBERT-base
- Uses the ModernBERT architecture, but with the Gemma2 multilingual tokenizer (so: flash attention, alternating global/local attention, unpadding/sequence packing, etc.)
- Maximum sequence length of 8192 tokens, on the high end for encoders
- Trained on 1833 languages using DCLM, FineWeb2, and many more sources
- 3 training phases: 2.3T tokens pretraining on 60 languages, 600B tokens mid-training on 110 languages, and 100B tokens decay training on all 1833 languages.
- Both models are MIT Licensed, and the full datasets and intermediary checkpoints are also publicly released

Evaluation details:
- Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning)
- Consistently outperforms equivalently sized models on all Multilingual tasks (XTREME, classification, MTEB v2 Multilingual after finetuning)
- In short: beats commonly used multilingual base models like mDistilBERT, XLM-R (multilingual RoBERTa), multilingual MiniLM, etc.
- Additionally: the ModernBERT-based mmBERT is much faster than the alternatives due to its architectural benefits. Easily up to 2x throughput in common scenarios.

Check out the full blogpost with more details. It's super dense & gets straight to the point: https://huggingface.co/blog/mmbert

Based on these results, mmBERT should be the new go-to multilingual encoder base models at 300M and below. Do note that the mmBERT models are "base" models, i.e. they're currently only trained to perform Mask Filling. They'll need to be finetuned for downstream tasks like semantic search, classification, clustering, etc.