arxiv:2603.10913

LLM2Vec-Gen: Generative Embeddings from Large Language Models

Published on Mar 11 · Submitted by Parishad BehnamGhader on Mar 12

Abstract

LLM2Vec-Gen introduces a self-supervised method for text embedding that represents model responses through trainable special tokens, achieving superior performance on MTEB while reducing harmful content and improving reasoning capabilities.

AI-generated summary

LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models on paired data with contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response as a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to a 43.2% reduction in harmful content retrieval and a 29.3% improvement in reasoning capabilities on embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.

Community

Paper author · Paper submitter

LLM2Vec-Gen is a recipe to train interpretable, generative embeddings that encode the potential answer of an LLM to a query rather than the query itself.
Specifically, we optimize additional trainable special tokens to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities, such as safety alignment and reasoning, to embedding tasks. LLM2Vec-Gen embeddings are interpretable and can be decoded into text to reveal their semantic content.
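The recipe above can be sketched in miniature. This is a hypothetical numpy toy, not the paper's implementation: the shapes, the pooled "read-out", and the plain sum-of-squares losses are all stand-ins. It only illustrates the core idea that a handful of trainable special-token vectors (the sole trainable parameters) are optimized against two targets, one standing in for the LLM's own completion and one for the unsupervised embedding teacher, while the backbone stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_special = 16, 4                 # toy sizes, not the paper's

# Frozen-backbone hidden states for a query (random stand-in here).
query_hidden = rng.normal(size=(7, d_model))

# Trainable special-token embeddings appended after the query.
# These are the ONLY parameters that get updated.
special_tokens = np.zeros((n_special, d_model))

def embed(special):
    # Toy read-out: each special token sees a pooled query context.
    # In the real model, the frozen LLM attends over the full sequence.
    context = query_hidden.mean(axis=0)
    return special + context               # fixed-length (n_special, d_model)

# Stand-ins for the two training signals described above.
teacher_target = rng.normal(size=(n_special, d_model))   # embedding-teacher distillation
response_target = rng.normal(size=(n_special, d_model))  # LLM's own completion

def loss_and_grad(special):
    e = embed(special)
    loss = np.sum((e - teacher_target) ** 2) + np.sum((e - response_target) ** 2)
    grad = 2 * (e - teacher_target) + 2 * (e - response_target)
    return loss, grad

lr = 0.3
losses = []
for _ in range(50):
    loss, grad = loss_and_grad(special_tokens)
    losses.append(loss)
    special_tokens -= lr * grad            # backbone untouched; only special tokens move

final = embed(special_tokens)
```

With both losses weighted equally, the embedding settles at the midpoint of the two targets; the real method's relative weighting of the completion signal versus the teacher's distillation target is exactly the kind of ablation the paper's framing invites.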

llm2vecgen_main_figure


Honestly, the bit that sticks with me is the fixed-length embedding built from ten thought tokens plus ten compression tokens injected into a frozen LLM, so the model's own upcoming reply becomes the signal being compressed. I'm curious how stable that reconstruction objective is in practice, since it relies on the model's own generation: how much does embedding quality shift with generation hyperparameters, or when the reply goes off topic? The combination with an unsupervised embedding teacher that provides a distillation target seems to be the real driver for aligning to downstream tasks, but an ablation of the teacher versus the reconstruction objective would be nice.


Models citing this paper 11


Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.10913 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.10913 in a Space README.md to link it from this page.

Collections including this paper 7