---
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-ranking
tags:
- reranker
- sequence-classification
- qwen3
- multilingual
- bfloat16
- 32k
base_model: ContextualAI/ctxl-rerank-v2-instruct-multilingual-1b
model_type: qwen3
---
# Contextual AI Reranker v2 1B — **SequenceClassification (single-logit) Converted Model**
This repository contains a **drop-in SequenceClassification** version of the original **ContextualAI/ctxl-rerank-v2-instruct-multilingual-1b**.
It exposes a **single logit per input** (one score) that is **numerically equivalent** to the original model’s last-token **`vocab_id=0`** logit (`next_logits[:, 0]`). That means you can use standard **text-classification/CrossEncoder** tooling for fast, simple reranking—without custom logits processors—while preserving the original scores and ranking order.
> **What changed?** We copy the LM head’s **row 0** vector into a 1-logit classification head (`score.weight ← lm_head.weight[0]`), set bias to 0 (or the matching bias row if present), and keep tokenizer/padding behavior aligned with the original. Result: `SequenceClassification` output ≡ original `next_logits[:, 0]`.
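For illustration, here is a minimal sketch of that conversion, assuming Transformers ≥ 4.51 (which provides `Qwen3ForSequenceClassification`); it is not necessarily the exact script used for this repo, and `OUT_DIR` is a hypothetical output path:

```python
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

BASE_ID = "ContextualAI/ctxl-rerank-v2-instruct-multilingual-1b"
OUT_DIR = "ctxl-rerank-v2-1b-seq-cls"  # illustrative output path

# Load the original CausalLM and a fresh 1-label classification shell
# that shares the same Qwen3 backbone weights.
lm = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
clf = AutoModelForSequenceClassification.from_pretrained(
    BASE_ID, num_labels=1, torch_dtype=torch.bfloat16
)

# Copy row 0 of the LM head (the vocab_id=0 logit) into the single-logit score head.
with torch.no_grad():
    clf.score.weight.copy_(lm.lm_head.weight[0].unsqueeze(0))  # [1, hidden_size]
    if clf.score.bias is not None:  # Qwen3's score head is typically bias-free
        clf.score.bias.zero_()

# Propagate pad/eos IDs so batched pooling reads the correct last token.
tok = AutoTokenizer.from_pretrained(BASE_ID)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
clf.config.pad_token_id = tok.pad_token_id

clf.save_pretrained(OUT_DIR)
tok.save_pretrained(OUT_DIR)
```

Loading the classification shell from the same checkpoint reuses the backbone weights; only the `score` head is newly initialized before the copy.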
---
## Highlights
* **Parity with the original**: The score from this model equals the original `vocab_id=0` logit at the last token position (use the same prompt template and left padding).
* **Frictionless integration**: Works out-of-the-box with **Sentence-Transformers CrossEncoder** and standard **Transformers** classification interfaces.
* **Fast & memory-light**: Computes a single logit (`hidden_size × 1`) instead of a full vocabulary projection.
* **Multilingual** and long-context (inherits capabilities from the base reranker).
---
## Model Overview
* **Type**: Text Reranking (single-logit SequenceClassification)
* **Base**: `ContextualAI/ctxl-rerank-v2-instruct-multilingual-1b` (Qwen3 CausalLM)
* **Languages**: 100+ (inherited)
* **Params**: ~1B (inherited)
* **Context Length**: up to 32K (inherited)
* **Scoring definition**: single logit ≡ original `next_logits[:, 0]`
---
## Input Formatting (keep this template)
```text
Check whether a given document contains information helpful to answer the query.
<Document> {document}
<Query> {query}{optional_instruction} ??
```
* Use **left padding** so the **last token** of every prompt aligns across a batch (a quick check is sketched below).
* If the tokenizer has no `pad_token`, set `pad_token = eos_token`.
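To see the alignment concretely, a minimal sketch (the toy prompts are illustrative; the base model's tokenizer works the same way):

```python
from transformers import AutoTokenizer

MODEL_ID = "sigridjineth/ctxl-rerank-v2-1b-seq-cls"

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
tok.padding_side = "left"

# With left padding, pads go in front, so column -1 of every row holds the
# prompt's true last token, which is exactly where the single logit is read.
batch = tok(["short prompt ??", "a noticeably longer prompt ??"],
            return_tensors="pt", padding=True)
assert (batch["input_ids"][:, -1] != tok.pad_token_id).all()
```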
---
## Updated Usage
Below are **drop-in** examples for the converted model. They mirror the original card's behavior, but through the **SequenceClassification** interface.
### Updated Sentence Transformers Usage (CrossEncoder)
```python
from sentence_transformers import CrossEncoder

MODEL_ID = "sigridjineth/ctxl-rerank-v2-1b-seq-cls"  # or local folder

def format_prompts(query: str, instruction: str, docs: list[str]) -> list[str]:
    inst = f" {instruction}" if instruction else ""
    return [
        "Check whether a given document contains information helpful to answer the query.\n"
        f"<Document> {d}\n"
        f"<Query> {query}{inst} ??"
        for d in docs
    ]

query = "Which is a domestic animal?"
docs = ["Cats are pets.", "The moon is made of cheese.", "Dogs are loyal companions."]

ce = CrossEncoder(MODEL_ID, max_length=8192)
# Ensure original padding behavior: left padding, pad_token defined.
if ce.tokenizer.pad_token is None:
    ce.tokenizer.pad_token = ce.tokenizer.eos_token
ce.tokenizer.padding_side = "left"

prompts = format_prompts(query, "", docs)
scores = ce.predict(prompts)  # one logit per doc (higher = more relevant)

ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
for s, d in ranked:
    print(f"{s:.4f} | {d}")
```
### Updated Transformers Usage (SequenceClassification)
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "sigridjineth/ctxl-rerank-v2-1b-seq-cls"  # or local folder
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

def format_prompts(query: str, instruction: str, docs: list[str]) -> list[str]:
    inst = f" {instruction}" if instruction else ""
    return [
        "Check whether a given document contains information helpful to answer the query.\n"
        f"<Document> {d}\n"
        f"<Query> {query}{inst} ??"
        for d in docs
    ]

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
tok.padding_side = "left"

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, torch_dtype=dtype
).to(device).eval()

query = "Which is a domestic animal?"
docs = ["Cats are pets.", "The moon is made of cheese."]
prompts = format_prompts(query, "", docs)

enc = tok(prompts, return_tensors="pt", padding=True, truncation=True).to(device)
with torch.no_grad():
    logits = model(**enc).logits.squeeze(-1)  # [batch]

# Optional: bf16 round-trip for exact parity with the original BF16 readout
scores = logits.to(torch.bfloat16).float().cpu().tolist()

ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
for s, d in ranked:
    print(f"{s:.4f} | {d}")
```
> **Note on parity**: Casting the output logit to **bf16 then back to float** matches the original card’s BF16 rounding step.
---
## (Reference) Original Transformers Usage (CausalLM)
If you prefer to call the original model directly, read the `vocab_id=0` logit at the last token position (`logits[:, -1, 0]`), as specified in the base card.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE_ID = "ContextualAI/ctxl-rerank-v2-instruct-multilingual-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

def format_prompts(q: str, inst: str, docs: list[str]) -> list[str]:
    inst = f" {inst}" if inst else ""
    return [
        "Check whether a given document contains information helpful to answer the query.\n"
        f"<Document> {d}\n"
        f"<Query> {q}{inst} ??"
        for d in docs
    ]

tok = AutoTokenizer.from_pretrained(BASE_ID, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
tok.padding_side = "left"

lm = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=dtype).to(device).eval()

docs = ["Cats are pets.", "The moon is made of cheese."]
prompts = format_prompts("Which is a domestic animal?", "", docs)

enc = tok(prompts, return_tensors="pt", padding=True, truncation=True).to(device)
with torch.no_grad():
    out = lm(**enc).logits[:, -1, :]  # [batch, vocab]

scores = out[:, 0].to(torch.bfloat16).float().cpu().tolist()
for s, d in sorted(zip(scores, docs), key=lambda x: x[0], reverse=True):
    print(f"{s:.4f} | {d}")
```
---
## Conversion Details
* **Architecture**: `Qwen3ForSequenceClassification(num_labels=1)`
* **Head initialization**:
  * `score.weight ← lm_head.weight[0]` (the row for `vocab_id=0`)
  * `score.bias ← 0` (or the corresponding bias row if the LM head has one)
* **Tokenizer/Config**:
  * Ensure a `pad_token` exists (`pad_token = eos_token` if missing)
  * Set `padding_side="left"`
  * Propagate the `pad`/`eos`/`bos` token IDs into the model `config` for correct batching
* **Parity check**:
  * Verified that the `SequenceClassification` logit ≡ the original `next_logits[:, 0]` (a verification sketch follows below)
  * Optional BF16 round-trip on the score for exact rounding parity
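A minimal verification along these lines, assuming both checkpoints fit in memory (float32 keeps the comparison tight; tolerances below are illustrative):

```python
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

BASE_ID = "ContextualAI/ctxl-rerank-v2-instruct-multilingual-1b"
SEQ_ID = "sigridjineth/ctxl-rerank-v2-1b-seq-cls"

tok = AutoTokenizer.from_pretrained(SEQ_ID, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
tok.padding_side = "left"

lm = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float32).eval()
clf = AutoModelForSequenceClassification.from_pretrained(
    SEQ_ID, torch_dtype=torch.float32
).eval()

prompt = ("Check whether a given document contains information helpful to answer the query.\n"
          "<Document> Cats are pets.\n"
          "<Query> Which is a domestic animal? ??")
enc = tok([prompt], return_tensors="pt")

with torch.no_grad():
    ref = lm(**enc).logits[:, -1, 0]      # original vocab_id=0 logit, last position
    got = clf(**enc).logits.squeeze(-1)   # converted single-logit score

print("max abs diff:", (got - ref).abs().max().item())  # expect ~0
torch.testing.assert_close(got, ref, rtol=1e-5, atol=1e-4)
```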
---
## Intended Use & Limitations
* **Use**: Document reranking for search/QA/multilingual scenarios; batch scoring of `(query, document)` prompts.
* **Not for**: Open-ended generation; the model emits a **single score** per input.
* **License constraints**: Non-commercial & Share-Alike. If you redistribute derivatives, include attribution and the same license.
* **Bias & safety**: Inherits all limitations and potential biases of the base model; evaluate before deployment.
---
## Requirements
* **Transformers** ≥ 4.51.0
* **PyTorch** with BF16 support recommended on GPU
* Long inputs: set `max_length` accordingly, up to the inherited context window (see the sketch below)
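As a sketch of the last point (the placeholder strings stand in for formatted prompts; 32768 reflects the inherited 32K window):

```python
from transformers import AutoTokenizer

MODEL_ID = "sigridjineth/ctxl-rerank-v2-1b-seq-cls"

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
tok.padding_side = "left"

# Raise max_length toward the inherited 32K window when scoring long documents;
# truncation guards against prompts that exceed it.
enc = tok(["<formatted long prompt 1>", "<formatted long prompt 2>"],
          return_tensors="pt", padding=True,
          truncation=True, max_length=32768)
```

For `CrossEncoder`, the limit is passed at construction time, e.g. `CrossEncoder(MODEL_ID, max_length=32768)`.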
---
## Citation
If you use this converted model, please cite the original work:
```bibtex
@misc{ctxl_rerank_v2_instruct_multilingual,
title = {Contextual AI Reranker v2},
author = {George Halal and Sheshansh Agrawal and Bo Han and Arnav Palkhiwala},
year = {2025},
url = {https://contextual.ai/blog/rerank-v2}
}
```
---
## License
This repository follows the original **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license.
You **must** provide attribution, **may not** use it commercially, and **must** distribute derivatives under the same license.
---
## Acknowledgements
All modeling, training, and evaluation credit goes to **Contextual AI** for the original `ctxl-rerank-v2` family.
This repository provides a **compatibility conversion** to a single-logit `SequenceClassification` interface for easier integration and deployment.