---
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-ranking
---

# Contextual AI Reranker v2 6B

## Highlights

Our reranker is on the cost/performance Pareto frontier across 5 key areas:

- Instruction following (including the ability to rank more recent information higher; see the sketch after this list)
- Question answering
- Multilinguality
- Product search / recommendation systems
- Real-world use cases
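To make the first point concrete: an instruction is free-form natural-language text supplied alongside the query. The snippet below is a hypothetical sketch of such inputs; the actual prompt assembly is shown in the Quickstart further down.

```python
# Hypothetical example inputs (illustration only). The instruction is a plain
# string appended after the query; see format_prompts() in the Quickstart.
query = "When is the next earnings call?"
instruction = "Prioritize the most recently published documents"
documents = [
    "The Q2 2024 earnings call was held on May 2, 2024.",
    "The Q1 2025 earnings call is scheduled for February 6, 2025.",
]
```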

For more details on these and other benchmarks, please refer to our [blogpost](https://contextual.ai/blog/rerank-v2).

## Overview

- Model Type: Text Reranking
- Supported Languages: 100+
- Number of Parameters: 6B
- Context Length: up to 32K
- Blogpost: https://contextual.ai/blog/rerank-v2

## Quickstart

### vLLM Usage

Requires `vllm==0.10.0` for NVFP4 or `vllm>=0.8.5` for BF16.

```python
import os

os.environ['VLLM_USE_V1'] = '0'  # the v1 engine doesn't support logits processors yet

import torch
from vllm import LLM, SamplingParams


def logits_processor(_, scores):
    """Custom logits processor for vLLM reranking.

    The relevance score is stored in the logit of token id 0. Reinterpreting
    that bfloat16 logit's bits as a uint16 yields a token id, which is forced
    as the only possible next token so the score survives greedy decoding.
    """
    index = scores[0].view(torch.uint16)
    scores = torch.full_like(scores, float("-inf"))
    scores[index] = 1
    return scores


def format_prompts(query: str, instruction: str, documents: list[str]) -> list[str]:
    """Format query and documents into prompts for reranking."""
    if instruction:
        instruction = f" {instruction}"
    prompts = []
    for doc in documents:
        prompt = f"Check whether a given document contains information helpful to answer the query.\n<Document> {doc}\n<Query> {query}{instruction} ??"
        prompts.append(prompt)
    return prompts


def infer_w_vllm(model_path: str, query: str, instruction: str, documents: list[str]):
    model = LLM(
        model=model_path,
        gpu_memory_utilization=0.85,
        max_model_len=8192,
        dtype="bfloat16",
        max_logprobs=2,
        max_num_batched_tokens=262144,
    )
    sampling_params = SamplingParams(
        temperature=0, max_tokens=1, logits_processors=[logits_processor]
    )

    prompts = format_prompts(query, instruction, documents)
    outputs = model.generate(prompts, sampling_params, use_tqdm=False)

    # Decode each score by reversing the logits-processor trick: the generated
    # token id, viewed as a bfloat16 bit pattern, is the relevance score.
    results = []
    for i, output in enumerate(outputs):
        score = (
            torch.tensor([output.outputs[0].token_ids[0]], dtype=torch.uint16)
            .view(torch.bfloat16)
            .item()
        )
        results.append((score, i, documents[i]))

    # Sort by score (descending)
    results = sorted(results, key=lambda x: x[0], reverse=True)

    print(f"Query: {query}")
    print(f"Instruction: {instruction}")
    for score, doc_id, doc in results:
        print(f"Score: {score:.4f} | Doc: {doc}")
```

### Transformers Usage

Requires `transformers>=4.51.0` for BF16; the NVFP4 checkpoint is not supported with Transformers.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def format_prompts(query: str, instruction: str, documents: list[str]) -> list[str]:
    """Format query and documents into prompts for reranking."""
    if instruction:
        instruction = f" {instruction}"
    prompts = []
    for doc in documents:
        prompt = f"Check whether a given document contains information helpful to answer the query.\n<Document> {doc}\n<Query> {query}{instruction} ??"
        prompts.append(prompt)
    return prompts


def infer_w_hf(model_path: str, query: str, instruction: str, documents: list[str]):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"  # so -1 is the real last token for all prompts

    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype).to(device)
    model.eval()

    prompts = format_prompts(query, instruction, documents)
    enc = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    input_ids = enc["input_ids"].to(device)
    attention_mask = enc["attention_mask"].to(device)

    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=attention_mask)
        next_logits = out.logits[:, -1, :]  # [batch, vocab]
        # The relevance score is the logit of token id 0 at the last position.
        scores_bf16 = next_logits[:, 0].to(torch.bfloat16)
        scores = scores_bf16.float().tolist()

    # Sort by score (descending)
    results = sorted(
        [(s, i, documents[i]) for i, s in enumerate(scores)],
        key=lambda x: x[0],
        reverse=True,
    )

    print(f"Query: {query}")
    print(f"Instruction: {instruction}")
    for score, doc_id, doc in results:
        print(f"Score: {score:.4f} | Doc: {doc}")
```
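Both `infer_w_vllm` and `infer_w_hf` take the same `(model_path, query, instruction, documents)` arguments, so either script above can be driven the same way. Below is a minimal, hypothetical driver, assuming it is appended to one of the two scripts; the repository id is an assumption (substitute the actual model path), and the query, instruction, and documents are illustrative only.

```python
# Hypothetical driver for either script above (assumed repo id; substitute
# the actual model path for this card).
model_path = "ContextualAI/ctxl-rerank-v2-instruct-multilingual-6b"

query = "What are the health benefits of green tea?"
instruction = "Prioritize recent scientific sources"
documents = [
    "Green tea is rich in antioxidants such as EGCG.",
    "Black tea and green tea both come from Camellia sinensis.",
    "A 2023 meta-analysis reported modest cardiovascular benefits of green tea.",
]

# Call whichever backend matches the script this is appended to:
infer_w_vllm(model_path, query, instruction, documents)
# infer_w_hf(model_path, query, instruction, documents)
```

Higher scores mean a document was judged more helpful for answering the query; both functions print results already sorted in descending order of score.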
## Citation

If you use this model, please cite:

```bibtex
@misc{ctxl_rerank_v2_instruct_multilingual,
  title={Contextual AI Reranker v2},
  author={George Halal and Sheshansh Agrawal},
  year={2025},
  url={https://contextual.ai/blog/rerank-v2},
}
```

## License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (`cc-by-nc-sa-4.0`)

## Contact

For questions or issues, please open an issue on the model repository or contact george@contextual.ai.