Poor Text-to-Image Retrieval Accuracy for Fine-Grained Attribute Queries
## Summary
We evaluated SigLIP 2 models (siglip2-base-patch16-224 and siglip2-giant-opt-patch16-384) for text-based person re-identification (ReID) on standard benchmarks including Market-1501 and RSTPReid. While image-to-image retrieval works reasonably well, text-to-image retrieval achieves < 5% Rank-1 accuracy across both datasets, even with semantically correct natural language queries.
## Environment
- transformers version: 4.49.0
- torch version: 2.0+
- Python version: 3.10+
- Hardware: Tesla V100-PCIE-32GB
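For completeness, these versions can be printed with a short snippet:

```python
import sys

import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("python:", sys.version.split()[0])
print("device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu")
```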
## Models Tested
| Model | Parameters | Input Size | Embedding Dim |
|---|---|---|---|
| google/siglip2-base-patch16-224 | 86M | 224×224 | 768 |
| google/siglip2-giant-opt-patch16-384 | 1B | 384×384 | 1536 |
## Code to Reproduce
### Text Embedding Extraction
```python
import torch
from transformers import AutoModel, AutoProcessor

# Load model
model_name = "google/siglip2-giant-opt-patch16-384"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model = model.to("cuda")
model.eval()

# Example natural language queries for person re-identification
queries = [
    # From RSTPReid dataset style
    "A woman with long black hair wearing a white t-shirt and blue jeans",
    "A man in a black jacket carrying a backpack",
    "Young female wearing a red dress with short hair",
    # From Market-1501 attribute style
    "Adult male wearing white sweater and black pants",
    "Female with short hair wearing a blue t-shirt",
    "Person wearing a red shirt and carrying a bag",
    # Simple attribute queries
    "Person wearing red",
    "Person wearing blue",
    "Person with a backpack",
    "Adult male",
    "Adult female",
]

# Extract text embeddings (SigLIP 2 works best with lowercase)
queries_lower = [q.lower() for q in queries]
inputs = processor(text=queries_lower, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    text_features = model.get_text_features(**inputs)

# Normalize embeddings
text_embeddings = text_features / text_features.norm(dim=-1, keepdim=True)

print(f"Text embeddings shape: {text_embeddings.shape}")
# Expected: torch.Size([11, 1536]) for giant model
```
### Image Embedding Extraction (for comparison)
```python
from PIL import Image
import requests

# Load sample image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    image_features = model.get_image_features(**inputs)

image_embeddings = image_features / image_features.norm(dim=-1, keepdim=True)

# Compute similarity
similarity = (text_embeddings @ image_embeddings.T).squeeze()
print("Text-Image Similarities:", similarity.cpu().numpy())
```
## Datasets Used
We evaluated on standard person re-identification benchmarks:
### Market-1501
- Images: 29,419 (1,501 persons, 6 cameras)
- Benchmark: Most widely used ReID dataset
- Text queries: Natural language descriptions of person attributes (clothing color, type, accessories)
### RSTPReid
- Images: 20,505 (4,101 persons, 15 cameras)
- Benchmark: Text-based person retrieval dataset with natural language captions
- Text queries: Human-written descriptions like "A young woman with long hair wearing a white dress and carrying a handbag"
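For both datasets, retrieval is scored against ground-truth person IDs. In Market-1501 these IDs (and camera IDs) are encoded in the image filenames; a minimal parsing sketch, assuming the standard `<person>_c<camera><sequence>_<frame>_<bbox>.jpg` naming convention:

```python
import re
from pathlib import Path

# Market-1501 filenames follow "<person_id>_c<camera><sequence>_<frame>_<bbox>.jpg";
# a person ID of -1 marks junk detections in the gallery.
MARKET_PATTERN = re.compile(r"(-?\d+)_c(\d)")

def parse_market_filename(path):
    """Return (person_id, camera_id) parsed from a Market-1501 image path."""
    person_id, camera_id = MARKET_PATTERN.match(Path(path).name).groups()
    return int(person_id), int(camera_id)

# Example: person 2 captured by camera 1
print(parse_market_filename("bounding_box_test/0002_c1s1_000451_03.jpg"))  # (2, 1)
```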
## Results
### Image-to-Image Retrieval ✅ (Works reasonably well)
| Model | Dataset | mAP | Rank-1 |
|---|---|---|---|
| siglip2-base | Market-1501 | 18.32% | 89.40% |
| siglip2-giant | Market-1501 | 20.11% | 93.41% |
| siglip2-base | RSTPReid | - | - |
| siglip2-giant | RSTPReid | - | - |
### Text-to-Image Retrieval ❌ (Consistently poor)
| Model | Dataset | mAP | Rank-1 |
|---|---|---|---|
| siglip2-base | Market-1501 | 0.64% | 0.64% |
| siglip2-giant | Market-1501 | 0.38% | 0.39% |
| siglip2-base | RSTPReid | 2.67% | 2.50% |
| siglip2-giant | RSTPReid | 2.71% | 2.38% |
Key observations:
- Rank-1 accuracy < 5% across all datasets
- The larger giant model does NOT improve text-to-image accuracy
## Sanity Checks Performed
### 1. Text-Text Similarity (Confirms embeddings are not noise)
```python
# Compute pairwise similarity between text queries
text_similarity = text_embeddings @ text_embeddings.T
```
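The pairs in the tables below were read off this matrix; a small sketch of how such pairs can be ranked (illustrative only, and the full query list is larger than the excerpt shown earlier):

```python
import itertools

# Enumerate all query pairs with their cosine similarity, then sort descending.
pairs = [
    (text_similarity[i, j].item(), queries[i], queries[j])
    for i, j in itertools.combinations(range(len(queries)), 2)
]
pairs.sort(reverse=True)

print("Most similar:", pairs[:3])
print("Least similar:", pairs[-3:])
```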
Most similar pairs (expected semantic similarity):
| Similarity | Query 1 | Query 2 |
|---|---|---|
| 0.9986 | "Person wearing black" | "Person wearing white" |
| 0.9981 | "Person wearing red" | "Person wearing black" |
| 0.9980 | "Adult male" | "Adult female" |
| 0.9980 | "Person wearing red" | "Person wearing blue" |
Least similar pairs:
| Similarity | Query 1 | Query 2 |
|---|---|---|
| 0.5686 | "Female with short hair wearing a blue t-shirt" | "Person wearing sneakers" |
| 0.5625 | "Person wearing a t-shirt" | "Person in formal attire" |
### 2. The Core Problem: Different Queries → Same Rankings
We observed that many semantically different queries produce nearly identical image rankings:
- "Person wearing red" and "Person wearing blue" retrieve almost the same images
- "Adult male" and "Adult female" retrieve almost the same images
- Fine-grained attributes (color, clothing type) are not discriminative
This suggests the text encoder does not ground fine-grained visual attributes properly.
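One way to quantify this is the overlap between the top-k gallery images retrieved by two different queries; a hedged sketch, assuming L2-normalized NumPy embeddings as in the evaluation sketch above (names and indices are illustrative):

```python
import numpy as np

def topk_overlap(query_a, query_b, gallery_emb, k=100):
    """Fraction of the top-k gallery images retrieved by both (normalized) query embeddings."""
    rank_a = np.argsort(-(query_a @ gallery_emb.T))[:k]
    rank_b = np.argsort(-(query_b @ gallery_emb.T))[:k]
    return len(set(rank_a.tolist()) & set(rank_b.tolist())) / k

# e.g. "Person wearing red" (index 6) vs. "Person wearing blue" (index 7):
# text_np = text_embeddings.cpu().numpy()
# print(topk_overlap(text_np[6], text_np[7], gallery_emb, k=100))
```

Consistent with the observations above, contradictory attribute queries return nearly the same top-ranked images.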
Note: We have verified that our evaluation pipeline is correct by testing domain-specific models (IRRA) on the same datasets; they achieve the expected accuracy.
We also tested text-to-image search on ImageNet using siglip2-giant-opt-patch16-384. The accuracy is still much lower than the reported baseline, which again suggests there may be an issue with the text encoder:
- mAP: 19.42%
- Rank-1: 23.40%
- Rank-5: 27.66%
- Rank-10: 38.30%