Poor Text-to-Image Retrieval Accuracy for Fine-Grained Attribute Queries

#17
by zhengthomastang - opened

Summary

We evaluated two SigLIP 2 models (siglip2-base-patch16-224 and siglip2-giant-opt-patch16-384) for text-based person re-identification (ReID) on the standard benchmarks Market-1501 and RSTPReid. While image-to-image retrieval works reasonably well, text-to-image retrieval achieves < 5% Rank-1 accuracy on both datasets, even with semantically correct natural language queries.

Environment

  • transformers version: 4.49.0
  • torch version: 2.0+
  • Python version: 3.10+
  • Hardware: Tesla V100-PCIE-32GB

Models Tested

Model                                  Parameters   Input Size   Embedding Dim
google/siglip2-base-patch16-224        86M          224×224      768
google/siglip2-giant-opt-patch16-384   1B           384×384      1536

Code to Reproduce

Text Embedding Extraction

import torch
from transformers import AutoModel, AutoProcessor

# Load model
model_name = "google/siglip2-giant-opt-patch16-384"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model = model.to("cuda")
model.eval()

# Example natural language queries for person re-identification
queries = [
    # From RSTPReid dataset style
    "A woman with long black hair wearing a white t-shirt and blue jeans",
    "A man in a black jacket carrying a backpack",
    "Young female wearing a red dress with short hair",
    
    # From Market-1501 attribute style  
    "Adult male wearing white sweater and black pants",
    "Female with short hair wearing a blue t-shirt",
    "Person wearing a red shirt and carrying a bag",
    
    # Simple attribute queries
    "Person wearing red",
    "Person wearing blue", 
    "Person with a backpack",
    "Adult male",
    "Adult female",
]

# Extract text embeddings (SigLIP 2 works best with lowercase)
queries_lower = [q.lower() for q in queries]
inputs = processor(text=queries_lower, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    text_features = model.get_text_features(**inputs)
    # Normalize embeddings
    text_embeddings = text_features / text_features.norm(dim=-1, keepdim=True)

print(f"Text embeddings shape: {text_embeddings.shape}")
# Expected: torch.Size([11, 1536]) for giant model

Image Embedding Extraction (for comparison)

from PIL import Image
import requests

# Load sample image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    image_features = model.get_image_features(**inputs)
    image_embeddings = image_features / image_features.norm(dim=-1, keepdim=True)

# Compute similarity
similarity = (text_embeddings @ image_embeddings.T).squeeze()
print("Text-Image Similarities:", similarity.cpu().numpy())

Datasets Used

We evaluated on standard person re-identification benchmarks:

Market-1501

  • Images: 29,419 (1,501 persons, 6 cameras)
  • Benchmark: Most widely used ReID dataset
  • Text queries: Natural language descriptions of person attributes (clothing color, type, accessories)

RSTPReid

  • Images: 20,505 (4,101 persons, 15 cameras)
  • Benchmark: Text-based person retrieval dataset with natural language captions
  • Text queries: Human-written descriptions like "A young woman with long hair wearing a white dress and carrying a handbag"

Results

Image-to-Image Retrieval ✅ (Works reasonably well)

Model           Dataset       mAP      Rank-1
siglip2-base    Market-1501   18.32%   89.40%
siglip2-giant   Market-1501   20.11%   93.41%
siglip2-base    RSTPReid      -        -
siglip2-giant   RSTPReid      -        -

Text-to-Image Retrieval ❌ (Consistently poor)

Model           Dataset       mAP      Rank-1
siglip2-base    Market-1501   0.64%    0.64%
siglip2-giant   Market-1501   0.38%    0.39%
siglip2-base    RSTPReid      2.67%    2.50%
siglip2-giant   RSTPReid      2.71%    2.38%

Key observations:

  • Rank-1 accuracy is < 5% on both datasets
  • The larger giant model does NOT improve text-to-image accuracy
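
For reference, the mAP and Rank-k numbers above are standard retrieval metrics computed from the query-gallery cosine-similarity matrix. The sketch below is a simplified illustration rather than our exact evaluation code: it assumes L2-normalized query_embeddings / gallery_embeddings and person-ID tensors query_pids / gallery_pids on the same device, and it omits the same-camera filtering of the usual Market-1501 protocol.

import torch

def retrieval_metrics(query_embeddings, gallery_embeddings, query_pids, gallery_pids, topk=(1, 5, 10)):
    # Cosine similarity, since both sides are L2-normalized
    sims = query_embeddings @ gallery_embeddings.T                 # (num_queries, num_gallery)
    order = sims.argsort(dim=1, descending=True)                   # gallery indices, best match first
    matches = gallery_pids[order] == query_pids.unsqueeze(1)       # True where the person ID is correct

    # Rank-k: fraction of queries with at least one correct match in the top k
    rank_k = {k: matches[:, :k].any(dim=1).float().mean().item() for k in topk}

    # mAP: average precision per query, averaged over queries
    average_precisions = []
    for row in matches:
        hit_positions = torch.nonzero(row, as_tuple=False).squeeze(1)
        if hit_positions.numel() == 0:
            continue  # query identity absent from the gallery
        precision_at_hits = (
            torch.arange(1, hit_positions.numel() + 1, dtype=torch.float, device=hit_positions.device)
            / (hit_positions + 1).float()
        )
        average_precisions.append(precision_at_hits.mean().item())

    return rank_k, sum(average_precisions) / len(average_precisions)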

Sanity Checks Performed

1. Text-Text Similarity (Confirms embeddings are not noise)

# Compute pairwise similarity between text queries
text_similarity = text_embeddings @ text_embeddings.T
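
The pair tables below can be read directly off the upper triangle of this matrix; a short sketch (reusing text_similarity and queries_lower from the snippets above) looks like this:

# List every distinct query pair together with its cosine similarity
num_queries = text_similarity.shape[0]
pairs = [
    (text_similarity[i, j].item(), queries_lower[i], queries_lower[j])
    for i in range(num_queries)
    for j in range(i + 1, num_queries)
]
pairs.sort(key=lambda p: p[0], reverse=True)

print("Most similar pairs: ", pairs[:4])
print("Least similar pairs:", pairs[-2:])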

Most similar pairs (expected semantic similarity):

Similarity   Query 1                  Query 2
0.9986       "Person wearing black"   "Person wearing white"
0.9981       "Person wearing red"     "Person wearing black"
0.9980       "Adult male"             "Adult female"
0.9980       "Person wearing red"     "Person wearing blue"

Least similar pairs:

Similarity   Query 1                                            Query 2
0.5686       "Female with short hair wearing a blue t-shirt"    "Person wearing sneakers"
0.5625       "Person wearing a t-shirt"                         "Person in formal attire"

2. The Core Problem: Different Queries → Same Rankings

We observed that many semantically different queries produce nearly identical image rankings:

  • "Person wearing red" and "Person wearing blue" retrieve almost the same images
  • "Adult male" and "Adult female" retrieve almost the same images
  • Fine-grained attributes (color, clothing type) are not discriminative

This suggests the text encoder does not ground fine-grained visual attributes properly.
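
A rough way to quantify this collapse is the overlap between the top-k gallery images retrieved by two contrasting queries. The sketch below is illustrative only: it reuses text_embeddings from the query list above (indices 6 and 7 correspond to "person wearing red" and "person wearing blue") and the gallery_embeddings built in the earlier sketch.

def topk_overlap(text_emb_a, text_emb_b, gallery_embeddings, k=100):
    # Fraction of gallery images shared by the top-k results of two different text queries
    top_a = (text_emb_a @ gallery_embeddings.T).topk(k).indices
    top_b = (text_emb_b @ gallery_embeddings.T).topk(k).indices
    return len(set(top_a.tolist()) & set(top_b.tolist())) / k

overlap = topk_overlap(text_embeddings[6], text_embeddings[7], gallery_embeddings)
print(f"Top-100 overlap between 'person wearing red' and 'person wearing blue': {overlap:.2f}")

An overlap close to 1.0 for such pairs is exactly the behavior described above.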


Note: We have verified that our evaluation pipeline is correct by testing a domain-specific model (IRRA), which achieves the expected accuracy on the same datasets.

Tested text-to-image search on ImageNet using siglip2-giant-opt-patch16-384. The accuracy is still much lower than the reported baseline, which suggests there may be an issue with the text encoder.

mAP:      19.42%
Rank-1:   23.40%
Rank-5:   27.66%
Rank-10:  38.30%
