Poor Text-to-Image Retrieval Accuracy for Fine-Grained Attribute Queries
## Summary
We evaluated SigLIP 2 models (siglip2-base-patch16-224 and siglip2-giant-opt-patch16-384) for text-based person re-identification (ReID) on standard benchmarks including Market-1501 and RSTPReid. While image-to-image retrieval works reasonably well, text-to-image retrieval achieves < 5% Rank-1 accuracy across both datasets, even with semantically correct natural language queries.
## Environment
- transformers version: 4.49.0
- torch version: 2.0+
- Python version: 3.10+
- Hardware: Tesla V100-PCIE-32GB
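For completeness, these versions can be printed with a short snippet:

```python
import sys

import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("python:", sys.version.split()[0])
print("device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu")
```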
## Models Tested
| Model | Parameters | Input Size | Embedding Dim |
|---|---|---|---|
| google/siglip2-base-patch16-224 | 86M | 224×224 | 768 |
| google/siglip2-giant-opt-patch16-384 | 1B | 384×384 | 1536 |
## Code to Reproduce
### Text Embedding Extraction
```python
import torch
from transformers import AutoModel, AutoProcessor

# Load model
model_name = "google/siglip2-giant-opt-patch16-384"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model = model.to("cuda")
model.eval()

# Example natural language queries for person re-identification
queries = [
    # From RSTPReid dataset style
    "A woman with long black hair wearing a white t-shirt and blue jeans",
    "A man in a black jacket carrying a backpack",
    "Young female wearing a red dress with short hair",
    # From Market-1501 attribute style
    "Adult male wearing white sweater and black pants",
    "Female with short hair wearing a blue t-shirt",
    "Person wearing a red shirt and carrying a bag",
    # Simple attribute queries
    "Person wearing red",
    "Person wearing blue",
    "Person with a backpack",
    "Adult male",
    "Adult female",
]

# Extract text embeddings (SigLIP 2 works best with lowercase)
queries_lower = [q.lower() for q in queries]
inputs = processor(text=queries_lower, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    text_features = model.get_text_features(**inputs)

# Normalize embeddings
text_embeddings = text_features / text_features.norm(dim=-1, keepdim=True)

print(f"Text embeddings shape: {text_embeddings.shape}")
# Expected: torch.Size([11, 1536]) for giant model
```
### Image Embedding Extraction (for comparison)
```python
from PIL import Image
import requests

# Load sample image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    image_features = model.get_image_features(**inputs)

image_embeddings = image_features / image_features.norm(dim=-1, keepdim=True)

# Compute similarity
similarity = (text_embeddings @ image_embeddings.T).squeeze()
print("Text-Image Similarities:", similarity.cpu().numpy())
```
## Datasets Used
We evaluated on standard person re-identification benchmarks:
### Market-1501
- Images: 29,419 (1,501 persons, 6 cameras)
- Benchmark: Most widely used ReID dataset
- Text queries: Natural language descriptions of person attributes (clothing color, type, accessories)
### RSTPReid
- Images: 20,505 (4,101 persons, 15 cameras)
- Benchmark: Text-based person retrieval dataset with natural language captions
- Text queries: Human-written descriptions like "A young woman with long hair wearing a white dress and carrying a handbag"
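For both datasets, retrieval is scored against ground-truth person IDs. In Market-1501 these IDs (and camera IDs) are encoded in the image filenames; a minimal parsing sketch, assuming the standard `<person>_c<camera><sequence>_<frame>_<bbox>.jpg` naming convention:

```python
import re
from pathlib import Path

# Market-1501 filenames follow "<person_id>_c<camera><sequence>_<frame>_<bbox>.jpg";
# a person ID of -1 marks junk detections in the gallery.
MARKET_PATTERN = re.compile(r"(-?\d+)_c(\d)")

def parse_market_filename(path):
    """Return (person_id, camera_id) parsed from a Market-1501 image path."""
    person_id, camera_id = MARKET_PATTERN.match(Path(path).name).groups()
    return int(person_id), int(camera_id)

# Example: person 2 captured by camera 1
print(parse_market_filename("bounding_box_test/0002_c1s1_000451_03.jpg"))  # (2, 1)
```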
## Results
### Image-to-Image Retrieval ✅ (Works reasonably well)
| Model | Dataset | mAP | Rank-1 |
|---|---|---|---|
| siglip2-base | Market-1501 | 18.32% | 89.40% |
| siglip2-giant | Market-1501 | 20.11% | 93.41% |
| siglip2-base | RSTPReid | - | - |
| siglip2-giant | RSTPReid | - | - |
### Text-to-Image Retrieval ❌ (Consistently poor)
| Model | Dataset | mAP | Rank-1 |
|---|---|---|---|
| siglip2-base | Market-1501 | 0.64% | 0.64% |
| siglip2-giant | Market-1501 | 0.38% | 0.39% |
| siglip2-base | RSTPReid | 2.67% | 2.50% |
| siglip2-giant | RSTPReid | 2.71% | 2.38% |
Key observations:
- Rank-1 accuracy < 5% across all datasets
- The larger giant model does NOT improve text-to-image accuracy
## Sanity Checks Performed
### 1. Text-Text Similarity (Confirms embeddings are not noise)
```python
# Compute pairwise similarity between text queries
text_similarity = text_embeddings @ text_embeddings.T
```
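The pairs in the tables below were read off this matrix; a small sketch of how such pairs can be ranked (illustrative only, and the full query list is larger than the excerpt shown earlier):

```python
import itertools

# Enumerate all query pairs with their cosine similarity, then sort descending.
pairs = [
    (text_similarity[i, j].item(), queries[i], queries[j])
    for i, j in itertools.combinations(range(len(queries)), 2)
]
pairs.sort(reverse=True)

print("Most similar:", pairs[:3])
print("Least similar:", pairs[-3:])
```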
Most similar pairs (expected semantic similarity):
| Similarity | Query 1 | Query 2 |
|---|---|---|
| 0.9986 | "Person wearing black" | "Person wearing white" |
| 0.9981 | "Person wearing red" | "Person wearing black" |
| 0.9980 | "Adult male" | "Adult female" |
| 0.9980 | "Person wearing red" | "Person wearing blue" |
Least similar pairs:
| Similarity | Query 1 | Query 2 |
|---|---|---|
| 0.5686 | "Female with short hair wearing a blue t-shirt" | "Person wearing sneakers" |
| 0.5625 | "Person wearing a t-shirt" | "Person in formal attire" |
### 2. The Core Problem: Different Queries → Same Rankings
We observed that many semantically different queries produce nearly identical image rankings:
- "Person wearing red" and "Person wearing blue" retrieve almost the same images
- "Adult male" and "Adult female" retrieve almost the same images
- Fine-grained attributes (color, clothing type) are not discriminative
This suggests the text encoder does not ground fine-grained visual attributes properly.
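One way to quantify this is the overlap between the top-k gallery images retrieved by two different queries; a hedged sketch, assuming L2-normalized NumPy embeddings as in the evaluation sketch above (names and indices are illustrative):

```python
import numpy as np

def topk_overlap(query_a, query_b, gallery_emb, k=100):
    """Fraction of the top-k gallery images retrieved by both (normalized) query embeddings."""
    rank_a = np.argsort(-(query_a @ gallery_emb.T))[:k]
    rank_b = np.argsort(-(query_b @ gallery_emb.T))[:k]
    return len(set(rank_a.tolist()) & set(rank_b.tolist())) / k

# e.g. "Person wearing red" (index 6) vs. "Person wearing blue" (index 7):
# text_np = text_embeddings.cpu().numpy()
# print(topk_overlap(text_np[6], text_np[7], gallery_emb, k=100))
```

Consistent with the observations above, contradictory attribute queries return nearly the same top-ranked images.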
Note: We have verified that our evaluation pipeline is correct by testing domain-specific models (IRRA) on the same datasets; they achieve the expected accuracy.
We also tested text-to-image search on ImageNet using siglip2-giant-opt-patch16-384. The accuracy is still much lower than the reported baseline, which again suggests there may be an issue with the text encoder:
- mAP: 19.42%
- Rank-1: 23.40%
- Rank-5: 27.66%
- Rank-10: 38.30%