
The MobileCLIP2-S0 checkpoint seems to have issues

#2 by Norod78 - opened

I've prepared sample code that demonstrates the problem. It runs the same inference on either MobileCLIP2-S0 or MobileCLIP2-S3, switched by commenting/uncommenting MODEL_TO_LOAD and MODEL_CHECKPOINT.
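A minimal sketch of such a script (the original notebook isn't reproduced here; this assumes open_clip, torch, and Pillow are installed and the test images sit in a local images/ folder — everything except MODEL_TO_LOAD and MODEL_CHECKPOINT is illustrative):

import torch
import open_clip
from PIL import Image
from pathlib import Path

# Switch models by commenting/uncommenting one pair:
MODEL_TO_LOAD = "MobileCLIP2-S0"
MODEL_CHECKPOINT = "/path/to/mobileclip2_s0.pt"
# MODEL_TO_LOAD = "MobileCLIP2-S3"
# MODEL_CHECKPOINT = "/path/to/mobileclip2_s3.pt"

model, _, preprocess = open_clip.create_model_and_transforms(
    MODEL_TO_LOAD, pretrained=MODEL_CHECKPOINT)
tokenizer = open_clip.get_tokenizer(MODEL_TO_LOAD)
model.eval()

image_paths = sorted(Path("images").glob("*.jpg"))
images = torch.stack(
    [preprocess(Image.open(p).convert("RGB")) for p in image_paths])

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    for text in ("a photo of a dog", "a dog", "dogs"):
        text_features = model.encode_text(tokenizer([text]))
        text_features /= text_features.norm(dim=-1, keepdim=True)
        # Softmax over images, CLIP-style, with the usual 100x logit scale.
        probs = (100.0 * text_features @ image_features.T).softmax(dim=-1)[0]
        print(f"\nText: {text}")
        print("Most similar images:")
        for p, prob in sorted(zip(image_paths, probs.tolist()),
                              key=lambda x: -x[1]):
            print(f"{p.name:<40} {prob:.2%}")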

MobileCLIP2-S3 Results (Correct ✅)

Text: a photo of a dog
Most similar images:
dog_01.jpg                               89.88%
dog_02.jpg                               10.10%
dogs_01.jpg                              0.02%
cat_02.jpg                               0.00%
cat_01.jpg                               0.00%


Text: a dog
Most similar images:
dog_02.jpg                               71.79%
dog_01.jpg                               28.20%
dogs_01.jpg                              0.00%
cat_02.jpg                               0.00%
cat_01.jpg                               0.00%


Text: dogs
Most similar images:
dogs_01.jpg                              99.75%
dog_02.jpg                               0.14%
dog_01.jpg                               0.12%
cats_02.jpg                              0.00%
cats_01.jpg                              0.00%

MobileCLIP2-S0 Results (Wrong ❌)

Text: a photo of a dog
Most similar images:
dog_01.jpg                               47.39%
cat_01.jpg                               23.71%
people_01.jpg                            11.60%
cat_02.jpg                               10.60%
cats_01.jpg                              2.76%


Text: a dog
Most similar images:
dog_01.jpg                               62.96%
cat_01.jpg                               17.35%
cat_02.jpg                               9.95%
people_01.jpg                            5.25%
cats_02.jpg                              1.53%


Text: dogs
Most similar images:
cat_01.jpg                               78.92%
cats_02.jpg                              6.12%
dog_01.jpg                               5.00%
person_01.jpg                            4.43%
people_01.jpg                            3.28%

Note that when this notebook is run locally on a Mac, it can also convert the PyTorch checkpoint into image- and text-encoder ML Packages, and these exhibit the same issue (works well with S3, fails with S0).
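For context, such a conversion typically follows this coremltools pattern (a sketch under assumptions: the wrapper class, the 256×256 input size, and the scale/bias values are illustrative, not the notebook's actual code). The ImageType scale/bias is where the image normalization gets baked into the ML Package, so any mean/std mismatch carries over into the converted encoders:

import coremltools as ct
import open_clip
import torch

model, _, _ = open_clip.create_model_and_transforms(
    "MobileCLIP2-S0", pretrained="/path/to/mobileclip2_s0.pt")

class ImageEncoder(torch.nn.Module):
    # Thin wrapper so torch.jit.trace sees a plain forward().
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, image):
        return self.clip_model.encode_image(image)

encoder = ImageEncoder(model).eval()
example = torch.rand(1, 3, 256, 256)  # assumed input resolution
traced = torch.jit.trace(encoder, example)

# scale/bias encode the normalization; these values assume mean=(0,0,0)
# and std=(1,1,1), i.e. the model expects raw pixels scaled to [0, 1].
mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 256, 256),
                         scale=1 / 255.0, bias=[0.0, 0.0, 0.0])],
    convert_to="mlprogram",
)
mlmodel.save("ImageEncoder.mlpackage")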

Hi @Norod78,
Thanks for reporting this issue. Our S0/S2/B variants need different preprocessing/normalization than our S3/S4/L-14 variants: S0/S2/B use the same normalization as our v1 models, mean=(0, 0, 0) and std=(1, 1, 1), while S3/S4/L-14 use the OpenAI mean/std (OpenCLIP's default normalization).
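Concretely, the two normalizations look like this (the OpenAI values are OpenCLIP's standard OPENAI_DATASET_MEAN/STD constants):

# Identity normalization used by the v1 models and MobileCLIP2-S0/S2/B:
IMAGE_MEAN = (0.0, 0.0, 0.0)
IMAGE_STD = (1.0, 1.0, 1.0)

# Default OpenCLIP / OpenAI normalization used by MobileCLIP2-S3/S4/L-14:
OPENAI_MEAN = (0.48145466, 0.4578275, 0.40821073)
OPENAI_STD = (0.26862954, 0.26130258, 0.27577711)

Passing the v1-style values explicitly when creating the S0 model works around the issue: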

import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP2-S0', pretrained='/path/to/mobileclip2_s0.pt', image_mean=(0, 0, 0), image_std=(1, 1, 1))

@rwightman has now integrated this into OpenCLIP, and the correct preprocessing is loaded automatically when one specifies pretrained="dfndr2b":
https://github.com/mlfoundations/open_clip/blob/13b01ec788c0c706a4d9ba66e301c8793aae0f0f/src/open_clip/pretrained.py#L629-L634
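With that registry entry in place, a load like the following (assuming an open_clip version that includes it) should pick up the correct mean/std without any manual override:

import open_clip

# The pretrained tag carries the right normalization for this variant.
model, _, preprocess = open_clip.create_model_and_transforms(
    "MobileCLIP2-S0", pretrained="dfndr2b")
tokenizer = open_clip.get_tokenizer("MobileCLIP2-S0")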
