Future multimodal support planned? EmbeddingGemma has vision tokens in tokenizer

#6
by AmanPriyanshu - opened

EmbeddingGemma's tokenizer has <start_of_image>, <end_of_image>, and <image_soft_token> even though it's text-only.

Are these placeholders for future multimodal versions? It would be awesome to know if there will be follow-ups in the EmbeddingGemma family.

Hi @AmanPriyanshu ,

Thanks for reaching out to us, and welcome to Google's Gemma family of open-source models. The vision tokens are an inherent characteristic of the model family: all Gemma 3 models use a unified tokenizer that includes vision tokens, even when the model itself cannot make use of every token in the vocabulary.
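For anyone who wants to confirm this locally, here is a minimal sketch of how the vocabulary could be checked for the three tokens mentioned above. The helper name `find_vision_tokens` and the toy vocabulary (including its ids) are made up for illustration. The commented-out lines assume the `transformers` library is installed and that the checkpoint id `google/embeddinggemma-300m` is the one you have access to.

```python
# Vision-related special tokens named in this discussion.
VISION_TOKENS = ("<start_of_image>", "<end_of_image>", "<image_soft_token>")


def find_vision_tokens(vocab):
    """Return {token: id} for each vision token present in a token->id mapping."""
    return {tok: vocab[tok] for tok in VISION_TOKENS if tok in vocab}


# Stand-in vocabulary with made-up ids, so the sketch runs without downloads.
toy_vocab = {
    "<pad>": 0,
    "<start_of_image>": 5,
    "<end_of_image>": 6,
    "<image_soft_token>": 7,
    "hello": 8,
}
found = find_vision_tokens(toy_vocab)
print(found)

# With the real tokenizer (assumption: transformers installed, checkpoint accessible):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("google/embeddinggemma-300m")
# print(find_vision_tokens(tokenizer.get_vocab()))
```

Finding the tokens in the vocabulary only shows that the tokenizer knows them. It says nothing about whether the model can process images.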

To learn more about EmbeddingGemma, please visit the following page.

Thanks.

Excited for it!
Thank you

AmanPriyanshu changed discussion status to closed
