EoMT
This model was released on 2025-03-24 and added to Hugging Face Transformers on 2025-06-27.
Overview
The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.
The abstract from the paper is the following:
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity.
This model was contributed by Yaswanth Gali. The original code can be found here.
Architecture Info
The EoMT model uses a DINOv2-pretrained Vision Transformer with register tokens as its backbone. EoMT simplifies the segmentation pipeline by relying solely on the encoder, eliminating the need for task-specific decoders commonly used in prior approaches.
Architecturally, EoMT introduces a small set of learned queries and a lightweight mask prediction module. These queries are injected into the final encoder blocks, enabling joint attention between image patches and object queries. During training, masked attention is applied to constrain each query to focus on its corresponding region—effectively mimicking cross-attention. This constraint is gradually phased out via a mask annealing strategy, allowing for efficient, decoder-free inference without compromising segmentation performance.
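The interplay between masked attention and mask annealing can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `build_attention_mask`, the token ordering (patches first, then queries), and the per-query coin flip with probability `mask_prob` are all assumptions made for this example.

```python
import random


def build_attention_mask(query_masks, num_patches, mask_prob, rng):
    """Return allowed[i][j], True when token i may attend to token j.

    Assumed token order for this sketch: patch tokens first, then queries.
    query_masks[q][p] is True when query q currently predicts patch p.
    """
    num_queries = len(query_masks)
    total = num_patches + num_queries
    allowed = [[True] * total for _ in range(total)]
    for q, region in enumerate(query_masks):
        # Annealing: the restriction is applied only with probability mask_prob.
        if rng.random() >= mask_prob:
            continue
        # An empty region would leave the query with no patch to attend to.
        if not any(region):
            continue
        row = allowed[num_patches + q]
        for p in range(num_patches):
            row[p] = bool(region[p])
    return allowed


rng = random.Random(0)
masks = [[True, True, False, False], [False, False, True, True]]
early = build_attention_mask(masks, num_patches=4, mask_prob=1.0, rng=rng)
late = build_attention_mask(masks, num_patches=4, mask_prob=0.0, rng=rng)
print(early[4])  # [True, True, False, False, True, True]
print(late[4])   # [True, True, True, True, True, True]
```

With `mask_prob` at 1.0 each query attends only to its own region (plus the other queries); once it has been annealed to 0.0 the attention is unrestricted, which is the situation at inference time.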

The model supports semantic, instance, and panoptic segmentation using a unified architecture and task-specific post-processing.
Usage Examples
Use the Hugging Face implementation of EoMT for inference with pre-trained models.
Semantic Segmentation
The EoMT model performs semantic segmentation using sliding-window inference. The input image is resized such that the shorter side matches the target input size, then it is split into overlapping crops. Each crop is then passed through the model. After inference, the predicted logits from each crop are stitched back together and rescaled to the original image size to get the final segmentation mask.
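The crop-and-stitch procedure described above can be sketched along a single axis. This is a hedged illustration: `crop_windows` and `stitch` are hypothetical helper names, and both the even spacing of crops and the averaging of overlapping logits are assumptions of this sketch rather than a statement of the exact library behavior.

```python
import math


def crop_windows(length, crop_size):
    """Evenly spaced, overlapping (start, end) windows covering [0, length).

    Assumes length >= crop_size, which holds for the longer side after the
    shorter side has been resized to crop_size.
    """
    num_crops = math.ceil(length / crop_size)
    if num_crops == 1:
        return [(0, crop_size)]
    windows = []
    for i in range(num_crops):
        # Integer arithmetic keeps the last window ending exactly at length.
        start = (i * (length - crop_size)) // (num_crops - 1)
        windows.append((start, start + crop_size))
    return windows


def stitch(windows, crop_values, length):
    """Average per-position values wherever crops overlap."""
    total = [0.0] * length
    count = [0] * length
    for (start, _), values in zip(windows, crop_values):
        for offset, value in enumerate(values):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


print(crop_windows(1200, 512))  # [(0, 512), (344, 856), (688, 1200)]

windows = crop_windows(6, 4)    # [(0, 4), (2, 6)]
merged = stitch(windows, [[1.0] * 4, [3.0] * 4], length=6)
print(merged)                   # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

In the real pipeline the stitched values are per-class logits over two spatial dimensions, and the merged map is then rescaled to the original image size.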
Note:
If you want to use a custom target size for semantic segmentation, specify it in the following format:
{"shortest_edge": 512}
Note that longest_edge is intentionally omitted. For semantic segmentation, images are typically scaled so that the shortest edge is greater than or equal to the target size, so longest_edge is not needed.
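To see what a shortest_edge-only size implies for the resized image, the following sketch computes the output shape while preserving the aspect ratio. The helper name `resize_to_shortest_edge` and the rounding rule are assumptions for illustration; the image processor may round differently.

```python
def resize_to_shortest_edge(height, width, shortest_edge):
    """Scale (height, width) so the shorter side equals shortest_edge."""
    if height <= width:
        return shortest_edge, round(width * shortest_edge / height)
    return round(height * shortest_edge / width), shortest_edge


print(resize_to_shortest_edge(480, 640, 512))  # (512, 683)
```

The longer side is left unconstrained, which is why the sliding-window step is needed to cover it with fixed-size crops.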
import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image
from transformers import EomtForUniversalSegmentation, AutoImageProcessor
model_id = "tue-mps/ade20k_semantic_eomt_large_512"
processor = AutoImageProcessor.from_pretrained(model_id)
model = EomtForUniversalSegmentation.from_pretrained(model_id)
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(
    images=image,
    return_tensors="pt",
)
with torch.inference_mode():
    outputs = model(**inputs)
# Prepare the original image size in the format (height, width)
target_sizes = [(image.height, image.width)]
# Post-process the model outputs to get final segmentation prediction
preds = processor.post_process_semantic_segmentation(
    outputs,
    target_sizes=target_sizes,
)
# Visualize the segmentation mask
plt.imshow(preds[0])
plt.axis("off")
plt.title("Semantic Segmentation")
plt.show()
Instance Segmentation
The EoMT model performs instance segmentation using padded inference. The input image is resized so that the longer side matches the target input size, and the shorter side is zero-padded to form a square. The resulting mask and class logits are combined through post-processing (adapted from Mask2Former) to produce a unified instance segmentation map, along with segment metadata like segment id, class labels and confidence scores.
Note:
To use a custom target size, specify the size as a dictionary in the following format:
{"shortest_edge": 512, "longest_edge": 512}
For both instance and panoptic segmentation, input images will be scaled and padded to this target size.
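The scale-then-pad geometry can be worked through with a small helper. This is a sketch under stated assumptions: `padded_shape` is a hypothetical name, and placing the zero padding at the bottom and right is an assumption of the example, not something the text above specifies.

```python
def padded_shape(height, width, target):
    """Resize so the longer side equals target, then pad to a square.

    Returns ((new_height, new_width), (pad_bottom, pad_right)).
    """
    if height >= width:
        new_height, new_width = target, round(width * target / height)
    else:
        new_height, new_width = round(height * target / width), target
    return (new_height, new_width), (target - new_height, target - new_width)


print(padded_shape(480, 640, 640))   # ((480, 640), (160, 0))
print(padded_shape(1000, 500, 640))  # ((640, 320), (0, 320))
```

Because the padded region carries no image content, post-processing has to crop it away before rescaling predictions to the original size, which is why the original (height, width) is passed as target_sizes.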
import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image
from transformers import EomtForUniversalSegmentation, AutoImageProcessor
model_id = "tue-mps/coco_instance_eomt_large_640"
processor = AutoImageProcessor.from_pretrained(model_id)
model = EomtForUniversalSegmentation.from_pretrained(model_id)
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(
    images=image,
    return_tensors="pt",
)
with torch.inference_mode():
    outputs = model(**inputs)
# Prepare the original image size in the format (height, width)
target_sizes = [(image.height, image.width)]
# Post-process the model outputs to get final segmentation prediction
preds = processor.post_process_instance_segmentation(
    outputs,
    target_sizes=target_sizes,
)
# Visualize the segmentation mask
plt.imshow(preds[0]["segmentation"])
plt.axis("off")
plt.title("Instance Segmentation")
plt.show()
Panoptic Segmentation
The EoMT model performs panoptic segmentation using the same padded inference strategy as in instance segmentation. After padding and normalization, the model predicts both thing (instances) and stuff (amorphous regions) classes. The resulting mask and class logits are combined through post-processing (adapted from Mask2Former) to produce a unified panoptic segmentation map, along with segment metadata like segment id, class labels and confidence scores.
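How mask and class predictions can be merged into one segment map is easier to follow on a toy example. The sketch below is a simplified, Mask2Former-style combination written in plain Python; it is not the library's post-processing code. The function name `combine_queries`, the dictionary keys, and the omission of mask binarization and overlap-area filtering are assumptions of this illustration.

```python
def combine_queries(class_probs, mask_probs, threshold=0.8):
    """Merge per-query predictions into one segment map.

    class_probs[q][c]: class probabilities per query; the last class is
    assumed to be the null ("no object") class.
    mask_probs[q][p]: mask probability of query q at flattened pixel p.
    """
    null_class = len(class_probs[0]) - 1
    kept = []
    for query, probs in enumerate(class_probs):
        label = max(range(len(probs)), key=probs.__getitem__)
        # Drop null-class queries and low-confidence queries.
        if label != null_class and probs[label] > threshold:
            kept.append((query, label, probs[label]))

    num_pixels = len(mask_probs[0])
    segmentation = [-1] * num_pixels  # -1 marks unassigned pixels
    for pixel in range(num_pixels):
        best_value = 0.0
        for segment_id, (query, _, score) in enumerate(kept):
            # Each pixel goes to the segment with the highest weighted score.
            value = score * mask_probs[query][pixel]
            if value > best_value:
                best_value = value
                segmentation[pixel] = segment_id

    segments_info = [
        {"id": segment_id, "label_id": label, "score": score}
        for segment_id, (_, label, score) in enumerate(kept)
    ]
    return segmentation, segments_info


class_probs = [[0.9, 0.05, 0.05], [0.1, 0.85, 0.05], [0.1, 0.1, 0.8]]
mask_probs = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.3, 0.9, 0.7], [0.5, 0.5, 0.5, 0.5]]
segmentation, segments_info = combine_queries(class_probs, mask_probs)
print(segmentation)  # [0, 0, 1, 1]
```

The third query is discarded because its most likely class is the null class, so only two segments appear in the result.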
import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image
from transformers import EomtForUniversalSegmentation, AutoImageProcessor
model_id = "tue-mps/coco_panoptic_eomt_large_640"
processor = AutoImageProcessor.from_pretrained(model_id)
model = EomtForUniversalSegmentation.from_pretrained(model_id)
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(
    images=image,
    return_tensors="pt",
)
with torch.inference_mode():
    outputs = model(**inputs)
# Prepare the original image size in the format (height, width)
target_sizes = [(image.height, image.width)]
# Post-process the model outputs to get final segmentation prediction
preds = processor.post_process_panoptic_segmentation(
    outputs,
    target_sizes=target_sizes,
)
# Visualize the panoptic segmentation mask
plt.imshow(preds[0]["segmentation"])
plt.axis("off")
plt.title("Panoptic Segmentation")
plt.show()
EomtImageProcessor
class transformers.EomtImageProcessor
( **kwargs: typing_extensions.Unpack[transformers.models.eomt.image_processing_eomt.EomtImageProcessorKwargs] )
Parameters
- do_split_image (bool, kwargs, optional, defaults to self.do_split_image) — Whether to split the input images into overlapping patches for semantic segmentation. If set to True, the input images will be split into patches of size size["shortest_edge"] with an overlap between patches. Otherwise, the input images will be padded to the target size.
- ignore_index (int, kwargs, optional, defaults to self.ignore_index) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
- **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.
Constructs an EomtImageProcessor image processor.
preprocess
( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: list[torch.Tensor] | None = None instance_id_to_semantic_id: dict[int, int] | None = None **kwargs: typing_extensions.Unpack[transformers.models.eomt.image_processing_eomt.EomtImageProcessorKwargs] ) → ~image_processing_base.BatchFeature
Parameters
- images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- segmentation_maps (ImageInput, optional) — The segmentation maps to preprocess for corresponding images.
- instance_id_to_semantic_id (list[dict[int, int]] or dict[int, int], optional) — A mapping between object instance ids and class ids.
- do_split_image (bool, kwargs, optional, defaults to self.do_split_image) — Whether to split the input images into overlapping patches for semantic segmentation. If set to True, the input images will be split into patches of size size["shortest_edge"] with an overlap between patches. Otherwise, the input images will be padded to the target size.
- ignore_index (int, kwargs, optional, defaults to self.ignore_index) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
- return_tensors (str or TensorType, optional) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
- **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.
Returns
~image_processing_base.BatchFeature
- data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.
post_process_semantic_segmentation
( outputs target_sizes: list size: dict[str, int] | None = None )
Post-processes model outputs into final semantic segmentation prediction.
post_process_instance_segmentation
( outputs target_sizes: list threshold: float = 0.8 size: dict[str, int] | None = None )
Post-processes model outputs into final instance segmentation prediction.
post_process_panoptic_segmentation
( outputs target_sizes: list threshold: float = 0.8 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 stuff_classes: list[int] | None = None size: dict[str, int] | None = None )
Post-processes model outputs into final panoptic segmentation prediction.
EomtImageProcessorPil
class transformers.EomtImageProcessorPil
( **kwargs: typing_extensions.Unpack[transformers.models.eomt.image_processing_eomt.EomtImageProcessorKwargs] )
Parameters
- do_split_image (bool, kwargs, optional, defaults to self.do_split_image) — Whether to split the input images into overlapping patches for semantic segmentation. If set to True, the input images will be split into patches of size size["shortest_edge"] with an overlap between patches. Otherwise, the input images will be padded to the target size.
- ignore_index (int, kwargs, optional, defaults to self.ignore_index) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
- **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.
Constructs an EomtImageProcessor image processor.
preprocess
( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: list[torch.Tensor] | None = None instance_id_to_semantic_id: dict[int, int] | None = None **kwargs: typing_extensions.Unpack[transformers.models.eomt.image_processing_eomt.EomtImageProcessorKwargs] ) → ~image_processing_base.BatchFeature
Parameters
- images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- segmentation_maps (ImageInput, optional) — The segmentation maps to preprocess for corresponding images.
- instance_id_to_semantic_id (list[dict[int, int]] or dict[int, int], optional) — A mapping between object instance ids and class ids.
- do_split_image (bool, kwargs, optional, defaults to self.do_split_image) — Whether to split the input images into overlapping patches for semantic segmentation. If set to True, the input images will be split into patches of size size["shortest_edge"] with an overlap between patches. Otherwise, the input images will be padded to the target size.
- ignore_index (int, kwargs, optional, defaults to self.ignore_index) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
- return_tensors (str or TensorType, optional) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
- **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.
Returns
~image_processing_base.BatchFeature
- data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.
post_process_semantic_segmentation
( outputs target_sizes: list size: dict[str, int] | None = None )
Post-processes model outputs into final semantic segmentation prediction.
post_process_instance_segmentation
( outputs target_sizes: list threshold: float = 0.8 size: dict[str, int] | None = None )
Post-processes model outputs into final instance segmentation prediction.
post_process_panoptic_segmentation
( outputs target_sizes: list threshold: float = 0.8 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 stuff_classes: list[int] | None = None size: dict[str, int] | None = None )
Post-processes model outputs into final panoptic segmentation prediction.
EomtConfig
class transformers.EomtConfig
( output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None tokenizer_class: str | transformers.tokenization_utils_base.PreTrainedTokenizerBase | None = None hidden_size: int = 1024 num_hidden_layers: int = 24 num_attention_heads: int = 16 hidden_act: str = 'gelu' hidden_dropout_prob: float = 0.0 initializer_range: float = 0.02 layer_norm_eps: float = 1e-06 image_size: int | list[int] | tuple[int, int] = 640 patch_size: int | list[int] | tuple[int, int] = 16 num_channels: int = 3 mlp_ratio: int = 4 layerscale_value: float = 1.0 drop_path_rate: float = 0.0 num_upscale_blocks: int = 2 attention_dropout: float | int = 0.0 use_swiglu_ffn: bool = False num_blocks: int = 4 no_object_weight: float = 0.1 class_weight: float = 2.0 mask_weight: float = 5.0 dice_weight: float = 5.0 train_num_points: int = 12544 oversample_ratio: float = 3.0 importance_sample_ratio: float = 0.75 num_queries: int = 200 num_register_tokens: int = 4 )
Parameters
- output_hidden_states (bool, optional, defaults to False) — Whether or not the model should return all hidden-states.
- return_dict (bool, optional, defaults to True) — Whether to return a ModelOutput (dataclass) instead of a plain tuple.
- dtype (Union[str, torch.dtype], optional) — The dtype of the weights. This attribute can be used to initialize the model to a non-default dtype (which is normally float32) and thus allow for optimal storage allocation. For example, if the saved model is float16, ideally we want to load it back using the minimal amount of memory needed to load float16 weights.
- chunk_size_feed_forward (int, optional, defaults to 0) — The chunk size of all feed forward layers in the residual attention blocks. A chunk size of 0 means that the feed forward layer is not chunked. A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time. For more information on feed forward chunking, see How does Feed Forward Chunking work?.
- is_encoder_decoder (bool, optional, defaults to False) — Whether the model is used as an encoder/decoder or not.
- id2label (Union[dict[int, str], dict[str, str]], optional) — A map from index (for instance prediction index, or target index) to label.
- label2id (Union[dict[str, int], dict[str, str]], optional) — A map from label to index for the model.
- problem_type (Literal[regression, single_label_classification, multi_label_classification], optional) — Problem type for XxxForSequenceClassification models. Can be one of "regression", "single_label_classification" or "multi_label_classification".
- tokenizer_class (Union[str, ~tokenization_utils_base.PreTrainedTokenizerBase], optional) — The class name of the model’s tokenizer.
- hidden_size (int, optional, defaults to 1024) — Dimension of the hidden representations.
- num_hidden_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
- hidden_act (str, optional, defaults to gelu) — The non-linear activation function (function or string) in the encoder. For example, "gelu", "relu", "silu", etc.
- hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
- image_size (Union[int, list[int], tuple[int, int]], optional, defaults to 640) — The size (resolution) of each image.
- patch_size (Union[int, list[int], tuple[int, int]], optional, defaults to 16) — The size (resolution) of each patch.
- num_channels (int, optional, defaults to 3) — The number of input channels.
- mlp_ratio (int, optional, defaults to 4) — Ratio of the MLP hidden dim to the embedding dim.
- layerscale_value (float, optional, defaults to 1.0) — Initial value for the LayerScale parameter.
- drop_path_rate (float, optional, defaults to 0.0) — Drop path rate for the patch fusion.
- num_upscale_blocks (int, optional, defaults to 2) — Number of upsampling blocks used in the decoder or segmentation head.
- attention_dropout (Union[float, int], optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- use_swiglu_ffn (bool, optional, defaults to False) — Whether to use the SwiGLU feedforward neural network.
- num_blocks (int, optional, defaults to 4) — Number of feature blocks or stages in the architecture.
- no_object_weight (float, optional, defaults to 0.1) — Loss weight for the ‘no object’ class in panoptic/instance segmentation.
- class_weight (float, optional, defaults to 2.0) — Loss weight for classification targets.
- mask_weight (float, optional, defaults to 5.0) — Loss weight for mask prediction.
- dice_weight (float, optional, defaults to 5.0) — Relative weight of the dice loss in the panoptic segmentation loss.
- train_num_points (int, optional, defaults to 12544) — Number of points to sample for mask loss computation during training.
- oversample_ratio (float, optional, defaults to 3.0) — Oversampling ratio used in point sampling for mask training.
- importance_sample_ratio (float, optional, defaults to 0.75) — Ratio of points to sample based on importance during training.
- num_queries (int, optional, defaults to 200) — Number of object queries in the Transformer.
- num_register_tokens (int, optional, defaults to 4) — Number of learnable register tokens added to the transformer input.
This is the configuration class to store the configuration of an EomtForUniversalSegmentation. It is used to instantiate an EoMT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the tue-mps/coco_panoptic_eomt_large_640 architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
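The default values imply a token budget that is easy to work out by hand. The sketch below does the arithmetic in plain Python without importing the library. Two things are assumptions of this example and not stated in the parameter list: that the backbone prepends a single class token (`num_cls_tokens=1`), and that each upscale block doubles the spatial resolution of the mask features.

```python
def token_budget(
    image_size=640,
    patch_size=16,
    num_register_tokens=4,
    num_queries=200,
    num_upscale_blocks=2,
    num_cls_tokens=1,  # assumption: one class token, as in DINOv2-style ViTs
):
    """Derive sequence lengths from EomtConfig-style default values."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side**2
    prefix_tokens = num_cls_tokens + num_register_tokens
    return {
        "patches_per_side": patches_per_side,
        "num_patches": num_patches,
        # Sequence length in the blocks that run before queries are added.
        "tokens_without_queries": prefix_tokens + num_patches,
        # Sequence length in the final blocks, where queries join the sequence.
        "tokens_with_queries": prefix_tokens + num_patches + num_queries,
        # Assumption: every upscale block doubles the feature-map side.
        "mask_side": patches_per_side * 2**num_upscale_blocks,
    }


budget = token_budget()
print(budget["num_patches"])          # 1600
print(budget["tokens_with_queries"])  # 1805
print(budget["mask_side"])            # 160
```

Under these assumptions, adding the 200 queries lengthens the sequence by roughly 12% in the final blocks only, which is consistent with the overview's point that the queries are a small addition to a plain ViT.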
EomtForUniversalSegmentation
class transformers.EomtForUniversalSegmentation
( config: EomtConfig )
Parameters
- config (EomtConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The EoMT model with a head on top for instance/semantic/panoptic segmentation.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( pixel_values: Tensor mask_labels: list[torch.Tensor] | None = None class_labels: list[torch.Tensor] | None = None patch_offsets: list[torch.Tensor] | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → EomtForUniversalSegmentationOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using EomtImageProcessor. See EomtImageProcessor.__call__() for details (processor_class uses EomtImageProcessor for processing images).
- mask_labels (list[torch.Tensor], optional) — List of mask labels of shape (num_labels, height, width) to be fed to the model.
- class_labels (list[torch.LongTensor], optional) — List of target class labels of shape (num_labels,) to be fed to the model. They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] is class_labels[i][j].
- patch_offsets (list[torch.Tensor], optional) — List of tuples indicating the image index and start and end positions of patches for semantic segmentation.
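The relationship between a segmentation map, instance_id_to_semantic_id, mask_labels, and class_labels can be illustrated with nested lists instead of tensors. This is a sketch only: `to_mask_and_class_labels` is a hypothetical helper, and treating `ignore_index` as an id to skip is an assumption about how ignored pixels are handled.

```python
def to_mask_and_class_labels(segmentation_map, instance_id_to_semantic_id, ignore_index=None):
    """Split an instance-id map into binary masks plus one class id per mask."""
    instance_ids = sorted({value for row in segmentation_map for value in row})
    if ignore_index is not None:
        instance_ids = [i for i in instance_ids if i != ignore_index]

    # One (height, width) binary mask per remaining instance id.
    mask_labels = [
        [[1 if value == instance_id else 0 for value in row] for row in segmentation_map]
        for instance_id in instance_ids
    ]
    # One class id per mask, looked up through the instance-to-class mapping.
    class_labels = [instance_id_to_semantic_id[i] for i in instance_ids]
    return mask_labels, class_labels


segmentation_map = [
    [1, 1, 2],
    [255, 2, 2],
]
mask_labels, class_labels = to_mask_and_class_labels(
    segmentation_map, {1: 17, 2: 17}, ignore_index=255
)
print(class_labels)    # [17, 17]
print(mask_labels[0])  # [[1, 1, 0], [0, 0, 0]]
```

Here two instances share class 17, so mask_labels has two masks while class_labels repeats the same class id: class_labels[j] is the class of mask_labels[j].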
Returns
EomtForUniversalSegmentationOutput or tuple(torch.FloatTensor)
An EomtForUniversalSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (EomtConfig) and inputs.
- loss (torch.Tensor, optional) — The computed loss, returned when labels are present.
- class_queries_logits (torch.FloatTensor, optional, defaults to None) — A tensor of shape (batch_size, num_queries, num_labels + 1) representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.
- masks_queries_logits (torch.FloatTensor, optional, defaults to None) — A tensor of shape (batch_size, num_queries, height, width) representing the proposed masks for each query.
- last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last layer.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states of all layers of the model.
- attentions (tuple(tuple(torch.FloatTensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tuple(torch.FloatTensor) (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Self-attention weights of the transformer encoder.
- patch_offsets (list[torch.Tensor], optional) — List of tuples indicating the image index and start and end positions of patches for semantic segmentation.
The EomtForUniversalSegmentation forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.