

This model was released on 2025-03-24 and added to Hugging Face Transformers on 2025-06-27.

EoMT

PyTorch

Overview

The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. EoMT shows that a plain Vision Transformer can perform image segmentation efficiently without task-specific components.

The abstract from the paper is the following:

Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity.

This model was contributed by Yaswanth Gali. The original code can be found here.

Architecture Info

The EoMT model uses a DINOv2-pretrained Vision Transformer with register tokens as its backbone. EoMT simplifies the segmentation pipeline by relying solely on the encoder, eliminating the need for task-specific decoders commonly used in prior approaches.

Architecturally, EoMT introduces a small set of learned queries and a lightweight mask prediction module. These queries are injected into the final encoder blocks, enabling joint attention between image patches and object queries. During training, masked attention is applied to constrain each query to focus on its corresponding region—effectively mimicking cross-attention. This constraint is gradually phased out via a mask annealing strategy, allowing for efficient, decoder-free inference without compromising segmentation performance.
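The annealing idea can be illustrated with a small sketch. The schedule below is a hypothetical linear decay of the probability of applying masked attention; the paper's exact schedule and per-block timing may differ.

```python
def masked_attention_probability(step: int, total_steps: int) -> float:
    """Illustrative linear mask-annealing schedule (an assumption, not
    the paper's exact schedule).

    Starts at 1.0 (masked attention always applied, mimicking
    cross-attention) and decays to 0.0 (plain joint attention), so the
    trained model can run decoder-free at inference.
    """
    fraction = min(step / total_steps, 1.0)
    return 1.0 - fraction

print(masked_attention_probability(0, 1000))     # start of training
print(masked_attention_probability(1000, 1000))  # end of training
```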


The model supports semantic, instance, and panoptic segmentation using a unified architecture and task-specific post-processing.

Usage Examples

Use the Hugging Face implementation of EoMT for inference with pre-trained models.

Semantic Segmentation

The EoMT model performs semantic segmentation using sliding-window inference. The input image is resized such that the shorter side matches the target input size, then it is split into overlapping crops. Each crop is then passed through the model. After inference, the predicted logits from each crop are stitched back together and rescaled to the original image size to get the final segmentation mask.

Note:
If you want to use a custom target size for semantic segmentation, specify it in the following format:
{"shortest_edge": 512}
Notice that longest_edge is deliberately not provided: for semantic segmentation, images are scaled so that the shortest edge is greater than or equal to the target size, so longest_edge is not needed.
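The shortest-edge rule can be sketched as follows. This is a simplified illustration of the resize step only; the processor's exact rounding and crop logic may differ.

```python
def resize_to_shortest_edge(height: int, width: int, shortest_edge: int) -> tuple[int, int]:
    """Scale an image so its shorter side equals `shortest_edge`,
    preserving the aspect ratio (a sketch; the library's exact
    rounding may differ)."""
    scale = shortest_edge / min(height, width)
    return round(height * scale), round(width * scale)

# A 480x640 image resized for a 512-pixel shortest edge:
print(resize_to_shortest_edge(480, 640, 512))
```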

import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image

from transformers import EomtForUniversalSegmentation, AutoImageProcessor


model_id = "tue-mps/ade20k_semantic_eomt_large_512"
processor = AutoImageProcessor.from_pretrained(model_id)
model = EomtForUniversalSegmentation.from_pretrained(model_id)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(
    images=image,
    return_tensors="pt",
)

with torch.inference_mode():
    outputs = model(**inputs)

# Prepare the original image size in the format (height, width)
target_sizes = [(image.height, image.width)]

# Post-process the model outputs to get final segmentation prediction
preds = processor.post_process_semantic_segmentation(
    outputs,
    target_sizes=target_sizes,
)

# Visualize the segmentation mask
plt.imshow(preds[0])
plt.axis("off")
plt.title("Semantic Segmentation")
plt.show()

Instance Segmentation

The EoMT model performs instance segmentation using padded inference. The input image is resized so that the longer side matches the target input size, and the shorter side is zero-padded to form a square. The resulting mask and class logits are combined through post-processing (adapted from Mask2Former) to produce a unified instance segmentation map, along with segment metadata such as segment ids, class labels, and confidence scores.

Note:
To use a custom target size, specify the size as a dictionary in the following format:
{"shortest_edge": 512, "longest_edge": 512}
For both instance and panoptic segmentation, input images will be scaled and padded to this target size.
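The resize-and-pad rule can be sketched as follows. This is a simplified illustration of the geometry only; the processor's exact rounding and padding placement may differ.

```python
def resize_and_pad_to_square(height: int, width: int, target: int):
    """Resize so the longer side equals `target`, then zero-pad the
    shorter side up to a square (a sketch of padded inference; the
    library's exact rounding and padding placement may differ)."""
    scale = target / max(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    pad_h, pad_w = target - new_h, target - new_w
    return (new_h, new_w), (pad_h, pad_w)

# A 480x640 image prepared for a 640x640 model input:
print(resize_and_pad_to_square(480, 640, 640))
```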

import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image

from transformers import EomtForUniversalSegmentation, AutoImageProcessor


model_id = "tue-mps/coco_instance_eomt_large_640"
processor = AutoImageProcessor.from_pretrained(model_id)
model = EomtForUniversalSegmentation.from_pretrained(model_id)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(
    images=image,
    return_tensors="pt",
)

with torch.inference_mode():
    outputs = model(**inputs)

# Prepare the original image size in the format (height, width)
target_sizes = [(image.height, image.width)]

# Post-process the model outputs to get final segmentation prediction
preds = processor.post_process_instance_segmentation(
    outputs,
    target_sizes=target_sizes,
)

# Visualize the segmentation mask
plt.imshow(preds[0]["segmentation"])
plt.axis("off")
plt.title("Instance Segmentation")
plt.show()
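post_process_instance_segmentation returns, per image, a dict with a segmentation map and a segments_info list. The snippet below inspects that structure using a mocked result with hypothetical values (plain lists stand in for the torch tensors a real run would produce):

```python
# Mocked output mirroring the structure returned by
# post_process_instance_segmentation (values are hypothetical;
# a real run returns torch tensors).
preds = [{
    "segmentation": [[0, 0, 1],
                     [0, 1, 1]],  # per-pixel segment ids
    "segments_info": [
        {"id": 0, "label_id": 15, "score": 0.97},
        {"id": 1, "label_id": 15, "score": 0.91},
    ],
}]

for segment in preds[0]["segments_info"]:
    print(f"segment {segment['id']}: class {segment['label_id']}, "
          f"score {segment['score']:.2f}")
```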

Panoptic Segmentation

The EoMT model performs panoptic segmentation using the same padded inference strategy as in instance segmentation. After padding and normalization, the model predicts both thing (instance) and stuff (amorphous region) classes. The resulting mask and class logits are combined through post-processing (adapted from Mask2Former) to produce a unified panoptic segmentation map, along with segment metadata such as segment ids, class labels, and confidence scores.

import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image

from transformers import EomtForUniversalSegmentation, AutoImageProcessor


model_id = "tue-mps/coco_panoptic_eomt_large_640"
processor = AutoImageProcessor.from_pretrained(model_id)
model = EomtForUniversalSegmentation.from_pretrained(model_id)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(
    images=image,
    return_tensors="pt",
)

with torch.inference_mode():
    outputs = model(**inputs)

# Prepare the original image size in the format (height, width)
target_sizes = [(image.height, image.width)]

# Post-process the model outputs to get final segmentation prediction
preds = processor.post_process_panoptic_segmentation(
    outputs,
    target_sizes=target_sizes,
)

# Visualize the panoptic segmentation mask
plt.imshow(preds[0]["segmentation"])
plt.axis("off")
plt.title("Panoptic Segmentation")
plt.show()
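Each entry in segments_info carries a label_id that can be mapped to a human-readable class name via model.config.id2label. Sketched below with a mocked result and a hypothetical label map:

```python
# Hypothetical label map (a real run would use model.config.id2label)
id2label = {15: "cat", 57: "couch"}

# Mocked panoptic result (values are hypothetical; a real run
# returns torch tensors)
preds = [{
    "segmentation": [[1, 1, 0],
                     [1, 0, 0]],  # per-pixel segment ids
    "segments_info": [
        {"id": 0, "label_id": 15, "score": 0.96},
        {"id": 1, "label_id": 57, "score": 0.88},
    ],
}]

labels = [id2label[s["label_id"]] for s in preds[0]["segments_info"]]
print(labels)  # class names for each predicted segment
```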

EomtImageProcessor

class transformers.EomtImageProcessor


( **kwargs: typing_extensions.Unpack[transformers.models.eomt.image_processing_eomt.EomtImageProcessorKwargs] )

Parameters

  • do_split_image (bool, kwargs, optional, defaults to self.do_split_image) — Whether to split the input images into overlapping patches for semantic segmentation. If set to True, the input images will be split into patches of size size["shortest_edge"] with an overlap between patches. Otherwise, the input images will be padded to the target size.
  • ignore_index (int, kwargs, optional, defaults to self.ignore_index) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
  • **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

Constructs an EomtImageProcessor image processor.

preprocess


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: list[torch.Tensor] | None = None instance_id_to_semantic_id: dict[int, int] | None = None **kwargs: typing_extensions.Unpack[transformers.models.eomt.image_processing_eomt.EomtImageProcessorKwargs] ) ~image_processing_base.BatchFeature

Parameters

  • images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
  • segmentation_maps (ImageInput, optional) — The segmentation maps to preprocess for corresponding images.
  • instance_id_to_semantic_id (list[dict[int, int]] or dict[int, int], optional) — A mapping between object instance ids and class ids.
  • do_split_image (bool, kwargs, optional, defaults to self.do_split_image) — Whether to split the input images into overlapping patches for semantic segmentation. If set to True, the input images will be split into patches of size size["shortest_edge"] with an overlap between patches. Otherwise, the input images will be padded to the target size.
  • ignore_index (int, kwargs, optional, defaults to self.ignore_index) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
  • return_tensors (str or TensorType, optional) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
  • **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

Returns

~image_processing_base.BatchFeature

  • data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
  • tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.

post_process_semantic_segmentation


( outputs target_sizes: list size: dict[str, int] | None = None )

Post-processes model outputs into final semantic segmentation prediction.

post_process_instance_segmentation


( outputs target_sizes: list threshold: float = 0.8 size: dict[str, int] | None = None )

Post-processes model outputs into Instance Segmentation Predictions.

post_process_panoptic_segmentation


( outputs target_sizes: list threshold: float = 0.8 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 stuff_classes: list[int] | None = None size: dict[str, int] | None = None )

Post-processes model outputs into final panoptic segmentation prediction.

EomtImageProcessorPil

class transformers.EomtImageProcessorPil


( **kwargs: typing_extensions.Unpack[transformers.models.eomt.image_processing_eomt.EomtImageProcessorKwargs] )

Parameters

  • do_split_image (bool, kwargs, optional, defaults to self.do_split_image) — Whether to split the input images into overlapping patches for semantic segmentation. If set to True, the input images will be split into patches of size size["shortest_edge"] with an overlap between patches. Otherwise, the input images will be padded to the target size.
  • ignore_index (int, kwargs, optional, defaults to self.ignore_index) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
  • **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

Constructs an EomtImageProcessorPil image processor.

preprocess


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: list[torch.Tensor] | None = None instance_id_to_semantic_id: dict[int, int] | None = None **kwargs: typing_extensions.Unpack[transformers.models.eomt.image_processing_eomt.EomtImageProcessorKwargs] ) ~image_processing_base.BatchFeature

Parameters

  • images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
  • segmentation_maps (ImageInput, optional) — The segmentation maps to preprocess for corresponding images.
  • instance_id_to_semantic_id (list[dict[int, int]] or dict[int, int], optional) — A mapping between object instance ids and class ids.
  • do_split_image (bool, kwargs, optional, defaults to self.do_split_image) — Whether to split the input images into overlapping patches for semantic segmentation. If set to True, the input images will be split into patches of size size["shortest_edge"] with an overlap between patches. Otherwise, the input images will be padded to the target size.
  • ignore_index (int, kwargs, optional, defaults to self.ignore_index) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
  • return_tensors (str or TensorType, optional) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
  • **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

Returns

~image_processing_base.BatchFeature

  • data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
  • tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.

post_process_semantic_segmentation


( outputs target_sizes: list size: dict[str, int] | None = None )

Post-processes model outputs into final semantic segmentation prediction.

post_process_instance_segmentation


( outputs target_sizes: list threshold: float = 0.8 size: dict[str, int] | None = None )

Post-processes model outputs into Instance Segmentation Predictions.

post_process_panoptic_segmentation


( outputs target_sizes: list threshold: float = 0.8 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 stuff_classes: list[int] | None = None size: dict[str, int] | None = None )

Post-processes model outputs into final panoptic segmentation prediction.

EomtConfig

class transformers.EomtConfig


( output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None tokenizer_class: str | transformers.tokenization_utils_base.PreTrainedTokenizerBase | None = None hidden_size: int = 1024 num_hidden_layers: int = 24 num_attention_heads: int = 16 hidden_act: str = 'gelu' hidden_dropout_prob: float = 0.0 initializer_range: float = 0.02 layer_norm_eps: float = 1e-06 image_size: int | list[int] | tuple[int, int] = 640 patch_size: int | list[int] | tuple[int, int] = 16 num_channels: int = 3 mlp_ratio: int = 4 layerscale_value: float = 1.0 drop_path_rate: float = 0.0 num_upscale_blocks: int = 2 attention_dropout: float | int = 0.0 use_swiglu_ffn: bool = False num_blocks: int = 4 no_object_weight: float = 0.1 class_weight: float = 2.0 mask_weight: float = 5.0 dice_weight: float = 5.0 train_num_points: int = 12544 oversample_ratio: float = 3.0 importance_sample_ratio: float = 0.75 num_queries: int = 200 num_register_tokens: int = 4 )

Parameters

  • output_hidden_states (bool, optional, defaults to False) — Whether or not the model should return all hidden-states.
  • return_dict (bool, optional, defaults to True) — Whether to return a ModelOutput (dataclass) instead of a plain tuple.
  • dtype (Union[str, torch.dtype], optional) — The dtype of the weights. This attribute can be used to initialize the model to a non-default dtype (which is normally float32) and thus allow for optimal storage allocation. For example, if the saved model is float16, ideally we want to load it back using the minimal amount of memory needed to load float16 weights.
  • chunk_size_feed_forward (int, optional, defaults to 0) — The chunk size of all feed forward layers in the residual attention blocks. A chunk size of 0 means that the feed forward layer is not chunked. A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time. For more information on feed forward chunking, see How does Feed Forward Chunking work?.
  • is_encoder_decoder (bool, optional, defaults to False) — Whether the model is used as an encoder/decoder or not.
  • id2label (Union[dict[int, str], dict[str, str]], optional) — A map from index (for instance prediction index, or target index) to label.
  • label2id (Union[dict[str, int], dict[str, str]], optional) — A map from label to index for the model.
  • problem_type (Literal[regression, single_label_classification, multi_label_classification], optional) — Problem type for XxxForSequenceClassification models. Can be one of "regression", "single_label_classification" or "multi_label_classification".
  • tokenizer_class (Union[str, ~tokenization_utils_base.PreTrainedTokenizerBase], optional) — The class name of model’s tokenizer.
  • hidden_size (int, optional, defaults to 1024) — Dimension of the hidden representations.
  • num_hidden_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
  • hidden_act (str, optional, defaults to gelu) — The non-linear activation function (function or string) in the encoder. For example, "gelu", "relu", "silu", etc.
  • hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
  • image_size (Union[int, list[int], tuple[int, int]], optional, defaults to 640) — The size (resolution) of each image.
  • patch_size (Union[int, list[int], tuple[int, int]], optional, defaults to 16) — The size (resolution) of each patch.
  • num_channels (int, optional, defaults to 3) — The number of input channels.
  • mlp_ratio (int, optional, defaults to 4) — Ratio of the MLP hidden dim to the embedding dim.
  • layerscale_value (float, optional, defaults to 1.0) — Initial value for the LayerScale parameter.
  • drop_path_rate (float, optional, defaults to 0.0) — Drop path rate for the patch fusion.
  • num_upscale_blocks (int, optional, defaults to 2) — Number of upsampling blocks used in the decoder or segmentation head.
  • attention_dropout (Union[float, int], optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
  • use_swiglu_ffn (bool, optional, defaults to False) — Whether to use the SwiGLU feedforward neural network.
  • num_blocks (int, optional, defaults to 4) — Number of feature blocks or stages in the architecture.
  • no_object_weight (float, optional, defaults to 0.1) — Loss weight for the ‘no object’ class in panoptic/instance segmentation.
  • class_weight (float, optional, defaults to 2.0) — Loss weight for classification targets.
  • mask_weight (float, optional, defaults to 5.0) — Loss weight for mask prediction.
  • dice_weight (float, optional, defaults to 5.0) — Relative weight of the dice loss in the panoptic segmentation loss.
  • train_num_points (int, optional, defaults to 12544) — Number of points to sample for mask loss computation during training.
  • oversample_ratio (float, optional, defaults to 3.0) — Oversampling ratio used in point sampling for mask training.
  • importance_sample_ratio (float, optional, defaults to 0.75) — Ratio of points to sample based on importance during training.
  • num_queries (int, optional, defaults to 200) — Number of object queries in the Transformer.
  • num_register_tokens (int, optional, defaults to 4) — Number of learnable register tokens added to the transformer input.

This is the configuration class to store the configuration of an EomtModel. It is used to instantiate an EoMT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of tue-mps/coco_panoptic_eomt_large_640.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import EomtConfig, EomtForUniversalSegmentation

>>> # Initialize configuration
>>> config = EomtConfig()

>>> # Initialize model
>>> model = EomtForUniversalSegmentation(config)

>>> # Access config
>>> config = model.config

EomtForUniversalSegmentation

class transformers.EomtForUniversalSegmentation


( config: EomtConfig )

Parameters

  • config (EomtConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The EoMT Model with head on top for instance/semantic/panoptic segmentation.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( pixel_values: Tensor mask_labels: list[torch.Tensor] | None = None class_labels: list[torch.Tensor] | None = None patch_offsets: list[torch.Tensor] | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) EomtForUniversalSegmentationOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using EomtImageProcessor. See EomtImageProcessor.__call__() for details.
  • mask_labels (list[torch.Tensor], optional) — List of mask labels of shape (num_labels, height, width) to be fed to a model.
  • class_labels (list[torch.LongTensor], optional) — List of target class labels of shape (num_labels,) to be fed to a model. They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] is class_labels[i][j].
  • patch_offsets (list[torch.Tensor], optional) — list of tuples indicating the image index and start and end positions of patches for semantic segmentation.

Returns

EomtForUniversalSegmentationOutput or tuple(torch.FloatTensor)

A EomtForUniversalSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (EomtConfig) and inputs.

The EomtForUniversalSegmentation forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

  • loss (torch.Tensor, optional) — The computed loss, returned when labels are present.
  • class_queries_logits (torch.FloatTensor, optional, defaults to None) — A tensor of shape (batch_size, num_queries, num_labels + 1) representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.
  • masks_queries_logits (torch.FloatTensor, optional, defaults to None) — A tensor of shape (batch_size, num_queries, height, width) representing the proposed masks for each query.
  • last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last layer.
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer.
  • attentions (tuple(tuple(torch.FloatTensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tuple(torch.FloatTensor) (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Self- and cross-attention weights from the Transformer decoder.
  • patch_offsets (list[torch.Tensor], optional) — list of tuples indicating the image index and start and end positions of patches for semantic segmentation.