EfficientFormer
Overview
The EfficientFormer model was proposed in EfficientFormer: Vision Transformers at MobileNet Speed by Yanyu Li, Geng Yuan, Yang Wen, Eric Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. EfficientFormer proposes a dimension-consistent pure transformer that can run on mobile devices for tasks such as image classification, object detection, and semantic segmentation.
The abstract from the paper is the following:
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
This model was contributed by novice03 and Bearnardd. The original code can be found here. The TensorFlow version of this model was added by D-Roberts.
Documentation resources
EfficientFormerConfig
class transformers.EfficientFormerConfig
< source >( depths: List = [3, 2, 6, 4] hidden_sizes: List = [48, 96, 224, 448] downsamples: List = [True, True, True, True] dim: int = 448 key_dim: int = 32 attention_ratio: int = 4 resolution: int = 7 num_hidden_layers: int = 5 num_attention_heads: int = 8 mlp_expansion_ratio: int = 4 hidden_dropout_prob: float = 0.0 patch_size: int = 16 num_channels: int = 3 pool_size: int = 3 downsample_patch_size: int = 3 downsample_stride: int = 2 downsample_pad: int = 1 drop_path_rate: float = 0.0 num_meta3d_blocks: int = 1 distillation: bool = True use_layer_scale: bool = True layer_scale_init_value: float = 1e-05 hidden_act: str = 'gelu' initializer_range: float = 0.02 layer_norm_eps: float = 1e-12 image_size: int = 224 batch_norm_eps: float = 1e-05 **kwargs )
Parameters
-  depths (List(int), optional, defaults to [3, 2, 6, 4]) — Depth of each stage.
-  hidden_sizes (List(int), optional, defaults to [48, 96, 224, 448]) — Dimensionality of each stage.
-  downsamples (List(bool), optional, defaults to [True, True, True, True]) — Whether or not to downsample inputs between two stages.
-  dim (int, optional, defaults to 448) — Number of channels in Meta3D layers
-  key_dim (int, optional, defaults to 32) — The size of the key in meta3D block.
-  attention_ratio (int, optional, defaults to 4) — Ratio of the dimension of the query and value to the dimension of the key in MSHA block
-  resolution (int, optional, defaults to 7) — Size of each patch
-  num_hidden_layers (int, optional, defaults to 5) — Number of hidden layers in the Transformer encoder.
-  num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the 3D MetaBlock.
-  mlp_expansion_ratio (int, optional, defaults to 4) — Ratio of size of the hidden dimensionality of an MLP to the dimensionality of its input.
-  hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings and encoder.
-  patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
-  num_channels (int, optional, defaults to 3) — The number of input channels.
-  pool_size (int, optional, defaults to 3) — Kernel size of pooling layers.
-  downsample_patch_size (int, optional, defaults to 3) — The size of patches in downsampling layers.
-  downsample_stride (int, optional, defaults to 2) — The stride of convolution kernels in downsampling layers.
-  downsample_pad (int, optional, defaults to 1) — Padding in downsampling layers.
-  drop_path_rate (float, optional, defaults to 0.0) — Rate at which to increase dropout probability in DropPath.
-  num_meta3d_blocks (int, optional, defaults to 1) — The number of 3D MetaBlocks in the last stage.
-  distillation (bool, optional, defaults to True) — Whether to add a distillation head.
-  use_layer_scale (bool, optional, defaults to True) — Whether to scale outputs from token mixers.
-  layer_scale_init_value (float, optional, defaults to 1e-5) — Factor by which outputs from token mixers are scaled.
-  hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
-  initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-  layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
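-  image_size (int, optional, defaults to 224) — The size (resolution) of each image.
-  batch_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the batch normalization layers.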
This is the configuration class to store the configuration of an EfficientFormerModel. It is used to instantiate an EfficientFormer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the EfficientFormer snap-research/efficientformer-l1 architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
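The default values above fit together arithmetically. As a rough sketch, assuming the patch embedding (stem) reduces the spatial size by a factor of 4 and that three stride-2 downsampling convolutions follow (both are assumptions about the architecture, not stated in this reference), the default image_size of 224 shrinks to a 7×7 grid, which matches the default resolution of 7. The helper names below are hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int, pad: int) -> int:
    """Standard output-size formula for a strided convolution."""
    return (size + 2 * pad - kernel) // stride + 1


def final_grid_side(
    image_size: int = 224,
    stem_reduction: int = 4,   # assumption: the stem downsamples by 4x
    num_downsamples: int = 3,  # assumption: one downsample between consecutive stages
    downsample_patch_size: int = 3,
    downsample_stride: int = 2,
    downsample_pad: int = 1,
) -> int:
    """Side length of the feature grid after the stem and all downsampling layers."""
    side = image_size // stem_reduction
    for _ in range(num_downsamples):
        side = conv_output_size(
            side, downsample_patch_size, downsample_stride, downsample_pad
        )
    return side


side = final_grid_side()
sequence_length = side * side
print(side, sequence_length)  # 7 49
```

With 49 positions and a last-stage width of hidden_sizes[-1] = 448, this is consistent with the [1, 49, 448] output shape shown in the EfficientFormerModel example.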
Example:
>>> from transformers import EfficientFormerConfig, EfficientFormerModel
>>> # Initializing an EfficientFormer efficientformer-l1 style configuration
>>> configuration = EfficientFormerConfig()
>>> # Initializing an EfficientFormerModel (with random weights) from the efficientformer-l1 style configuration
>>> model = EfficientFormerModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
EfficientFormerImageProcessor
class transformers.EfficientFormerImageProcessor
< source >( do_resize: bool = True size: Optional = None resample: Resampling = <Resampling.BICUBIC: 3> do_center_crop: bool = True do_rescale: bool = True rescale_factor: Union = 0.00392156862745098 crop_size: Dict = None do_normalize: bool = True image_mean: Union = None image_std: Union = None **kwargs )
Parameters
-  do_resize (bool, optional, defaults to True) — Whether to resize the image’s (height, width) dimensions to the specified (size["height"], size["width"]). Can be overridden by the do_resize parameter in the preprocess method.
-  size (dict, optional, defaults to {"height": 224, "width": 224}) — Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.
-  resample (PILImageResampling, optional, defaults to PILImageResampling.BICUBIC) — Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.
-  do_center_crop (bool, optional, defaults to True) — Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.
-  crop_size (Dict[str, int], optional, defaults to 224) — Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.
-  do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
-  rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.
-  do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
-  image_mean (float or List[float], optional, defaults to IMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
-  image_std (float or List[float], optional, defaults to IMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
Constructs an EfficientFormer image processor.
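As a minimal sketch of what the do_rescale and do_normalize steps compute for a single pixel, assuming rescaling is applied before normalization: each channel value is multiplied by rescale_factor, then shifted by the channel mean and divided by the channel standard deviation. The function name and the mean and standard deviation values of 0.5 are illustrative assumptions, not the processor's actual constants.

```python
from typing import List, Sequence


def rescale_and_normalize(
    pixel: Sequence[float],
    image_mean: Sequence[float],
    image_std: Sequence[float],
    rescale_factor: float = 1 / 255,
) -> List[float]:
    """Rescale one pixel's channel values, then normalize each channel."""
    rescaled = [value * rescale_factor for value in pixel]
    return [
        (value - mean) / std
        for value, mean, std in zip(rescaled, image_mean, image_std)
    ]


# Illustrative statistics only (hypothetical values chosen for easy arithmetic).
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]
normalized = rescale_and_normalize([255, 0, 127.5], mean, std)
print(normalized)  # approximately [1.0, -1.0, 0.0]
```

If the input pixel values already lie between 0 and 1, the rescaling step would be skipped, which corresponds to passing do_rescale=False.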
preprocess
< source >( images: Union do_resize: Optional = None size: Dict = None resample: Resampling = None do_center_crop: bool = None crop_size: int = None do_rescale: Optional = None rescale_factor: Optional = None do_normalize: Optional = None image_mean: Union = None image_std: Union = None return_tensors: Union = None data_format: Union = <ChannelDimension.FIRST: 'channels_first'> input_data_format: Union = None **kwargs )
Parameters
-  images (ImageInput) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
-  do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
-  size (Dict[str, int], optional, defaults to self.size) — Dictionary in the format {"height": h, "width": w} specifying the size of the output image after resizing.
-  resample (PILImageResampling filter, optional, defaults to self.resample) — PILImageResampling filter to use if resizing the image, e.g. PILImageResampling.BILINEAR. Only has an effect if do_resize is set to True.
-  do_center_crop (bool, optional, defaults to self.do_center_crop) — Whether to center crop the image.
-  do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image values to the range [0, 1].
-  rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
-  crop_size (Dict[str, int], optional, defaults to self.crop_size) — Size of the center crop. Only has an effect if do_center_crop is set to True.
-  do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
-  image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to use if do_normalize is set to True.
-  image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to use if do_normalize is set to True.
-  return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
    - Unset: Return a list of np.ndarray.
    - TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
    - TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
    - TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
    - TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
-  data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
    - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    - Unset: Use the channel dimension format of the input image.
-  input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    - "none" or ChannelDimension.NONE: image in (height, width) format.
Preprocess an image or batch of images.
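The two data_format layouts differ only in axis order. The sketch below converts a tiny image from (height, width, num_channels) to (num_channels, height, width) using nested lists; to_channels_first is a hypothetical helper for illustration, not part of the library.

```python
from typing import List

Image = List[List[List[int]]]


def to_channels_first(image_hwc: Image) -> Image:
    """Reorder a (height, width, num_channels) image to (num_channels, height, width)."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(image_hwc[0][0])
    return [
        [[image_hwc[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A 2x2 RGB image in channels_last layout.
image_hwc = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
image_chw = to_channels_first(image_hwc)
print(image_chw[0])  # first (red) channel: [[1, 4], [7, 10]]
```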
EfficientFormerModel
class transformers.EfficientFormerModel
< source >( config: EfficientFormerConfig )
Parameters
- config (EfficientFormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare EfficientFormer Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None  ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
-  pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See EfficientFormerImageProcessor.preprocess() for details.
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (EfficientFormerConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The EfficientFormerModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, EfficientFormerModel
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = EfficientFormerModel.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 49, 448]
EfficientFormerForImageClassification
class transformers.EfficientFormerForImageClassification
< source >( config: EfficientFormerConfig )
Parameters
- config (EfficientFormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
EfficientFormer Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet.
This model is a PyTorch nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None  ) → transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)
Parameters
-  pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See EfficientFormerImageProcessor.preprocess() for details.
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
-  labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss). If config.num_labels > 1, a classification loss is computed (Cross-Entropy).
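As a sketch of the rule described for labels, the snippet below computes a squared-error loss when there is a single output and a cross-entropy loss otherwise, for one example. It follows only the two cases stated above; the function name is hypothetical, and the library's own loss code may differ in detail (for instance in batching and reduction).

```python
import math
from typing import Sequence


def example_loss(logits: Sequence[float], label: float) -> float:
    """Loss for a single example, chosen by the number of outputs."""
    if len(logits) == 1:
        # Regression: squared error between the single score and the target.
        return (logits[0] - label) ** 2
    # Classification: cross-entropy, i.e. negative log-softmax at the label index.
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[int(label)]


print(example_loss([2.5], 2.0))     # 0.25
print(example_loss([0.0, 0.0], 1))  # ln(2), about 0.693
```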
Returns
transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.ImageClassifierOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (EfficientFormerConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The EfficientFormerForImageClassification forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
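Because the returned logits are scores before SoftMax, turning them into probabilities and a label is left to the caller. A minimal sketch in plain Python follows; the three-entry id2label mapping and the logit values are made up for illustration and do not come from a real checkpoint.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label mapping and scores for a three-class model.
id2label: Dict[int, str] = {0: "tabby cat", 1: "Egyptian cat", 2: "lynx"}
logits = [1.0, 3.0, 0.5]

probabilities = softmax(logits)
predicted_index = max(range(len(logits)), key=lambda i: logits[i])
print(id2label[predicted_index])  # Egyptian cat
```

Taking the argmax of the logits or of the probabilities gives the same index, since softmax preserves ordering.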
Example:
>>> from transformers import AutoImageProcessor, EfficientFormerForImageClassification
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = EfficientFormerForImageClassification.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
Egyptian cat
EfficientFormerForImageClassificationWithTeacher
class transformers.EfficientFormerForImageClassificationWithTeacher
< source >( config: EfficientFormerConfig )
Parameters
- config (EfficientFormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
EfficientFormer Model transformer with image classification heads on top (a linear layer on top of the final hidden state of the [CLS] token and a linear layer on top of the final hidden state of the distillation token) e.g. for ImageNet.
This model supports inference-only. Fine-tuning with distillation (i.e. with a teacher) is not yet supported.
This model is a PyTorch nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None  ) → transformers.models.efficientformer.modeling_efficientformer.EfficientFormerForImageClassificationWithTeacherOutput or tuple(torch.FloatTensor)
Parameters
-  pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See EfficientFormerImageProcessor.preprocess() for details.
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.models.efficientformer.modeling_efficientformer.EfficientFormerForImageClassificationWithTeacherOutput or tuple(torch.FloatTensor)
A transformers.models.efficientformer.modeling_efficientformer.EfficientFormerForImageClassificationWithTeacherOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (EfficientFormerConfig) and inputs.
- logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Prediction scores as the average of the cls_logits and distillation_logits.
- cls_logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the class token).
- distillation_logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the distillation token).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The EfficientFormerForImageClassificationWithTeacher forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
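The combined logits output is described above as the average of cls_logits and distillation_logits. A minimal sketch of that element-wise average for one example, with made-up scores and a hypothetical helper name:

```python
from typing import List, Sequence


def average_logits(
    cls_logits: Sequence[float], distillation_logits: Sequence[float]
) -> List[float]:
    """Element-wise mean of the two heads' prediction scores."""
    return [(c + d) / 2 for c, d in zip(cls_logits, distillation_logits)]


# Illustrative scores for a three-class problem.
cls_logits = [1.0, 3.0, -2.0]
distillation_logits = [3.0, 5.0, 0.0]
logits = average_logits(cls_logits, distillation_logits)
print(logits)  # [2.0, 4.0, -1.0]
```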
Example:
>>> from transformers import AutoImageProcessor, EfficientFormerForImageClassificationWithTeacher
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = EfficientFormerForImageClassificationWithTeacher.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
Egyptian cat
TFEfficientFormerModel
class transformers.TFEfficientFormerModel
< source >( config: EfficientFormerConfig **kwargs )
Parameters
- config (EfficientFormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare EfficientFormer Model transformer outputting raw hidden-states without any specific head on top. This model is a TensorFlow keras.layers.Layer. Use it as a regular TensorFlow Module and refer to the TensorFlow documentation for all matters related to general usage and behavior.
call
< source >( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None training: bool = False  ) → transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)
Parameters
-  pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See EfficientFormerImageProcessor.__call__() for details.
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or a tuple of tf.Tensor (if
return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the
configuration (EfficientFormerConfig) and inputs.
- last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (tf.Tensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. This output is usually not a good summary of the semantic content of the input; you’re often better off averaging or pooling the sequence of hidden-states for the whole input sequence.
- hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFEfficientFormerModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, TFEfficientFormerModel
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = TFEfficientFormerModel.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 49, 448]
TFEfficientFormerForImageClassification
class transformers.TFEfficientFormerForImageClassification
< source >( config: EfficientFormerConfig )
Parameters
- config (EfficientFormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
EfficientFormer Model transformer with an image classification head on top of pooled last hidden state, e.g. for ImageNet.
This model is a TensorFlow keras.layers.Layer. Use it as a regular TensorFlow Module and refer to the TensorFlow documentation for all matters related to general usage and behavior.
call
< source >( pixel_values: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None training: bool = False  ) → transformers.modeling_tf_outputs.TFImageClassifierOutput or tuple(tf.Tensor)
Parameters
- pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See EfficientFormerImageProcessor.__call__() for details.
- output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- labels (tf.Tensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss); if config.num_labels > 1, a classification loss is computed (Cross-Entropy).
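The rule for labels can be illustrated with a minimal, framework-free sketch for a single example. The helper name `classification_or_regression_loss` is hypothetical and is not part of transformers; it only mirrors the documented behavior (squared error when num_labels is 1, cross-entropy otherwise) and does not reproduce the library's implementation.

```python
import math


def classification_or_regression_loss(logits, label, num_labels):
    """Hypothetical helper: choose the loss the way the labels description states."""
    if num_labels == 1:
        # Regression: squared error between the single score and the target.
        return (logits[0] - label) ** 2
    # Classification: cross-entropy = log-sum-exp(logits) - logit of the true class.
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


# Made-up scores and targets (not real model output).
regression_loss = classification_or_regression_loss([2.5], 3.0, num_labels=1)
classification_loss = classification_or_regression_loss([2.0, 0.5, -1.0], 0, num_labels=3)
```

For a batch, the per-example values would be averaged; the sketch leaves that step out to keep the branching visible.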
Returns
transformers.modeling_tf_outputs.TFImageClassifierOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFImageClassifierOutput or a tuple of tf.Tensor (if
return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the
configuration (EfficientFormerConfig) and inputs.
- loss (tf.Tensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- logits (tf.Tensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
- attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
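Because the logits are scores before SoftMax, they usually need to be normalized before they can be read as probabilities. The following is a minimal sketch of that step in plain Python; the `softmax` helper and the example values are illustrative assumptions, not library code or real model output.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one image and three classes (not real model output).
example_logits = [2.0, 1.0, 0.1]
probabilities = softmax(example_logits)

# The predicted class is the index with the highest probability,
# which is also the index of the highest logit.
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
```

With TensorFlow tensors, the same normalization is typically done with tf.nn.softmax along the last axis.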
The TFEfficientFormerForImageClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module
instance rather than this method directly, since calling the instance runs the pre- and post-processing steps while
calling the method directly silently skips them.
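The difference between calling the instance and calling the forward method directly can be sketched with a toy class. `TinyLayer` and its logging are hypothetical and only stand in for the general pattern; this is not how Keras implements its layers.

```python
class TinyLayer:
    """Hypothetical stand-in for a layer; not the Keras implementation."""

    def __init__(self):
        self.log = []

    def call(self, x):
        # The "recipe" for the forward pass lives here.
        self.log.append("call")
        return x * 2

    def __call__(self, x):
        # Calling the instance wraps the recipe with extra steps.
        self.log.append("pre-processing")
        result = self.call(x)
        self.log.append("post-processing")
        return result


layer = TinyLayer()
via_instance = layer(3)  # runs pre-processing, call, post-processing
steps_via_instance = list(layer.log)

layer.log.clear()
via_method = layer.call(3)  # runs only the recipe; the wrapper steps are skipped
steps_via_method = list(layer.log)
```

Both paths return the same value here, which is why skipping the wrapper steps can go unnoticed.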
Example:
>>> from transformers import AutoImageProcessor, TFEfficientFormerForImageClassification
>>> import tensorflow as tf
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = TFEfficientFormerForImageClassification.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="tf")
>>> logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = int(tf.math.argmax(logits, axis=-1))
>>> print(model.config.id2label[predicted_label])
LABEL_281
TFEfficientFormerForImageClassificationWithTeacher
class transformers.TFEfficientFormerForImageClassificationWithTeacher
< source >( config: EfficientFormerConfig )
Parameters
- config (EfficientFormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
EfficientFormer Model transformer with image classification heads on top (a linear layer on top of the final hidden state and a linear layer on top of the final hidden state of the distillation token) e.g. for ImageNet.
Warning: This model supports inference only. Fine-tuning with distillation (i.e. with a teacher) is not yet supported.
This model is a TensorFlow keras.layers.Layer. Use it as a regular TensorFlow Module and refer to the TensorFlow documentation for all matters related to general usage and behavior.
call
< source >( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None training: bool = False  ) → transformers.models.efficientformer.modeling_tf_efficientformer.TFEfficientFormerForImageClassificationWithTeacherOutput or tuple(tf.Tensor)
Parameters
- pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See EfficientFormerImageProcessor.__call__() for details.
- output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.models.efficientformer.modeling_tf_efficientformer.TFEfficientFormerForImageClassificationWithTeacherOutput or tuple(tf.Tensor)
A transformers.models.efficientformer.modeling_tf_efficientformer.TFEfficientFormerForImageClassificationWithTeacherOutput or a tuple of tf.Tensor (if
return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the
configuration (EfficientFormerConfig) and inputs.
The TFEfficientFormerForImageClassificationWithTeacher forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module
instance rather than this method directly, since calling the instance runs the pre- and post-processing steps while
calling the method directly silently skips them.
Output type of EfficientFormerForImageClassificationWithTeacher.
- logits (tf.Tensor of shape (batch_size, config.num_labels)) — Prediction scores as the average of the cls_logits and distillation_logits.
- cls_logits (tf.Tensor of shape (batch_size, config.num_labels)) — Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the class token).
- distillation_logits (tf.Tensor of shape (batch_size, config.num_labels)) — Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the distillation token).
- hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
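The relationship between the three score fields can be shown with a small sketch: the combined logits are described as the average of the two heads' outputs. The `average_logits` helper and the numbers below are made up for illustration and are not taken from the library or from a real model.

```python
def average_logits(cls_logits, distillation_logits):
    """Element-wise mean of two equally sized lists of scores."""
    return [(c + d) / 2 for c, d in zip(cls_logits, distillation_logits)]


# Made-up scores for one image and three classes (not real model output).
cls_logits = [2.0, 0.0, -1.0]
distillation_logits = [1.0, 1.0, -3.0]

# Combined prediction scores, one per class.
logits = average_logits(cls_logits, distillation_logits)
```

With these values the combined scores are [1.5, 0.5, -2.0], so the first class would be predicted even though the distillation head alone scores the first two classes equally.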
Example:
>>> from transformers import AutoImageProcessor, TFEfficientFormerForImageClassificationWithTeacher
>>> import tensorflow as tf
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = TFEfficientFormerForImageClassificationWithTeacher.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="tf")
>>> logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = int(tf.math.argmax(logits, axis=-1))
>>> print(model.config.id2label[predicted_label])
LABEL_281