Model outputs
All models return outputs that are instances of subclasses of ModelOutput. These are data structures that contain all the information returned by the model, and they can also be used as tuples or dictionaries.
Let’s see how this looks in an example:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(**inputs, labels=labels)
The outputs object is a SequenceClassifierOutput. As the documentation of that class below shows, it has an optional loss, a logits, an optional hidden_states, and an optional attentions attribute. Here we have the loss since we passed along labels, but we don’t have hidden_states and attentions because we didn’t pass output_hidden_states=True or output_attentions=True.
You can access each attribute as you normally would; if the model did not return that attribute, you get None. Here, for instance, outputs.loss is the loss computed by the model, and outputs.attentions is None.
When the outputs object is treated as a tuple, only the attributes that are not None are considered. Here, for instance, it has two elements, loss then logits, so outputs[:2] returns the tuple (outputs.loss, outputs.logits).
When the outputs object is treated as a dictionary, only the attributes that are not None are considered. Here, for instance, it has two keys, loss and logits.
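The None-skipping behaviour described above can be illustrated with a small stand-in class. This is only a sketch and not the library's implementation: MiniOutput and its helper _present are hypothetical names, and plain Python values take the place of tensors so the example runs without downloading a model.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class MiniOutput:
    """Toy stand-in (hypothetical, not the library class) for a model output."""

    loss: Optional[Any] = None
    logits: Optional[Any] = None
    hidden_states: Optional[Any] = None
    attentions: Optional[Any] = None

    def _present(self):
        # Keep only the fields whose value is not None, in declaration order.
        return [
            (f.name, getattr(self, f.name))
            for f in fields(self)
            if getattr(self, f.name) is not None
        ]

    def to_tuple(self):
        return tuple(value for _, value in self._present())

    def keys(self):
        return [name for name, _ in self._present()]

    def __getitem__(self, key):
        if isinstance(key, str):      # dictionary-style access
            return dict(self._present())[key]
        return self.to_tuple()[key]   # tuple-style access (int or slice)


outputs = MiniOutput(loss=0.7, logits=[0.2, -0.1])
print(outputs.attentions)   # None: the attribute exists but was not returned
print(outputs[:2])          # (0.7, [0.2, -0.1])
print(outputs.keys())       # ['loss', 'logits']
```

Attribute access always works and yields None for missing fields, while the tuple and dictionary views contain only the two fields that were set.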
We document here the generic model outputs that are used by more than one model type. Specific output types are documented on their corresponding model page.
ModelOutput
Base class for all model outputs, defined as a dataclass. It has a __getitem__ that allows indexing by integer or slice (like a tuple) or by string (like a dictionary), ignoring the attributes that are None. Otherwise it behaves like a regular Python dictionary.
You can’t unpack a ModelOutput directly. Use the to_tuple() method to convert it to a tuple first.
to_tuple() converts the output to a tuple containing all the attributes/keys that are not None.
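One way to see why direct unpacking is discouraged: the object otherwise behaves like a regular Python dictionary, and iterating over a dictionary yields its keys, not its values. The sketch below uses a plain OrderedDict as a stand-in for a model output; that substitution is an assumption made for illustration, and the values are ordinary Python numbers, not tensors.

```python
from collections import OrderedDict

# Stand-in for a model output holding two non-None fields.
outputs = OrderedDict(loss=0.7, logits=[0.2, -0.1])

# Unpacking a dictionary iterates over its keys, so this yields field names.
first, second = outputs
print(first, second)  # loss logits

# Converting the values to a tuple first gives the intended result.
loss, logits = tuple(outputs.values())
print(loss, logits)  # 0.7 [0.2, -0.1]
```

With an actual ModelOutput, the conversion step corresponds to calling to_tuple() before unpacking.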
BaseModelOutput
class transformers.modeling_outputs.BaseModelOutput
(last_hidden_state: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs, with potential hidden states and attentions.
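The shapes and tuple lengths listed above can be summarised in a few lines. The helper below is a hypothetical sketch that works on shape tuples only, with no tensors involved; the example values (12 layers, 12 heads, hidden size 768, 8 tokens) are assumptions chosen for illustration.

```python
def expected_output_shapes(batch_size, seq_len, hidden_size, num_layers, num_heads,
                           has_embedding_layer=True):
    """Return the shapes described for a base model output (hypothetical helper)."""
    hidden = (batch_size, seq_len, hidden_size)
    attention = (batch_size, num_heads, seq_len, seq_len)
    # One hidden state per layer, plus one for the embeddings when present.
    n_hidden = num_layers + (1 if has_embedding_layer else 0)
    return {
        "last_hidden_state": hidden,
        "hidden_states": (hidden,) * n_hidden,
        "attentions": (attention,) * num_layers,
    }


# Assumed example values: 12 layers, 12 heads, hidden size 768, 8 tokens.
shapes = expected_output_shapes(batch_size=1, seq_len=8, hidden_size=768,
                                num_layers=12, num_heads=12)
print(len(shapes["hidden_states"]))  # 13
print(len(shapes["attentions"]))     # 12
print(shapes["attentions"][0])       # (1, 12, 8, 8)
```

Note that hidden_states has one more entry than attentions whenever the model has an embedding layer.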
BaseModelOutputWithPooling
class transformers.modeling_outputs.BaseModelOutputWithPooling
(last_hidden_state: FloatTensor = None, pooler_output: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs that also contains a pooling of the last hidden states.
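For BERT-family models, the pooler_output description above amounts to taking the first token's hidden state and passing it through a linear layer and a tanh. The following is a minimal pure-Python sketch of that computation on nested lists; the function name pool_first_token and the weight and bias values are hypothetical, and a real model uses trained parameters and tensors.

```python
import math


def pool_first_token(last_hidden_state, weight, bias):
    """Apply a linear layer followed by tanh to the first token of each sequence.

    last_hidden_state: nested lists of shape (batch_size, sequence_length, hidden_size)
    weight: nested lists of shape (hidden_size, hidden_size), hypothetical values
    bias: list of length hidden_size, hypothetical values
    """
    pooled = []
    for sequence in last_hidden_state:
        first_token = sequence[0]  # hidden state of the classification token
        row = []
        for w_row, b in zip(weight, bias):
            z = sum(w * h for w, h in zip(w_row, first_token)) + b
            row.append(math.tanh(z))
        pooled.append(row)
    return pooled  # shape (batch_size, hidden_size)


hidden = [[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]]  # shape (1, 3, 2)
weight = [[1.0, 0.0], [0.0, 1.0]]                # identity, for readability
bias = [0.0, 0.0]
pooled = pool_first_token(hidden, weight, bias)
print(len(pooled), len(pooled[0]))  # 1 2
```

The sequence dimension disappears in the result: only the first position contributes, which is why pooler_output has shape (batch_size, hidden_size).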
BaseModelOutputWithCrossAttentions
class transformers.modeling_outputs.BaseModelOutputWithCrossAttentions
(last_hidden_state: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for model’s outputs, with potential hidden states and attentions.
BaseModelOutputWithPoolingAndCrossAttentions
class transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions
(last_hidden_state: FloatTensor = None, pooler_output: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
Base class for model’s outputs that also contains a pooling of the last hidden states.
BaseModelOutputWithPast
class transformers.modeling_outputs.BaseModelOutputWithPast
(last_hidden_state: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs that may also contain past key/values (to speed up sequential decoding).
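The interaction between past_key_values and last_hidden_state described above can be traced on shapes alone. The sketch below assumes a decoder-only model in which each cached decoding step feeds exactly one new token, so every cached key and value grows by one position; the helper names cache_shapes and decode_step are hypothetical and no tensors are involved.

```python
def cache_shapes(n_layers, batch_size, num_heads, seq_len, head_dim):
    """Shapes of a decoder-only cache: one (key, value) pair per layer (hypothetical helper)."""
    kv = (batch_size, num_heads, seq_len, head_dim)
    return tuple((kv, kv) for _ in range(n_layers))


def decode_step(cache, batch_size, hidden_size):
    """Simulate one cached decoding step on shapes only."""
    # Each cached key/value grows by one position along the sequence axis.
    new_cache = tuple(
        tuple((b, h, s + 1, d) for (b, h, s, d) in layer) for layer in cache
    )
    # With a cache, only the hidden state of the newest position is returned.
    last_hidden_state = (batch_size, 1, hidden_size)
    return last_hidden_state, new_cache


cache = cache_shapes(n_layers=2, batch_size=1, num_heads=4, seq_len=5, head_dim=16)
last_hidden_state, cache = decode_step(cache, batch_size=1, hidden_size=64)
print(last_hidden_state)          # (1, 1, 64)
print(len(cache), len(cache[0]))  # 2 2
print(cache[0][0])                # (1, 4, 6, 16)
```

The outer tuple has one entry per layer and each entry holds two shapes, matching the "2 tensors" wording in the past_key_values description.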
BaseModelOutputWithPastAndCrossAttentions
class transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions
(last_hidden_state: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for model’s outputs that may also contain past key/values (to speed up sequential decoding).
Seq2SeqModelOutput
class transformers.modeling_outputs.Seq2SeqModelOutput
(last_hidden_state: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model encoder’s outputs that also contains pre-computed hidden states that can speed up sequential decoding.
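For an encoder-decoder model, each per-layer entry of past_key_values holds four tensors, not two. The sketch below builds the corresponding shape tuples; the helper name seq2seq_cache_shapes is hypothetical, and the ordering inside each entry (self-attention pair first, cross-attention pair second) is an assumption that simply follows the order of the description.

```python
def seq2seq_cache_shapes(n_layers, batch_size, num_heads, decoder_len, encoder_len, head_dim):
    """Shapes of an encoder-decoder cache, one 4-tuple per layer (hypothetical helper)."""
    self_attn = (batch_size, num_heads, decoder_len, head_dim)
    cross_attn = (batch_size, num_heads, encoder_len, head_dim)
    # Assumed order: self-attention key and value, then cross-attention key and value.
    return tuple((self_attn, self_attn, cross_attn, cross_attn) for _ in range(n_layers))


cache = seq2seq_cache_shapes(n_layers=2, batch_size=1, num_heads=4,
                             decoder_len=3, encoder_len=7, head_dim=16)
print(len(cache), len(cache[0]))  # 2 4
print(cache[0][0])                # (1, 4, 3, 16)
print(cache[0][2])                # (1, 4, 7, 16)
```

The cross-attention shapes depend on the encoder sequence length, so they stay fixed during decoding while the self-attention shapes follow the decoder sequence.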
CausalLMOutput
class transformers.modeling_outputs.CausalLMOutput
(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for causal language model (or autoregressive) outputs.
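Since logits are scores "before SoftMax", turning them into probabilities and into a next-token loss takes two small steps. The following pure-Python sketch shows one plausible reading for a single sequence: it assumes that the scores at position t are compared against the token at position t + 1 and that the loss is the mean cross-entropy over those positions. The function names are hypothetical and the toy values are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores (logits) into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, token_ids):
    """Mean cross-entropy where position t is scored against token t + 1.

    logits: list of per-position score lists, shape (sequence_length, vocab_size)
    token_ids: list of input token ids, length sequence_length
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits[t])
        losses.append(-math.log(probs[token_ids[t + 1]]))
    return sum(losses) / len(losses)


# Toy values: 3 positions, vocabulary of 2 tokens, uniform scores.
logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss = next_token_loss(logits, token_ids=[0, 1, 0])
print(round(loss, 4))  # 0.6931
```

With uniform scores over two tokens every prediction has probability 0.5, so the loss equals ln 2.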
CausalLMOutputWithCrossAttentions
class transformers.modeling_outputs.CausalLMOutputWithCrossAttentions
(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of torch.FloatTensor tuples of length config.n_layers, with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if model is used in encoder-decoder setting. Only relevant if config.is_decoder = True. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
Base class for causal language model (or autoregressive) outputs.
CausalLMOutputWithPast
class transformers.modeling_outputs.CausalLMOutputWithPast
(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for causal language model (or autoregressive) outputs.
MaskedLMOutput
class transformers.modeling_outputs.MaskedLMOutput
(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Masked language modeling (MLM) loss.
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for masked language model outputs.
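A masked language modeling loss is typically computed only at the positions that were masked. The sketch below shows that idea for one sequence in pure Python. It assumes a marker label of -100 for positions that should be skipped; that marker, the function name masked_lm_loss, and the toy values are assumptions for illustration and are not stated in the parameter list above.

```python
import math

IGNORE_INDEX = -100  # assumed marker for positions that do not contribute to the loss


def masked_lm_loss(logits, labels):
    """Mean cross-entropy over the positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]  # negative log-probability of the label
        count += 1
    if count == 0:
        raise ValueError("no labelled positions to compute a loss from")
    return total / count


# Toy values: 3 positions, vocabulary of 2 tokens, only the middle position is labelled.
logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
labels = [IGNORE_INDEX, 0, IGNORE_INDEX]
loss = masked_lm_loss(logits, labels)
print(round(loss, 4))  # 0.1269
```

Only the middle position enters the average, so the scores at the other two positions have no effect on the result.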
Seq2SeqLMOutput
class transformers.modeling_outputs.Seq2SeqLMOutput
(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for sequence-to-sequence language model outputs.
NextSentencePredictorOutput
class transformers.modeling_outputs.NextSentencePredictorOutput
(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
- loss (torch.FloatTensor of shape (1,), optional, returned when next_sentence_label is provided) — Next sequence prediction (classification) loss.
- logits (torch.FloatTensor of shape (batch_size, 2)) — Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of models predicting if two sentences are consecutive or not.
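Since the logits above are scores before SoftMax, turning them into probabilities takes one extra step. The sketch below is a framework-free illustration: plain Python lists stand in for a tensor of shape (batch_size, 2), and the values in nsp_logits are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a batch of two sentence pairs, shape (batch_size, 2).
nsp_logits = [[2.0, -1.0], [0.0, 0.0]]

# One probability pair per example; each row sums to 1.
nsp_probs = [softmax(row) for row in nsp_logits]
```

With real model outputs, the same result would normally be obtained by applying a softmax over the last dimension of outputs.logits.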
SequenceClassifierOutput
class transformers.modeling_outputs.SequenceClassifierOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None logits: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- 
							logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- 
							hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- 
							attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sentence classification models.
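The loss description above distinguishes classification from regression when config.num_labels==1. The following is a simplified sketch of that distinction, assuming mean squared error for the regression case and cross-entropy otherwise. The function name and the plain-list inputs are hypothetical, and actual models may apply further rules when choosing the loss.

```python
import math


def classification_or_regression_loss(logits, labels, num_labels):
    """Mean loss over the batch: squared error if num_labels == 1, else cross-entropy."""
    losses = []
    for row, label in zip(logits, labels):
        if num_labels == 1:
            # Regression: a single score per example compared with a float target.
            losses.append((row[0] - label) ** 2)
        else:
            # Classification: negative log-probability of the target class.
            shift = max(row)
            log_norm = shift + math.log(sum(math.exp(s - shift) for s in row))
            losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


# Hypothetical inputs: logits have shape (batch_size, num_labels).
regression_loss = classification_or_regression_loss([[2.0], [0.0]], [1.0, 0.0], num_labels=1)
classification_loss = classification_or_regression_loss([[0.0, 0.0]], [1], num_labels=2)
```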
Seq2SeqSequenceClassifierOutput
class transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None logits: FloatTensor = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- 
							logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- 
							past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
							decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
							encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sequence-to-sequence sentence classification models.
MultipleChoiceModelOutput
class transformers.modeling_outputs.MultipleChoiceModelOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None logits: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
- 
							logits (torch.FloatTensor of shape (batch_size, num_choices)) — num_choices is the second dimension of the input tensors (see input_ids above). Classification scores (before SoftMax).
- 
							hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- 
							attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of multiple choice models.
TokenClassifierOutput
class transformers.modeling_outputs.TokenClassifierOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None logits: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
- 
							logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
- 
							hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- 
							attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of token classification models.
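Token classification logits carry one score vector per token, so decoding them means taking an argmax along the last dimension for every position. The sketch below illustrates this with nested Python lists standing in for a tensor of shape (batch_size, sequence_length, config.num_labels). The id2label mapping and the logit values are hypothetical.

```python
def decode_token_labels(logits, id2label):
    """Map per-token logits of shape (batch, seq_len, num_labels) to label strings."""
    decoded = []
    for sequence in logits:
        labels = []
        for token_scores in sequence:
            # Index of the highest-scoring label for this token.
            best = max(range(len(token_scores)), key=token_scores.__getitem__)
            labels.append(id2label[best])
        decoded.append(labels)
    return decoded


# Hypothetical label map and logits for one sequence of three tokens.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}
token_logits = [[[3.0, 0.1, 0.2], [0.0, 2.5, 0.3], [0.1, 0.2, 1.9]]]

token_labels = decode_token_labels(token_logits, id2label)
```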
QuestionAnsweringModelOutput
class transformers.modeling_outputs.QuestionAnsweringModelOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None start_logits: FloatTensor = None end_logits: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
- 
							start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
- 
							end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
- 
							hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- 
							attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of question answering models.
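Taking the argmax of start_logits and end_logits independently can yield an end position that precedes the start. A common remedy is to search for the highest-scoring pair with start <= end. The sketch below shows that search for a single example. The best_span helper, its max_answer_len parameter, and the logit values are hypothetical and not part of the library.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score and start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit.
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best, best_score


# Hypothetical logits for one example of five tokens.
# The independent argmaxes (start=2, end=1) would form an invalid span.
start_logits = [0.1, 0.2, 4.0, 0.3, 0.1]
end_logits = [0.0, 5.0, 0.2, 3.5, 0.1]

span, span_score = best_span(start_logits, end_logits)
```

The returned indices are token positions, so they still need to be mapped back to text with the tokenizer that produced the inputs.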
Seq2SeqQuestionAnsweringModelOutput
class transformers.modeling_outputs.Seq2SeqQuestionAnsweringModelOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None start_logits: FloatTensor = None end_logits: FloatTensor = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
- 
							start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
- 
							end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
- 
							past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
							decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
							encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sequence-to-sequence question answering models.
Seq2SeqSpectrogramOutput
class transformers.modeling_outputs.Seq2SeqSpectrogramOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None spectrogram: FloatTensor = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Spectrogram generation loss.
- 
							spectrogram (torch.FloatTensor of shape (batch_size, sequence_length, num_bins)) — The predicted spectrogram.
- 
							past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
							decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
							encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for sequence-to-sequence spectrogram outputs.
SemanticSegmenterOutput
class transformers.modeling_outputs.SemanticSegmenterOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None logits: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- 
							logits (torch.FloatTensor of shape (batch_size, config.num_labels, logits_height, logits_width)) — Classification scores for each pixel. The logits returned do not necessarily have the same size as the pixel_values passed as inputs. This is to avoid doing two interpolations and losing some quality when a user needs to resize the logits to the original image size as post-processing. You should always check your logits shape and resize as needed.
- 
							hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, patch_size, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- 
							attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of semantic segmentation models.
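Because the logits may be smaller than the input image, a post-processing step has to bring the prediction back to the original resolution. The sketch below is a deliberately simplified, framework-free illustration: it takes the per-pixel argmax first and then resizes the class map with nearest-neighbour sampling. In practice one would more commonly interpolate the logits themselves (for example bilinearly) before the argmax. Both helper names and the toy logits are hypothetical.

```python
def argmax_classes(logits):
    """Turn logits of shape [num_labels][height][width] into a class map [height][width]."""
    num_labels = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    return [
        [max(range(num_labels), key=lambda c: logits[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


def resize_nearest(class_map, out_h, out_w):
    """Nearest-neighbour resize of a 2D class map to (out_h, out_w)."""
    in_h, in_w = len(class_map), len(class_map[0])
    return [
        [class_map[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# Hypothetical logits for one image: 2 labels on a 2x2 grid.
image_logits = [
    [[1.0, 0.0], [0.0, 1.0]],  # scores for label 0
    [[0.0, 1.0], [1.0, 0.0]],  # scores for label 1
]

class_map = argmax_classes(image_logits)
upsampled = resize_nearest(class_map, 4, 4)  # pretend the input image was 4x4
```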
ImageClassifierOutput
class transformers.modeling_outputs.ImageClassifierOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None logits: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- 
							logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- 
							hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
- 
							attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of image classification models.
ImageClassifierOutputWithNoAttention
class transformers.modeling_outputs.ImageClassifierOutputWithNoAttention
< source >( loss: typing.Optional[torch.FloatTensor] = None logits: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- 
							logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- 
							hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the model at the output of each stage.
Base class for outputs of image classification models.
DepthEstimatorOutput
class transformers.modeling_outputs.DepthEstimatorOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None predicted_depth: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- 
							predicted_depth (torch.FloatTensor of shape (batch_size, height, width)) — Predicted depth for each pixel.
- 
							hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, num_channels, height, width). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- 
							attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of depth estimation models.
Wav2Vec2BaseModelOutput
class transformers.modeling_outputs.Wav2Vec2BaseModelOutput
< source >( last_hidden_state: FloatTensor = None extract_features: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- 
							extract_features (torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) — Sequence of extracted feature vectors of the last convolutional layer of the model.
- 
							hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for models that have been trained with the Wav2Vec2 loss objective.
XVectorOutput
class transformers.modeling_outputs.XVectorOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None logits: FloatTensor = None embeddings: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
- 
							logits (torch.FloatTensor of shape (batch_size, config.xvector_output_dim)) — Classification hidden states before AMSoftmax.
- 
							embeddings (torch.FloatTensor of shape (batch_size, config.xvector_output_dim)) — Utterance embeddings used for vector similarity-based retrieval.
- 
							hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Output type of Wav2Vec2ForXVector.
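The embeddings field is described as being used for vector similarity-based retrieval. One common similarity measure for that purpose is cosine similarity, sketched below with plain Python lists in place of rows of a (batch_size, config.xvector_output_dim) tensor. The choice of cosine similarity and the toy vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical utterance embeddings (real ones have config.xvector_output_dim entries).
utterance_a = [1.0, 0.0]
utterance_b = [2.0, 0.0]  # same direction as utterance_a
utterance_c = [0.0, 3.0]  # orthogonal to utterance_a

same_direction = cosine_similarity(utterance_a, utterance_b)
orthogonal = cosine_similarity(utterance_a, utterance_c)
```

A score close to 1 indicates similar embeddings. Any decision threshold applied on top of it has to be tuned for the task at hand.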
Seq2SeqTSModelOutput
class transformers.modeling_outputs.Seq2SeqTSModelOutput
< source >( last_hidden_state: FloatTensor = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None loc: typing.Optional[torch.FloatTensor] = None scale: typing.Optional[torch.FloatTensor] = None static_features: typing.Optional[torch.FloatTensor] = None )
Parameters
- 
							last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- 
							past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
- 
							decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
- 
							encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series’ context window which is used to give the model inputs of the same magnitude and then used to shift back to the original magnitude.
- 
							scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series’ context window which is used to give the model inputs of the same magnitude and then used to rescale back to the original magnitude.
- 
							static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch which are copied to the covariates at inference time.
Base class for time series model’s encoder outputs that also contains pre-computed hidden states that can speed up sequential decoding.
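The role of loc and scale can be illustrated with a small framework-free sketch. It assumes a mean/standard-deviation scaler; the scaler a given model actually applies depends on its configuration, and the helper names fit_scaler, normalize and denormalize are hypothetical, not part of the library.

```python
from statistics import fmean, pstdev


def fit_scaler(context, eps=1e-5):
    """Return (loc, scale) for one context window (hypothetical helper)."""
    loc = fmean(context)
    scale = pstdev(context) + eps  # eps avoids division by zero on a flat series
    return loc, scale


def normalize(values, loc, scale):
    """Bring values to a common magnitude before they reach the model."""
    return [(v - loc) / scale for v in values]


def denormalize(values, loc, scale):
    """Shift and rescale values back to the original magnitude."""
    return [v * scale + loc for v in values]


context = [100.0, 102.0, 98.0, 104.0]
loc, scale = fit_scaler(context)
model_inputs = normalize(context, loc, scale)
restored = denormalize(model_inputs, loc, scale)
```

Here restored matches context up to floating-point error, which is the round trip that loc and scale make possible for model predictions.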
Seq2SeqTSPredictionOutput
class transformers.modeling_outputs.Seq2SeqTSPredictionOutput
< source >( loss: typing.Optional[torch.FloatTensor] = None params: typing.Optional[typing.Tuple[torch.FloatTensor]] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None loc: typing.Optional[torch.FloatTensor] = None scale: typing.Optional[torch.FloatTensor] = None static_features: typing.Optional[torch.FloatTensor] = None )
Parameters
- 
							loss (torch.FloatTensor of shape (1,), optional, returned when future_values is provided) — Distributional loss.
- 
							params (torch.FloatTensor of shape (batch_size, num_samples, num_params)) — Parameters of the chosen distribution.
- 
							past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
							decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
							encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series’ context window, used to give the model inputs of the same magnitude and then to shift predictions back to the original magnitude.
- 
							scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series’ context window, used to give the model inputs of the same magnitude and then to rescale predictions back to the original magnitude.
- 
							static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch, which are copied to the covariates at inference time.
Base class for time series model’s decoder outputs that also contain the loss as well as the parameters of the chosen distribution.
SampleTSPredictionOutput
class transformers.modeling_outputs.SampleTSPredictionOutput
< source >( sequences: FloatTensor = None )
Base class for time series model’s predictions outputs that contains the sampled values from the chosen distribution.
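Sampled values are usually reduced to a point forecast before use. The sketch below is framework-free and rests on assumptions: it treats the samples for one series as nested lists of shape (num_samples, prediction_length), uses Gaussian draws as a stand-in for real model samples, and picks the per-step median as the summary statistic.

```python
import random
from statistics import median

random.seed(0)
num_samples, prediction_length = 100, 3

# Stand-in for one series' sampled trajectories: (num_samples, prediction_length).
# Real samples would come from the model's chosen distribution.
samples = [
    [random.gauss(10.0 + t, 1.0) for t in range(prediction_length)]
    for _ in range(num_samples)
]

# Per-step median across the sampled trajectories gives a point forecast.
point_forecast = [
    median(path[t] for path in samples) for t in range(prediction_length)
]
```

Other summaries, such as the mean or selected quantiles, can be computed from the same samples in the same way.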
TFBaseModelOutput
class transformers.modeling_tf_outputs.TFBaseModelOutput
< source >( last_hidden_state: Tensor = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs, with potential hidden states and attentions.
TFBaseModelOutputWithPooling
class transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling
< source >( last_hidden_state: Tensor = None pooler_output: Tensor = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- 
							pooler_output (tf.Tensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. This output is usually not a good summary of the semantic content of the input; you’re often better off averaging or pooling the sequence of hidden-states for the whole input sequence.
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs that also contains a pooling of the last hidden states.
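The advice to average the sequence of hidden-states instead of relying on pooler_output can be sketched without any framework. The example assumes an attention_mask (1 for real tokens, 0 for padding) is available alongside last_hidden_state, and uses nested lists in place of tensors; mean_pool is a hypothetical helper, not a library function.

```python
def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    last_hidden_state: nested lists of shape (batch_size, sequence_length, hidden_size)
    attention_mask: nested lists of shape (batch_size, sequence_length), 1 = real token
    """
    pooled = []
    for tokens, mask in zip(last_hidden_state, attention_mask):
        kept = [vec for vec, m in zip(tokens, mask) if m == 1]
        hidden_size = len(tokens[0])
        pooled.append(
            [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]
        )
    return pooled


# One sequence of three positions; the last position is padding.
last_hidden_state = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
attention_mask = [[1, 1, 0]]
sentence_embeddings = mean_pool(last_hidden_state, attention_mask)
```

The padded position is excluded, so the result has shape (batch_size, hidden_size), the same shape as pooler_output.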
TFBaseModelOutputWithPoolingAndCrossAttentions
class transformers.modeling_tf_outputs.TFBaseModelOutputWithPoolingAndCrossAttentions
< source >( last_hidden_state: Tensor = None pooler_output: Tensor = None past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- 
							pooler_output (tf.Tensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. This output is usually not a good summary of the semantic content of the input; you’re often better off averaging or pooling the sequence of hidden-states for the whole input sequence.
- 
							past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for model’s outputs that also contains a pooling of the last hidden states.
TFBaseModelOutputWithPast
class transformers.modeling_tf_outputs.TFBaseModelOutputWithPast
< source >( last_hidden_state: Tensor = None past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- 
							past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).
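Why cached keys and values speed up sequential decoding can be shown with a toy single-head attention in plain Python. This is a sketch under simplifying assumptions: one head, one layer, no batch dimension, and identity stand-ins for the learned query/key/value projections. The attend helper is hypothetical and is not the library's implementation.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # attention weights after the softmax
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


# Toy token vectors; identity stand-ins for learned projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Without a cache: rebuild the key/value lists for the whole prefix at every step.
full = [attend(tokens[t], tokens[: t + 1], tokens[: t + 1]) for t in range(len(tokens))]

# With a cache: append only the newest token's key/value and reuse the rest.
cached_keys, cached_values, cached = [], [], []
for tok in tokens:
    cached_keys.append(tok)
    cached_values.append(tok)
    cached.append(attend(tok, cached_keys, cached_values))
```

Both paths produce identical outputs, but the cached path handles one new token per step, which is why only a hidden-state of sequence length 1 is returned when past_key_values is used.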
TFBaseModelOutputWithPastAndCrossAttentions
class transformers.modeling_tf_outputs.TFBaseModelOutputWithPastAndCrossAttentions
< source >( last_hidden_state: Tensor = None past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- 
							past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).
TFSeq2SeqModelOutput
class transformers.modeling_tf_outputs.TFSeq2SeqModelOutput
< source >( last_hidden_state: Tensor = None past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None decoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None decoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None encoder_last_hidden_state: typing.Optional[tensorflow.python.framework.ops.Tensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None encoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- 
							past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
							decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
							encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model encoder’s outputs that also contain pre-computed hidden states that can speed up sequential decoding.
TFCausalLMOutput
class transformers.modeling_tf_outputs.TFCausalLMOutput
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None logits: Tensor = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) — Language modeling loss (for next-token prediction).
- 
							logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for causal language model (or autoregressive) outputs.
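A loss of shape (n,), with n the number of non-masked labels, can be pictured as one cross-entropy value per kept label. The sketch below is framework-free and assumes that masked labels are marked with -100 and that logits and labels are already aligned position by position (any shifting for next-token prediction is left out). per_token_loss is a hypothetical helper, not the library's loss function.

```python
import math


def per_token_loss(logits, labels, ignore_index=-100):
    """Cross-entropy for each label that is not masked out.

    logits: nested lists of shape (sequence_length, vocab_size)
    labels: list of length sequence_length; ignore_index marks masked positions
    """
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked labels contribute no loss entry
        top = max(row)
        log_norm = top + math.log(sum(math.exp(x - top) for x in row))
        losses.append(log_norm - row[label])  # negative log-softmax at the label
    return losses


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [0, -100, 1]
loss = per_token_loss(logits, labels)
```

With one of three labels masked, loss holds two values, one per non-masked label.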
TFCausalLMOutputWithCrossAttentions
class transformers.modeling_tf_outputs.TFCausalLMOutputWithCrossAttentions
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None logits: Tensor = None past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) — Language modeling loss (for next-token prediction).
- 
							logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
Base class for causal language model (or autoregressive) outputs.
TFCausalLMOutputWithPast
class transformers.modeling_tf_outputs.TFCausalLMOutputWithPast
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None logits: Tensor = None past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) — Language modeling loss (for next-token prediction).
- 
							logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
							past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for causal language model (or autoregressive) outputs.
TFMaskedLMOutput
class transformers.modeling_tf_outputs.TFMaskedLMOutput
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None logits: Tensor = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) — Masked language modeling (MLM) loss.
- 
							logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for masked language models outputs.
TFSeq2SeqLMOutput
class transformers.modeling_tf_outputs.TFSeq2SeqLMOutput
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None logits: Tensor = None past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None decoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None decoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None encoder_last_hidden_state: typing.Optional[tensorflow.python.framework.ops.Tensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None encoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) — Language modeling loss.
- 
							logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
							past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
							decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
							encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for sequence-to-sequence language models outputs.
TFNextSentencePredictorOutput
class transformers.modeling_tf_outputs.TFNextSentencePredictorOutput
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None logits: Tensor = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when next_sentence_label is provided) — Next sentence prediction loss.
- 
							logits (tf.Tensor of shape (batch_size, 2)) — Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of models predicting if two sentences are consecutive or not.
TFSequenceClassifierOutput
class transformers.modeling_tf_outputs.TFSequenceClassifierOutput
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None logits: Tensor = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (batch_size, ), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- 
							logits (tf.Tensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sentence classification models.
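The logits above are raw scores "before SoftMax", so they are not yet probabilities. As a minimal sketch of that last step, the following standard-library Python applies a softmax to each row and picks the most likely label. The logits values are hypothetical, and plain nested lists stand in for a tf.Tensor of shape (batch_size, config.num_labels).

```python
import math


def softmax(scores):
    """Turn a list of raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability without changing the result.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits of shape (batch_size=2, num_labels=3), written as nested lists.
logits = [[2.0, 0.5, -1.0], [0.1, 0.1, 0.1]]

probabilities = [softmax(row) for row in logits]
# Index of the highest probability in each row (first index wins on ties).
predicted_labels = [row.index(max(row)) for row in probabilities]
```

With real model outputs the same computation would normally be done with the framework's own softmax and argmax operations rather than Python lists.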
TFSeq2SeqSequenceClassifierOutput
class transformers.modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None logits: Tensor = None past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None decoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None decoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None encoder_last_hidden_state: typing.Optional[tensorflow.python.framework.ops.Tensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None encoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (1,), optional, returned when label is provided) — Classification (or regression if config.num_labels==1) loss.
- 
							logits (tf.Tensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- 
							past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
							decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
							encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sequence-to-sequence sentence classification models.
TFMultipleChoiceModelOutput
class transformers.modeling_tf_outputs.TFMultipleChoiceModelOutput
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None logits: Tensor = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (batch_size, ), optional, returned when labels is provided) — Classification loss.
- 
							logits (tf.Tensor of shape (batch_size, num_choices)) — Classification scores (before SoftMax). num_choices is the second dimension of the input tensors (see input_ids above).
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of multiple choice models.
TFTokenClassifierOutput
class transformers.modeling_tf_outputs.TFTokenClassifierOutput
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None logits: Tensor = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (n,), optional, where n is the number of unmasked labels, returned when labels is provided) — Classification loss.
- 
							logits (tf.Tensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of token classification models.
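Token classification logits carry one score per label for every token, which is why they have shape (batch_size, sequence_length, config.num_labels). The sketch below shows one way to turn such scores into per-token labels by taking the highest-scoring label at each position. The label map and logits values are hypothetical, and nested lists stand in for a tensor.

```python
# Hypothetical label map, invented for this illustration.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}

# Hypothetical logits of shape (batch_size=1, sequence_length=4, num_labels=3).
logits = [[[3.0, 0.1, 0.2], [0.2, 2.5, 0.1], [0.1, 0.3, 2.0], [1.5, 0.2, 0.4]]]


def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# One predicted label per token, for every sequence in the batch.
predicted = [
    [id2label[argmax(token_scores)] for token_scores in sequence]
    for sequence in logits
]
```

A softmax is not needed just to pick the label, since it does not change which score is largest.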
TFQuestionAnsweringModelOutput
class transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None start_logits: Tensor = None end_logits: Tensor = None hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (batch_size, ), optional, returned when start_positions and end_positions are provided) — Total span extraction loss, computed as the sum of the cross-entropy losses for the start and end positions.
- 
							start_logits (tf.Tensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
- 
							end_logits (tf.Tensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
- 
							hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of question answering models.
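The start_logits and end_logits above score every token position as a possible start or end of the answer. A common way to decode them, sketched here under assumptions, is to pick the pair (start, end) with the highest summed score while requiring start <= end. The helper best_span, its max_answer_length parameter, and the example scores are hypothetical and not part of the library; plain lists stand in for one row of each tensor.

```python
def best_span(start_logits, end_logits, max_answer_length=30):
    """Return (start, end, score) maximizing start + end scores with start <= end."""
    best = None
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the allowed answer length.
        last = min(len(end_logits), start + max_answer_length)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical scores for a single sequence of five tokens.
start_logits = [0.1, 4.0, 0.3, 0.2, 0.0]
end_logits = [0.0, 0.5, 0.2, 3.5, 0.1]

start, end, score = best_span(start_logits, end_logits)
```

Taking the two argmax positions independently can yield an end before the start, which is why the sketch searches over valid pairs instead.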
TFSeq2SeqQuestionAnsweringModelOutput
class transformers.modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput
< source >( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None start_logits: Tensor = None end_logits: Tensor = None past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None decoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None decoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None encoder_last_hidden_state: typing.Optional[tensorflow.python.framework.ops.Tensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None encoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters
- 
							loss (tf.Tensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss, computed as the sum of the cross-entropy losses for the start and end positions.
- 
							start_logits (tf.Tensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
- 
							end_logits (tf.Tensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
- 
							past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
							decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
							encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sequence-to-sequence question answering models.
FlaxBaseModelOutput
class transformers.modeling_flax_outputs.FlaxBaseModelOutput
< source >( last_hidden_state: ndarray = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- 
							last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- 
							hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs, with potential hidden states and attentions.
The replace method returns a new object replacing the specified fields with new values.
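Returning a new object with some fields replaced, rather than mutating the existing one, is the same pattern as the standard library's dataclasses.replace. The sketch below illustrates that pattern only: ToyModelOutput is a hypothetical stand-in defined for this example, not one of the Flax output classes, and tuples of floats stand in for arrays.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyModelOutput:
    """Hypothetical immutable container with one required and one optional field."""

    last_hidden_state: Tuple[float, ...]
    hidden_states: Optional[Tuple[Tuple[float, ...], ...]] = None


original = ToyModelOutput(last_hidden_state=(0.1, 0.2))

# Build a new object; fields that are not named keep their previous values.
updated = replace(original, last_hidden_state=(0.3, 0.4))
```

The original object is left untouched, which suits functional code where values are expected not to change in place.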
FlaxBaseModelOutputWithPast
class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPast
< source >( last_hidden_state: ndarray = None past_key_values: typing.Union[typing.Dict[str, jax._src.numpy.ndarray.ndarray], NoneType] = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- 
							last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- 
							past_key_values (Dict[str, jnp.ndarray]) — Dictionary of pre-computed hidden-states (key and values in the attention blocks) that can be used for fast auto-regressive decoding. Pre-computed key and value hidden-states are of shape [batch_size, max_length].
- 
							hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs, with potential hidden states and attentions.
The replace method returns a new object replacing the specified fields with new values.
FlaxBaseModelOutputWithPooling
class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling
< source >( last_hidden_state: ndarray = None pooler_output: ndarray = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- 
							last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- 
							pooler_output (jnp.ndarray of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- 
							hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs that also contains a pooling of the last hidden states.
The replace method returns a new object replacing the specified fields with new values.
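The pooler_output described above is obtained by taking the first token's last-layer hidden state and passing it through a Linear layer followed by Tanh. The following is a rough sketch of that computation in plain Python. The function pool_first_token, the identity weight matrix, and the input values are hypothetical; a real model uses learned weights and array operations.

```python
import math


def pool_first_token(last_hidden_state, weight, bias):
    """Apply tanh(W x + b) to the first token of each sequence.

    last_hidden_state: nested lists of shape (batch_size, sequence_length, hidden_size)
    weight: nested lists of shape (hidden_size, hidden_size), one row per output unit
    bias: list of length hidden_size
    """
    pooled = []
    for sequence in last_hidden_state:
        first_token = sequence[0]  # hidden state of the classification token
        pooled.append(
            [
                math.tanh(sum(w * x for w, x in zip(row, first_token)) + b)
                for row, b in zip(weight, bias)
            ]
        )
    return pooled


# Hypothetical inputs: batch_size=1, sequence_length=2, hidden_size=2.
last_hidden_state = [[[0.0, 1.0], [5.0, 5.0]]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, chosen only for readability
bias = [0.0, 0.0]

pooled = pool_first_token(last_hidden_state, weight, bias)
```

The result has shape (batch_size, hidden_size), and the second token does not influence it.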
FlaxBaseModelOutputWithPastAndCrossAttentions
class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions
< source >( last_hidden_state: ndarray = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- 
							last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- 
							past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for model’s outputs that may also contain past key/values (to speed up sequential decoding).
The replace method returns a new object replacing the specified fields with new values.
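To see why cached key/values speed up sequential decoding, it can help to count the work involved. The sketch below is a simplified illustration under stated assumptions, not the library's caching code: step_with_cache and the two counting helpers are hypothetical, and one-element lists stand in for per-token key and value tensors.

```python
def step_with_cache(cache, new_key, new_value):
    """Append one token's key and value to the cache and return the updated cache."""
    keys, values = cache
    return keys + [new_key], values + [new_value]


def projections_without_cache(num_steps):
    """Count key/value projections if every step re-projects all tokens seen so far."""
    return sum(step + 1 for step in range(num_steps))


def projections_with_cache(num_steps):
    """Count key/value projections if each step only projects its newest token."""
    return num_steps


cache = ([], [])
for step in range(4):
    # Hypothetical one-dimensional key and value for the token generated at this step.
    cache = step_with_cache(cache, [float(step)], [2.0 * step])

cached_keys, cached_values = cache
```

Under these assumptions, decoding n tokens needs n projections with a cache, compared with n(n+1)/2 when every step starts from scratch.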
FlaxSeq2SeqModelOutput
class transformers.modeling_flax_outputs.FlaxSeq2SeqModelOutput
< source >( last_hidden_state: ndarray = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None decoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_last_hidden_state: typing.Optional[jax._src.numpy.ndarray.ndarray] = None encoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- 
							last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- 
							past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
							decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
							encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model encoder’s outputs that also contains pre-computed hidden states that can speed up sequential decoding.
The replace method returns a new object replacing the specified fields with new values.
FlaxCausalLMOutputWithCrossAttentions
class transformers.modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions
< source >( logits: ndarray = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- 
							logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
							hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of jnp.ndarray tuples of length config.n_layers, with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if model is used in encoder-decoder setting. Only relevant if config.is_decoder = True. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
Base class for causal language model (or autoregressive) outputs.
The replace method returns a new object replacing the specified fields with new values.
FlaxMaskedLMOutput
class transformers.modeling_flax_outputs.FlaxMaskedLMOutput
< source >( logits: ndarray = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- 
							logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
							hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- 
							attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for masked language models outputs.
The replace method returns a new object replacing the specified fields with new values.
FlaxSeq2SeqLMOutput
class transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput
< source >( logits: ndarray = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None decoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_last_hidden_state: typing.Optional[jax._src.numpy.ndarray.ndarray] = None encoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- 
							logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
							past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
							decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
							decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
							cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
							encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
							encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
							encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for sequence-to-sequence language models outputs.
“Returns a new object replacing the specified fields with new values.
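The "returns a new object replacing the specified fields" behaviour can be illustrated with the standard library alone. The sketch below is not the Flax class itself: ToySeq2SeqLMOutput is a hypothetical stand-in with two of the fields listed above, and dataclasses.replace is used as an analogue of the operation described, with plain tuples standing in for jnp.ndarray.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToySeq2SeqLMOutput:
    """Hypothetical stand-in for an output class; not part of transformers."""

    logits: Tuple[float, ...]
    encoder_last_hidden_state: Optional[Tuple[float, ...]] = None


original = ToySeq2SeqLMOutput(logits=(0.1, 0.9))

# replace() builds a new instance with the given fields swapped in;
# fields that are not named keep their previous values.
updated = replace(original, logits=(0.5, 0.5))

print(original.logits)  # the original object is left unchanged
print(updated.logits)
print(updated.encoder_last_hidden_state)
```

Because the stand-in is frozen, assigning to a field after construction raises an error, so building a modified copy is the only way to change a value in this sketch.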
FlaxNextSentencePredictorOutput
class transformers.modeling_flax_outputs.FlaxNextSentencePredictorOutput
( logits: ndarray = None, hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- logits (jnp.ndarray of shape (batch_size, 2)) — Prediction scores of the next sentence prediction (classification) head (scores of True/False continuation before SoftMax).
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of models predicting if two sentences are consecutive or not.
Returns a new object replacing the specified fields with new values.
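Since the logits are scores before SoftMax, turning them into probabilities takes one extra step. The following is a minimal sketch using plain Python lists in place of jnp.ndarray; the helper softmax and the example values are illustrative, and treating index 0 as "is a continuation" is an assumption that should be checked against the model in use.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits of shape (batch_size, 2) for two sentence pairs.
nsp_logits = [[2.0, 0.0], [-1.0, 1.0]]

nsp_probs = [softmax(row) for row in nsp_logits]

# Assumption: index 0 scores "sentence B continues sentence A".
is_continuation = [row[0] > row[1] for row in nsp_probs]

print(nsp_probs)
print(is_continuation)
```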
FlaxSequenceClassifierOutput
class transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput
( logits: ndarray = None, hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- logits (jnp.ndarray of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sentence classification models.
Returns a new object replacing the specified fields with new values.
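The logits description covers two cases: a classification score per label, or a single regression value when config.num_labels==1. A small sketch of how a caller might read each case, with plain lists standing in for jnp.ndarray; the helper predict and the sample values are hypothetical.

```python
def predict(logits_row, num_labels):
    """Read one row of logits as a regression value or a class index."""
    if num_labels == 1:
        # Regression: the single score is the prediction itself.
        return logits_row[0]
    # Classification: SoftMax is monotonic, so the argmax of the raw
    # logits is also the most probable class.
    return max(range(num_labels), key=lambda i: logits_row[i])


# Made-up logits of shape (batch_size, num_labels) with num_labels = 3.
batch_logits = [[0.2, 1.5, -0.3], [2.1, 0.0, 0.4]]
predictions = [predict(row, 3) for row in batch_logits]

# Made-up logits for a regression head (num_labels = 1).
regression_value = predict([0.73], 1)

print(predictions)
print(regression_value)
```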
FlaxSeq2SeqSequenceClassifierOutput
class transformers.modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput
( logits: ndarray = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, decoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, encoder_last_hidden_state: typing.Optional[jax._src.numpy.ndarray.ndarray] = None, encoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, encoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- logits (jnp.ndarray of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sequence-to-sequence sentence classification models.
Returns a new object replacing the specified fields with new values.
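The nested layout of past_key_values is easier to see when spelled out. The sketch below only builds the expected shapes as tuples of integers; the function name and the sample sizes are hypothetical, and no real arrays or model are involved. The grouping of two self-attention entries followed by two cross-attention entries follows the description above; which entry holds keys and which holds values is not asserted here.

```python
def past_key_values_shapes(
    n_layers,
    batch_size,
    num_heads,
    sequence_length,
    encoder_sequence_length,
    embed_size_per_head,
):
    """Return the nested tuple of shapes described for past_key_values."""
    self_attn = (batch_size, num_heads, sequence_length, embed_size_per_head)
    cross_attn = (
        batch_size,
        num_heads,
        encoder_sequence_length,
        embed_size_per_head,
    )
    # Two self-attention tensors, then two cross-attention tensors, per layer.
    per_layer = (self_attn, self_attn, cross_attn, cross_attn)
    return tuple(per_layer for _ in range(n_layers))


shapes = past_key_values_shapes(
    n_layers=2,
    batch_size=1,
    num_heads=4,
    sequence_length=5,
    encoder_sequence_length=7,
    embed_size_per_head=16,
)

print(len(shapes))     # one entry per layer
print(len(shapes[0]))  # four tensors per layer
print(shapes[0][0], shapes[0][2])
```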
FlaxMultipleChoiceModelOutput
class transformers.modeling_flax_outputs.FlaxMultipleChoiceModelOutput
( logits: ndarray = None, hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- logits (jnp.ndarray of shape (batch_size, num_choices)) — Classification scores (before SoftMax). num_choices is the second dimension of the input tensors (see input_ids above).
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of multiple choice models.
Returns a new object replacing the specified fields with new values.
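For multiple choice, each row of logits holds one score per candidate answer, so the selected choice is the index with the highest score. A minimal sketch with plain lists in place of jnp.ndarray; the values are made up for illustration.

```python
# Made-up logits of shape (batch_size, num_choices) with num_choices = 4.
choice_logits = [
    [0.1, 2.3, -0.5, 0.0],
    [1.2, 0.3, 0.9, 1.1],
]

# num_choices is the size of the second dimension.
num_choices = len(choice_logits[0])

# Index of the best-scoring choice for each example in the batch.
best_choice = [
    max(range(num_choices), key=lambda i: row[i]) for row in choice_logits
]

print(num_choices)
print(best_choice)
```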
FlaxTokenClassifierOutput
class transformers.modeling_flax_outputs.FlaxTokenClassifierOutput
( logits: ndarray = None, hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- logits (jnp.ndarray of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of token classification models.
Returns a new object replacing the specified fields with new values.
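Token classification logits carry one score vector per token, so decoding means taking an argmax along the last dimension for every position. The sketch below uses nested lists in place of jnp.ndarray; the id2label mapping and the logit values are hypothetical and chosen only to make the shape handling visible.

```python
# Hypothetical label mapping; a real one would come from the model config.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}

# Made-up logits of shape (batch_size=1, sequence_length=3, num_labels=3).
token_logits = [
    [
        [3.0, 0.1, 0.2],
        [0.2, 2.5, 0.1],
        [0.1, 0.3, 1.9],
    ]
]


def argmax(scores):
    """Index of the largest value in a list of floats."""
    return max(range(len(scores)), key=lambda i: scores[i])


# One predicted label per token, for every sequence in the batch.
token_labels = [
    [id2label[argmax(token)] for token in sequence]
    for sequence in token_logits
]

print(token_labels)
```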
FlaxQuestionAnsweringModelOutput
class transformers.modeling_flax_outputs.FlaxQuestionAnsweringModelOutput
( start_logits: ndarray = None, end_logits: ndarray = None, hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- start_logits (jnp.ndarray of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
- end_logits (jnp.ndarray of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of question answering models.
Returns a new object replacing the specified fields with new values.
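The start and end scores are usually combined to pick an answer span, and taking the two argmaxes independently can produce an end position that precedes the start. The sketch below shows one common way to avoid that, an exhaustive search over valid (start, end) pairs. It is an illustration under stated assumptions rather than the library's own post-processing: best_span and max_answer_len are hypothetical names, and plain lists stand in for one row of each jnp.ndarray.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return ((start, end), score) maximising start + end score, end >= start."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider end positions at or after the start position.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score


# Made-up scores for one sequence of five tokens. The largest end score
# (index 1) comes before the largest start score (index 2).
start_scores = [0.1, 0.2, 3.0, 0.3, 0.1]
end_scores = [0.0, 4.0, 0.5, 2.5, 0.1]

span, span_score = best_span(start_scores, end_scores)
print(span, span_score)
```

With these values the independent argmaxes would give start 2 and end 1, which is not a valid span; the constrained search settles on a span that starts at 2 and ends at 3.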
FlaxSeq2SeqQuestionAnsweringModelOutput
class transformers.modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput
( start_logits: ndarray = None, end_logits: ndarray = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, decoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, encoder_last_hidden_state: typing.Optional[jax._src.numpy.ndarray.ndarray] = None, encoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, encoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
- start_logits (jnp.ndarray of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
- end_logits (jnp.ndarray of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
- past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sequence-to-sequence question answering models.
Returns a new object replacing the specified fields with new values.