
The models module

This module contains the models that can be used as the four main components of a Wide and Deep model (wide, deeptabular, deeptext, deepimage), as well as the WideDeep "constructor" class. Note that each of the four components can be used independently. It also contains the documentation for the models that can be used for self-supervised pre-training with tabular data.

Wide

Bases: Module

Defines a Wide (linear) model where the non-linearities are captured via the so-called crossed-columns. This can be used as the wide component of a Wide & Deep model.

Parameters:

Name Type Description Default
input_dim int

size of the Linear layer (implemented via an Embedding layer). input_dim is the summation of all the individual values for all the features that go through the wide model. For example, if the wide model receives 2 features with 5 individual values each, input_dim = 10

required
pred_dim int

size of the output tensor containing the predictions. Note that, unlike all the other models, the wide model is connected directly to the output neuron(s) when used to build a Wide and Deep model. Therefore, it requires the pred_dim parameter.

1

Attributes:

Name Type Description
wide_linear Module

the linear layer that comprises the wide branch of the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import Wide
>>> X = torch.empty(4, 4).random_(20)
>>> wide = Wide(input_dim=int(X.max().item()), pred_dim=1)
>>> out = wide(X)
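The wide model implements its linear layer as an embedding lookup: summing the selected embedding rows (plus a bias) is equivalent to multiplying a one-hot vector by a weight matrix, which is exactly what the source below notes. A minimal pure-Python sketch of that equivalence (all names and sizes here are illustrative, not part of the library):

```python
import random

random.seed(0)

# Toy "wide" weight table: input_dim + 1 rows (row 0 reserved for padding),
# pred_dim = 1, mirroring nn.Embedding(input_dim + 1, 1, padding_idx=0).
input_dim = 10
weight = [random.gauss(0, 1) for _ in range(input_dim + 1)]
bias = 0.0

# One sample with two active feature values (indices into the weight table).
x = [3, 7]

# Embedding-sum formulation: look up the rows and sum them.
out_embedding = sum(weight[i] for i in x) + bias

# Equivalent one-hot + linear formulation: dot(one_hot, weight) + bias.
one_hot = [1.0 if i in x else 0.0 for i in range(input_dim + 1)]
out_linear = sum(h * w for h, w in zip(one_hot, weight)) + bias

assert abs(out_embedding - out_linear) < 1e-9
```

The embedding formulation avoids materializing the (potentially very sparse) one-hot vector, which is why `Wide` uses it.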
Source code in pytorch_widedeep/models/tabular/linear/wide.py
class Wide(nn.Module):
    r"""Defines a `Wide` (linear) model where the non-linearities are
    captured via the so-called crossed-columns. This can be used as the
    `wide` component of a Wide & Deep model.

    Parameters
    -----------
    input_dim: int
        size of the Linear layer (implemented via an Embedding layer).
        `input_dim` is the summation of all the individual values for all the
        features that go through the wide model. For example, if the wide
        model receives 2 features with 5 individual values each, `input_dim =
        10`
    pred_dim: int, default = 1
        size of the output tensor containing the predictions. Note that unlike
        all the other models, the wide model is connected directly to the
        output neuron(s) when used to build a Wide and Deep model. Therefore,
        it requires the `pred_dim` parameter.

    Attributes
    -----------
    wide_linear: nn.Module
        the linear layer that comprises the wide branch of the model

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import Wide
    >>> X = torch.empty(4, 4).random_(20)
    >>> wide = Wide(input_dim=int(X.max().item()), pred_dim=1)
    >>> out = wide(X)
    """

    @alias("pred_dim", ["pred_size", "num_class"])
    def __init__(self, input_dim: int, pred_dim: int = 1):
        super(Wide, self).__init__()

        self.input_dim = input_dim
        self.pred_dim = pred_dim

        # Embeddings: val + 1 because 0 is reserved for padding/unseen categories.
        self.wide_linear = nn.Embedding(input_dim + 1, pred_dim, padding_idx=0)
        # (Sum(Embedding) + bias) is equivalent to (OneHotVector + Linear)
        self.bias = nn.Parameter(torch.zeros(pred_dim))
        self._reset_parameters()

    def _reset_parameters(self) -> None:
        r"""initialize Embedding and bias like nn.Linear. See [original
        implementation](https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear).
        """
        nn.init.kaiming_uniform_(self.wide_linear.weight, a=math.sqrt(5))
        fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.wide_linear.weight)
        bound = 1 / math.sqrt(fan_in)
        nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, X: Tensor) -> Tensor:
        r"""Forward pass. Simply connecting the Embedding layer with the ouput
        neuron(s)"""
        out = self.wide_linear(X.long()).sum(dim=1) + self.bias
        return out

forward

forward(X)

Forward pass. Simply connecting the Embedding layer with the output neuron(s)

Source code in pytorch_widedeep/models/tabular/linear/wide.py
def forward(self, X: Tensor) -> Tensor:
    r"""Forward pass. Simply connecting the Embedding layer with the ouput
    neuron(s)"""
    out = self.wide_linear(X.long()).sum(dim=1) + self.bias
    return out

TabMlp

Bases: BaseTabularModelWithoutAttention

Defines a TabMlp model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class combines embedding representations of the categorical features with numerical (aka continuous) features, embedded or not. These are then passed through a series of dense layers (i.e. an MLP).

Most of the parameters for this class are Optional since the use of categorical or continuous features is in fact optional (i.e. one can use categorical features only, continuous features only, or both).

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the TabMlp model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
cat_embed_input Optional[List[Tuple[str, int, int]]]

List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal[batchnorm, layernorm]]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous Optional[bool]

Boolean indicating if the continuous columns will be embedded using one of the available methods: 'standard', 'periodic' or 'piecewise'. If None, it will default to 'False'.
ℹ️ NOTE: This parameter is deprecated and it will be removed in future releases. Please, use the embed_continuous_method parameter instead.

None
embed_continuous_method Optional[Literal[standard, piecewise, periodic]]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.

None
cont_embed_dim Optional[int]

Size of the continuous embeddings. If the continuous columns are embedded, cont_embed_dim must be passed.

None
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the names of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
mlp_hidden_dims List[int]

List with the number of neurons per dense layer in the mlp.

[200, 100]
mlp_activation str

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
mlp_dropout Union[float, List[float]]

float or List of floats with the dropout between the dense layers. e.g: [0.5,0.5]

0.1
mlp_batchnorm bool

Boolean indicating whether or not batch normalization will be applied to the dense layers

False
mlp_batchnorm_last bool

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

False
mlp_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

True
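As the descriptions above note, the 'piecewise' method requires quantization_setup, while the 'periodic' method requires n_frequencies, sigma and share_last_layer. A sketch of what such configurations might look like (column names, boundaries and values are purely illustrative, not prescriptive):

```python
# 'piecewise': per-column quantization boundaries for the continuous columns.
piecewise_setup = dict(
    embed_continuous_method="piecewise",
    cont_embed_dim=16,
    quantization_setup={
        "age": [18.0, 30.0, 45.0, 60.0],          # hypothetical boundaries
        "hours_per_week": [20.0, 40.0, 60.0],
    },
)

# 'periodic': number of frequencies k, the sigma used to initialise the
# frequency weights, and whether the final linear layer is shared.
periodic_setup = dict(
    embed_continuous_method="periodic",
    cont_embed_dim=16,
    n_frequencies=8,
    sigma=0.1,
    share_last_layer=True,
)
```

Together with column_idx and continuous_cols, a dict like these could be splatted into the constructor, e.g. TabMlp(column_idx=..., continuous_cols=..., **periodic_setup).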

Attributes:

Name Type Description
encoder Module

mlp model that will receive the concatenation of the embeddings and the continuous columns

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabMlp
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ["a", "b", "c", "d", "e"]
>>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> model = TabMlp(mlp_hidden_dims=[8, 4], column_idx=column_idx, cat_embed_input=cat_embed_input,
... continuous_cols=["e"])
>>> out = model(X_tab)
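The encoder attribute receives the concatenation of the categorical embeddings and the continuous columns, so the input size of the first dense layer can be worked out by hand. A small sketch for the example above (assuming the continuous column "e" is used as-is, i.e. not embedded):

```python
# Four categorical columns ("a".."d"), each with 4 unique values,
# embedded into 8 dimensions; one continuous column used as-is.
cat_embed_input = [("a", 4, 8), ("b", 4, 8), ("c", 4, 8), ("d", 4, 8)]
n_continuous = 1

cat_out_dim = sum(embed_dim for _, _, embed_dim in cat_embed_input)  # 4 * 8 = 32
mlp_input_dim = cat_out_dim + n_continuous  # 32 + 1 = 33
```

With mlp_hidden_dims=[8, 4] the encoder MLP is then 33 -> 8 -> 4, and output_dim is 4.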
Source code in pytorch_widedeep/models/tabular/mlp/tab_mlp.py
class TabMlp(BaseTabularModelWithoutAttention):
    r"""Defines a `TabMlp` model that can be used as the `deeptabular`
    component of a Wide & Deep model or independently by itself.

    This class combines embedding representations of the categorical features
    with numerical (aka continuous) features, embedded or not. These are then
    passed through a series of dense layers (i.e. an MLP).

    Most of the parameters for this class are `Optional` since the use of
    categorical or continuous is in fact optional (i.e. one can use
    categorical features only, continuous features only or both).

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `TabMlp` model. Required to slice the tensors. e.g. _{'education':
        0, 'relationship': 1, 'workclass': 2, ...}_.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name, number of unique values and
        embedding dimension. e.g. _[(education, 11, 32), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    continuous_cols: List, Optional, default = None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer: str, Optional, default =  None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous: bool, Optional, default = None,
        Boolean indicating if the continuous columns will be embedded using
        one of the available methods: _'standard'_, _'periodic'_
        or _'piecewise'_. If `None`, it will default to 'False'.<br/>
        :information_source: **NOTE**: This parameter is deprecated and it
         will be removed in future releases. Please, use the
         `embed_continuous_method` parameter instead.
    embed_continuous_method: Optional, str, default = None,
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dim: int, Optional, default = None,
        Size of the continuous embeddings. If the continuous columns are
        embedded, `cont_embed_dim` must be passed.
    cont_embed_dropout: float, Optional, default = None,
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation: Optional, str, default = None,
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup: Dict[str, List[float]], Optional, default = None,
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details.
        If the _'piecewise'_ method is used, this parameter is required.
    n_frequencies: int, Optional, default = None,
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma: float, Optional, default = None,
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer: bool, Optional, default = None,
        This parameter is not present in the before mentioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    mlp_hidden_dims: List, default = [200, 100]
        List with the number of neurons per dense layer in the mlp.
    mlp_activation: str, default = "relu"
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    mlp_dropout: float or List, default = 0.1
        float or List of floats with the dropout between the dense layers.
        e.g: _[0.5,0.5]_
    mlp_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    mlp_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    mlp_linear_first: bool, default = True
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    encoder: nn.Module
        mlp model that will receive the concatenation of the embeddings and
        the continuous columns

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import TabMlp
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ["a", "b", "c", "d", "e"]
    >>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
    >>> column_idx = {k: v for v, k in enumerate(colnames)}
    >>> model = TabMlp(mlp_hidden_dims=[8, 4], column_idx=column_idx, cat_embed_input=cat_embed_input,
    ... continuous_cols=["e"])
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        column_idx: Dict[str, int],
        *,
        cat_embed_input: Optional[List[Tuple[str, int, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous: Optional[bool] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = None,
        cont_embed_dim: Optional[int] = None,
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        mlp_hidden_dims: List[int] = [200, 100],
        mlp_activation: str = "relu",
        mlp_dropout: Union[float, List[float]] = 0.1,
        mlp_batchnorm: bool = False,
        mlp_batchnorm_last: bool = False,
        mlp_linear_first: bool = True,
    ):
        super(TabMlp, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous=embed_continuous,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dim=cont_embed_dim,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        # Embeddings are instantiated at the base model
        # Mlp
        mlp_input_dim = self.cat_out_dim + self.cont_out_dim
        mlp_hidden_dims = [mlp_input_dim] + mlp_hidden_dims
        self.encoder = MLP(
            mlp_hidden_dims,
            mlp_activation,
            mlp_dropout,
            mlp_batchnorm,
            mlp_batchnorm_last,
            mlp_linear_first,
        )

    def forward(self, X: Tensor) -> Tensor:
        x = self._get_embeddings(X)
        return self.encoder(x)

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class"""
        return self.mlp_hidden_dims[-1]

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

TabMlpDecoder

Bases: Module

Companion decoder model for the TabMlp model (which can be considered an encoder itself).

This class is designed to be used with the EncoderDecoderTrainer when using self-supervised pre-training (see the corresponding section in the docs). The TabMlpDecoder will receive the output from the MLP and 'reconstruct' the embeddings.

Parameters:

Name Type Description Default
embed_dim int

Size of the embeddings tensor that needs to be reconstructed.

required
mlp_hidden_dims List[int]

List with the number of neurons per dense layer in the mlp.

[100, 200]
mlp_activation str

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
mlp_dropout Union[float, List[float]]

float or List of floats with the dropout between the dense layers. e.g: [0.5,0.5]

0.1
mlp_batchnorm bool

Boolean indicating whether or not batch normalization will be applied to the dense layers

False
mlp_batchnorm_last bool

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

False
mlp_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

True

Attributes:

Name Type Description
decoder Module

mlp model that will receive the output of the encoder

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabMlpDecoder
>>> x_inp = torch.rand(3, 8)
>>> decoder = TabMlpDecoder(embed_dim=32, mlp_hidden_dims=[8,16])
>>> res = decoder(x_inp)
>>> res.shape
torch.Size([3, 32])
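A natural way to pair the two models in an encoder-decoder setup is to mirror the encoder's hidden sizes in the decoder and set embed_dim to the size of the embedding tensor the encoder consumed. A small sketch matching the example above (sizes are illustrative):

```python
# Encoder side: 32-dim embeddings compressed through hidden layers [16, 8].
embed_dim = 32
encoder_hidden_dims = [16, 8]

# Mirrored decoder: takes the encoder's 8-dim output back up to 32 dims.
decoder_hidden_dims = encoder_hidden_dims[::-1]          # [8, 16]
decoder_layer_sizes = decoder_hidden_dims + [embed_dim]  # [8, 16, 32]
```

This matches the example, where TabMlpDecoder(embed_dim=32, mlp_hidden_dims=[8, 16]) maps a (3, 8) input to a (3, 32) reconstruction.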
Source code in pytorch_widedeep/models/tabular/mlp/tab_mlp.py
class TabMlpDecoder(nn.Module):
    r"""Companion decoder model for the `TabMlp` model (which can be considered
    an encoder itself).

    This class is designed to be used with the `EncoderDecoderTrainer` when
    using self-supervised pre-training (see the corresponding section in the
    docs). The `TabMlpDecoder` will receive the output from the MLP
    and '_reconstruct_' the embeddings.

    Parameters
    ----------
    embed_dim: int
        Size of the embeddings tensor that needs to be reconstructed.
    mlp_hidden_dims: List, default = [100, 200]
        List with the number of neurons per dense layer in the mlp.
    mlp_activation: str, default = "relu"
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    mlp_dropout: float or List, default = 0.1
        float or List of floats with the dropout between the dense layers.
        e.g: _[0.5,0.5]_
    mlp_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    mlp_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    mlp_linear_first: bool, default = True
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    decoder: nn.Module
        mlp model that will receive the output of the encoder

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import TabMlpDecoder
    >>> x_inp = torch.rand(3, 8)
    >>> decoder = TabMlpDecoder(embed_dim=32, mlp_hidden_dims=[8,16])
    >>> res = decoder(x_inp)
    >>> res.shape
    torch.Size([3, 32])
    """

    def __init__(
        self,
        embed_dim: int,
        mlp_hidden_dims: List[int] = [100, 200],
        mlp_activation: str = "relu",
        mlp_dropout: Union[float, List[float]] = 0.1,
        mlp_batchnorm: bool = False,
        mlp_batchnorm_last: bool = False,
        mlp_linear_first: bool = True,
    ):
        super(TabMlpDecoder, self).__init__()

        self.embed_dim = embed_dim

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        self.decoder = MLP(
            mlp_hidden_dims + [self.embed_dim],
            mlp_activation,
            mlp_dropout,
            mlp_batchnorm,
            mlp_batchnorm_last,
            mlp_linear_first,
        )

    def forward(self, X: Tensor) -> Tensor:
        return self.decoder(X)

TabResnet

Bases: BaseTabularModelWithoutAttention

Defines a TabResnet model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class combines embedding representations of the categorical features with numerical (aka continuous) features, embedded or not. These are then passed through a series of Resnet blocks. See pytorch_widedeep.models.tab_resnet._layers for details on the structure of each block.

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the TabMlp model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
cat_embed_input Optional[List[Tuple[str, int, int]]]

List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal[batchnorm, layernorm]]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous Optional[bool]

Boolean indicating if the continuous columns will be embedded using one of the available methods: 'standard', 'periodic' or 'piecewise'. If None, it will default to 'False'.
ℹ️ NOTE: This parameter is deprecated and it will be removed in future releases. Please, use the embed_continuous_method parameter instead.

None
embed_continuous_method Optional[Literal[standard, piecewise, periodic]]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.

None
cont_embed_dim Optional[int]

Size of the continuous embeddings. If the continuous columns are embedded, cont_embed_dim must be passed.

None
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the names of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
blocks_dims List[int]

List of integers that define the input and output units of each block. For example: [200, 100, 100] will generate 2 blocks. The first will receive a tensor of size 200 and output a tensor of size 100, and the second will receive a tensor of size 100 and output a tensor of size 100. See pytorch_widedeep.models.tab_resnet._layers for details on the structure of each block.

[200, 100, 100]
blocks_dropout float

Block's internal dropout.

0.1
simplify_blocks bool

Boolean indicating if the simplest possible residual blocks (X -> [ [LIN, BN, ACT] + X ]) will be used instead of a standard one (X -> [ [LIN1, BN1, ACT1] -> [LIN2, BN2] + X ]).

False
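The difference between the simplified and standard residual blocks can be sketched as follows (BatchNorm layers omitted for brevity, and the random weight shapes here assume equal input and output units, whereas the library projects the shortcut when blocks_dims changes size):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1 = rng.normal(size=(d, d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1

def relu(x):
    return np.maximum(x, 0.0)

def simplified_block(x):
    # X -> [ [LIN, ACT] + X ]   (BN between LIN and ACT omitted)
    return relu(x @ W1) + x

def standard_block(x):
    # X -> [ [LIN1, ACT1] -> [LIN2] + X ]   (BN layers omitted)
    return relu(x @ W1) @ W2 + x

x = rng.normal(size=(4, d))
out = standard_block(simplified_block(x))
```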
mlp_hidden_dims Optional[List[int]]

List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If None the output of the Resnet Blocks will be connected directly to the output neuron(s).

None
mlp_activation Optional[str]

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

None
mlp_dropout Optional[float]

float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

None
mlp_batchnorm Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_batchnorm_last Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_linear_first Optional[bool]

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

None

Attributes:

Name Type Description
encoder Module

deep dense Resnet model that will receive the concatenation of the embeddings and the continuous columns

mlp Module

if mlp_hidden_dims is not None, this attribute will be an mlp model that will receive the output of the Resnet blocks.

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabResnet
>>> X_deep = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabResnet(blocks_dims=[16,4], column_idx=column_idx, cat_embed_input=cat_embed_input,
... continuous_cols = ['e'])
>>> out = model(X_deep)
Source code in pytorch_widedeep/models/tabular/resnet/tab_resnet.py
class TabResnet(BaseTabularModelWithoutAttention):
    r"""Defines a `TabResnet` model that can be used as the `deeptabular`
    component of a Wide & Deep model or independently by itself.

    This class combines embedding representations of the categorical features
    with numerical (aka continuous) features, embedded or not. These are then
    passed through a series of Resnet blocks. See
    `pytorch_widedeep.models.tab_resnet._layers` for details on the
    structure of each block.

    Most of the parameters for this class are `Optional` since the use of
    categorical or continuous is in fact optional (i.e. one can use
    categorical features only, continuous features only or both).

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `TabResnet` model. Required to slice the tensors. e.g. _{'education':
        0, 'relationship': 1, 'workclass': 2, ...}_.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name, number of unique values and
        embedding dimension. e.g. _[(education, 11, 32), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    continuous_cols: List, Optional, default = None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer: str, Optional, default =  None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous: bool, Optional, default = None,
        Boolean indicating if the continuous columns will be embedded using
        one of the available methods: _'standard'_, _'periodic'_
        or _'piecewise'_. If `None`, it will default to 'False'.<br/>
        :information_source: **NOTE**: This parameter is deprecated and it
         will be removed in future releases. Please, use the
         `embed_continuous_method` parameter instead.
    embed_continuous_method: Optional, str, default = None,
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dim: int, Optional, default = None,
        Size of the continuous embeddings. If the continuous columns are
        embedded, `cont_embed_dim` must be passed.
    cont_embed_dropout: float, Optional, default = None,
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation: Optional, str, default = None,
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup: Dict[str, List[float]], Optional, default = None,
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If the
        _'piecewise'_ method is used, this parameter is required.
    n_frequencies: int, Optional, default = None,
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma: float, Optional, default = None,
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer: bool, Optional, default = None,
        This parameter is not present in the aforementioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    blocks_dims: List, default = [200, 100, 100]
        List of integers that define the input and output units of each block.
        For example: _[200, 100, 100]_ will generate 2 blocks. The first will
        receive a tensor of size 200 and output a tensor of size 100, and the
        second will receive a tensor of size 100 and output a tensor of size
        100. See `pytorch_widedeep.models.tab_resnet._layers` for
        details on the structure of each block.
    blocks_dropout: float, default =  0.1
        Block's internal dropout.
    simplify_blocks: bool, default = False,
        Boolean indicating if the simplest possible residual blocks (`X -> [
        [LIN, BN, ACT]  + X ]`) will be used instead of a standard one
        (`X -> [ [LIN1, BN1, ACT1] -> [LIN2, BN2]  + X ]`).
    mlp_hidden_dims: List, Optional, default = None
        List with the number of neurons per dense layer in the MLP. e.g:
        _[64, 32]_. If `None` the  output of the Resnet Blocks will be
        connected directly to the output neuron(s).
    mlp_activation: str, Optional, default = None
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to _'relu'_.
    mlp_dropout: float, Optional, default = None
        float with the dropout between the dense layers of the MLP.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to 0.0.
    mlp_batchnorm: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_batchnorm_last: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_linear_first: bool, Optional, default = None
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to `True`.

    Attributes
    ----------
    encoder: nn.Module
        deep dense Resnet model that will receive the concatenation of the
        embeddings and the continuous columns
    mlp: nn.Module
        if `mlp_hidden_dims` is not `None`, this attribute will be an mlp
        model that will receive the output of the Resnet blocks.

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import TabResnet
    >>> X_deep = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ['a', 'b', 'c', 'd', 'e']
    >>> cat_embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
    >>> column_idx = {k:v for v,k in enumerate(colnames)}
    >>> model = TabResnet(blocks_dims=[16,4], column_idx=column_idx, cat_embed_input=cat_embed_input,
    ... continuous_cols = ['e'])
    >>> out = model(X_deep)
    """

    def __init__(
        self,
        column_idx: Dict[str, int],
        *,
        cat_embed_input: Optional[List[Tuple[str, int, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous: Optional[bool] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = None,
        cont_embed_dim: Optional[int] = None,
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        blocks_dims: List[int] = [200, 100, 100],
        blocks_dropout: float = 0.1,
        simplify_blocks: bool = False,
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: Optional[str] = None,
        mlp_dropout: Optional[float] = None,
        mlp_batchnorm: Optional[bool] = None,
        mlp_batchnorm_last: Optional[bool] = None,
        mlp_linear_first: Optional[bool] = None,
    ):
        super(TabResnet, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous=embed_continuous,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dim=cont_embed_dim,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        if len(blocks_dims) < 2:
            raise ValueError(
                "'blocks' must contain at least two elements, e.g. [256, 128]"
            )

        self.blocks_dims = blocks_dims
        self.blocks_dropout = blocks_dropout
        self.simplify_blocks = simplify_blocks

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        # Embeddings are instantiated at the base model

        # Resnet
        dense_resnet_input_dim = self.cat_out_dim + self.cont_out_dim
        self.encoder = DenseResnet(
            dense_resnet_input_dim, blocks_dims, blocks_dropout, self.simplify_blocks
        )

        # Mlp: adding an MLP on top of the Resnet blocks is optional and
        # therefore all related params are optional
        if self.mlp_hidden_dims is not None:
            self.mlp = MLP(
                d_hidden=[self.blocks_dims[-1]] + self.mlp_hidden_dims,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    True if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
        else:
            self.mlp = None

    def forward(self, X: Tensor) -> Tensor:
        x = self._get_embeddings(X)
        x = self.encoder(x)
        if self.mlp is not None:
            x = self.mlp(x)
        return x

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return (
            self.mlp_hidden_dims[-1]
            if self.mlp_hidden_dims is not None
            else self.blocks_dims[-1]
        )

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

TabResnetDecoder

Bases: Module

Companion decoder model for the TabResnet model (which can be considered an encoder itself)

This class is designed to be used with the EncoderDecoderTrainer when using self-supervised pre-training (see the corresponding section in the docs). This class will receive the output from the ResNet blocks or the MLP (if present) and 'reconstruct' the embeddings.

Parameters:

Name Type Description Default
embed_dim int

Size of the embeddings tensor to be reconstructed.

required
blocks_dims List[int]

List of integers that define the input and output units of each block. For example: [200, 100, 100] will generate 2 blocks. The first will receive a tensor of size 200 and output a tensor of size 100, and the second will receive a tensor of size 100 and output a tensor of size 100. See pytorch_widedeep.models.tab_resnet._layers for details on the structure of each block.

[100, 100, 200]
blocks_dropout float

Block's internal dropout.

0.1
simplify_blocks bool

Boolean indicating if the simplest possible residual blocks (X -> [ [LIN, BN, ACT] + X ]) will be used instead of a standard one (X -> [ [LIN1, BN1, ACT1] -> [LIN2, BN2] + X ]).

False
mlp_hidden_dims Optional[List[int]]

List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If None the output of the Resnet Blocks will be connected directly to the output neuron(s).

None
mlp_activation Optional[str]

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

None
mlp_dropout Optional[float]

float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

None
mlp_batchnorm Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_batchnorm_last Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_linear_first Optional[bool]

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

None

Attributes:

Name Type Description
decoder Module

deep dense Resnet model that will receive the output of the encoder IF mlp_hidden_dims is None

mlp Module

if mlp_hidden_dims is not None, the overall decoder will consist of an MLP that will receive the output of the encoder, followed by the deep dense Resnet.

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabResnetDecoder
>>> x_inp = torch.rand(3, 8)
>>> decoder = TabResnetDecoder(embed_dim=32, blocks_dims=[8, 16, 16])
>>> res = decoder(x_inp)
>>> res.shape
torch.Size([3, 32])
Source code in pytorch_widedeep/models/tabular/resnet/tab_resnet.py
class TabResnetDecoder(nn.Module):
    r"""Companion decoder model for the `TabResnet` model (which can be
    considered an encoder itself)

    This class is designed to be used with the `EncoderDecoderTrainer` when
    using self-supervised pre-training (see the corresponding section in the
    docs). This class will receive the output from the ResNet blocks or the
    MLP (if present) and '_reconstruct_' the embeddings.

    Parameters
    ----------
    embed_dim: int
        Size of the embeddings tensor to be reconstructed.
    blocks_dims: List, default = [100, 100, 200]
        List of integers that define the input and output units of each block.
        For example: _[200, 100, 100]_ will generate 2 blocks. The first will
        receive a tensor of size 200 and output a tensor of size 100, and the
        second will receive a tensor of size 100 and output a tensor of size
        100. See `pytorch_widedeep.models.tab_resnet._layers` for
        details on the structure of each block.
    blocks_dropout: float, default =  0.1
        Block's internal dropout.
    simplify_blocks: bool, default = False,
        Boolean indicating if the simplest possible residual blocks (`X -> [
        [LIN, BN, ACT]  + X ]`) will be used instead of a standard one
        (`X -> [ [LIN1, BN1, ACT1] -> [LIN2, BN2]  + X ]`).
    mlp_hidden_dims: List, Optional, default = None
        List with the number of neurons per dense layer in the MLP. e.g:
        _[64, 32]_. If `None` the  output of the Resnet Blocks will be
        connected directly to the output neuron(s).
    mlp_activation: str, Optional, default = None
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to _'relu'_.
    mlp_dropout: float, Optional, default = None
        float with the dropout between the dense layers of the MLP.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to 0.0.
    mlp_batchnorm: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_batchnorm_last: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_linear_first: bool, Optional, default = None
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to `True`.

    Attributes
    ----------
    decoder: nn.Module
        deep dense Resnet model that will receive the output of the encoder IF
        `mlp_hidden_dims` is None
    mlp: nn.Module
        if `mlp_hidden_dims` is not None, the overall decoder will consist
        of an MLP that will receive the output of the encoder followed by the
        deep dense Resnet.

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import TabResnetDecoder
    >>> x_inp = torch.rand(3, 8)
    >>> decoder = TabResnetDecoder(embed_dim=32, blocks_dims=[8, 16, 16])
    >>> res = decoder(x_inp)
    >>> res.shape
    torch.Size([3, 32])
    """

    def __init__(
        self,
        embed_dim: int,
        blocks_dims: List[int] = [100, 100, 200],
        blocks_dropout: float = 0.1,
        simplify_blocks: bool = False,
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: Optional[str] = None,
        mlp_dropout: Optional[float] = None,
        mlp_batchnorm: Optional[bool] = None,
        mlp_batchnorm_last: Optional[bool] = None,
        mlp_linear_first: Optional[bool] = None,
    ):
        super(TabResnetDecoder, self).__init__()

        if len(blocks_dims) < 2:
            raise ValueError(
                "'blocks' must contain at least two elements, e.g. [256, 128]"
            )

        self.embed_dim = embed_dim

        self.blocks_dims = blocks_dims
        self.blocks_dropout = blocks_dropout
        self.simplify_blocks = simplify_blocks

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        if self.mlp_hidden_dims is not None:
            self.mlp = MLP(
                d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    True if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
            self.decoder = DenseResnet(
                self.mlp_hidden_dims[-1],
                blocks_dims,
                blocks_dropout,
                self.simplify_blocks,
            )
        else:
            self.mlp = None
            self.decoder = DenseResnet(
                blocks_dims[0], blocks_dims, blocks_dropout, self.simplify_blocks
            )

        self.reconstruction_layer = nn.Linear(blocks_dims[-1], embed_dim, bias=False)

    def forward(self, X: Tensor) -> Tensor:
        x = self.mlp(X) if self.mlp is not None else X
        return self.reconstruction_layer(self.decoder(x))

TabNet

Bases: BaseTabularModelWithoutAttention

Defines a TabNet model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

The implementation in this library is fully based on that here by the dreamquark-ai team, simply adapted so that it can work within the WideDeep framework. Therefore, ALL CREDIT TO THE DREAMQUARK-AI TEAM.

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the TabNet model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
cat_embed_input Optional[List[Tuple[str, int, int]]]

List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal[batchnorm, layernorm]]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous Optional[bool]

Boolean indicating if the continuous columns will be embedded using one of the available methods: 'standard', 'periodic' or 'piecewise'. If None, it will default to 'False'.
ℹ️ NOTE: This parameter is deprecated and it will be removed in future releases. Please, use the embed_continuous_method parameter instead.

None
embed_continuous_method Optional[Literal[standard, piecewise, periodic]]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

None
cont_embed_dim Optional[int]

Size of the continuous embeddings. If the continuous columns are embedded, cont_embed_dim must be passed.

None
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the names of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the aforementioned paper but it is implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False, a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
n_steps int

number of decision steps. For a better understanding of the function of n_steps and the upcoming parameters, please see the paper.

3
step_dim int

Step's output dimension. This is the output dimension that WideDeep will collect and connect to the output neuron(s).

8
attn_dim int

Attention dimension

8
dropout float

GLU block's internal dropout

0.0
n_glu_step_dependent int

number of GLU Blocks ([FC -> BN -> GLU]) that are step dependent

2
n_glu_shared int

number of GLU Blocks ([FC -> BN -> GLU]) that will be shared across decision steps

2
ghost_bn bool

Boolean indicating if Ghost Batch Normalization will be used.

True
virtual_batch_size int

Batch size when using Ghost Batch Normalization

128
momentum float

Ghost Batch Normalization's momentum. The dreamquark-ai team advises very low values; however, high values are used in the original publication. In our tests, higher values led to better results

0.02
gamma float

Relaxation parameter in the paper. When gamma = 1, a feature is enforced to be used only at one decision step. As gamma increases, more flexibility is provided to use a feature at multiple decision steps

1.3
epsilon float

Float to avoid log(0). Always keep low

1e-15
mask_type str

Mask function to use. Either 'sparsemax' or 'entmax'

'sparsemax'

Attributes:

Name Type Description
encoder Module

the TabNet encoder. For details see the original publication.

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabNet
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ["a", "b", "c", "d", "e"]
>>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> model = TabNet(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=["e"])
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/tabular/tabnet/tab_net.py
class TabNet(BaseTabularModelWithoutAttention):
    r"""Defines a [TabNet model](https://arxiv.org/abs/1908.07442) that
    can be used as the `deeptabular` component of a Wide & Deep model or
    independently by itself.

    The implementation in this library is fully based on that
    [here](https://github.com/dreamquark-ai/tabnet) by the dreamquark-ai team,
    simply adapted so that it can work within the `WideDeep` frame.
    Therefore, **ALL CREDIT TO THE DREAMQUARK-AI TEAM**.

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `TabNet` model. Required to slice the tensors. e.g. _{'education':
        0, 'relationship': 1, 'workclass': 2, ...}_.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name, number of unique values and
        embedding dimension. e.g. _[(education, 11, 32), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    continuous_cols: List, Optional, default = None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer: str, Optional, default =  None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous: bool, Optional, default = None,
        Boolean indicating if the continuous columns will be embedded using
        one of the available methods: _'standard'_, _'periodic'_
        or _'piecewise'_. If `None`, it will default to 'False'.<br/>
        :information_source: **NOTE**: This parameter is deprecated and it
         will be removed in future releases. Please, use the
         `embed_continuous_method` parameter instead.
    embed_continuous_method: Optional, str, default = None,
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dim: int, Optional, default = None,
        Size of the continuous embeddings. If the continuous columns are
        embedded, `cont_embed_dim` must be passed.
    cont_embed_dropout: float, Optional, default = None,
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation: Optional, str, default = None,
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup: Dict[str, List[float]], Optional, default = None,
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the names of the
        continuous columns and values are lists with the boundaries for the
        quantization of the continuous_cols. See the examples for details.
        If the _'piecewise'_ method is used, this parameter is required.
    n_frequencies: int, Optional, default = None,
        This is the so-called _'k'_ in the paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556):
        the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma: float, Optional, default = None,
        This is the sigma parameter in the aforementioned paper and is used
        to initialise the 'frequency weights'. See their Eq 2 in the paper
        for details. If the _'periodic'_ method is used, this parameter is
        required.
    share_last_layer: bool, Optional, default = None,
        This parameter is not present in the aforementioned paper but is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    n_steps: int, default = 3
        number of decision steps. For a better understanding of the function
        of `n_steps` and the upcoming parameters, please see the
        [paper](https://arxiv.org/abs/1908.07442).
    step_dim: int, default = 8
        Step's output dimension. This is the output dimension that
        `WideDeep` will collect and connect to the output neuron(s).
    attn_dim: int, default = 8
        Attention dimension
    dropout: float, default = 0.0
        GLU block's internal dropout
    n_glu_step_dependent: int, default = 2
        number of GLU Blocks (`[FC -> BN -> GLU]`) that are step dependent
    n_glu_shared: int, default = 2
        number of GLU Blocks (`[FC -> BN -> GLU]`) that will be shared
        across decision steps
    ghost_bn: bool, default=True
        Boolean indicating if [Ghost Batch Normalization](https://arxiv.org/abs/1705.08741)
        will be used.
    virtual_batch_size: int, default = 128
        Batch size when using Ghost Batch Normalization
    momentum: float, default = 0.02
        Ghost Batch Normalization's momentum. The dreamquark-ai team advises
        very low values; however, high values are used in the original
        publication. In our tests, higher values led to better results
    gamma: float, default = 1.3
        Relaxation parameter in the paper. When gamma = 1, a feature is
        enforced to be used only at one decision step. As gamma increases,
        more flexibility is provided to use a feature at multiple decision
        steps
    epsilon: float, default = 1e-15
        Float to avoid log(0). Always keep low
    mask_type: str, default = "sparsemax"
        Mask function to use. Either _'sparsemax'_ or _'entmax'_

    Attributes
    ----------
    encoder: nn.Module
        the TabNet encoder. For details see the [original publication](https://arxiv.org/abs/1908.07442).

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import TabNet
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ["a", "b", "c", "d", "e"]
    >>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
    >>> column_idx = {k: v for v, k in enumerate(colnames)}
    >>> model = TabNet(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=["e"])
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        column_idx: Dict[str, int],
        *,
        cat_embed_input: Optional[List[Tuple[str, int, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous: Optional[bool] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = None,
        cont_embed_dim: Optional[int] = None,
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        n_steps: int = 3,
        step_dim: int = 8,
        attn_dim: int = 8,
        dropout: float = 0.0,
        n_glu_step_dependent: int = 2,
        n_glu_shared: int = 2,
        ghost_bn: bool = True,
        virtual_batch_size: int = 128,
        momentum: float = 0.02,
        gamma: float = 1.3,
        epsilon: float = 1e-15,
        mask_type: str = "sparsemax",
    ):
        super(TabNet, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous=embed_continuous,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dim=cont_embed_dim,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.n_steps = n_steps
        self.step_dim = step_dim
        self.attn_dim = attn_dim
        self.dropout = dropout
        self.n_glu_step_dependent = n_glu_step_dependent
        self.n_glu_shared = n_glu_shared
        self.ghost_bn = ghost_bn
        self.virtual_batch_size = virtual_batch_size
        self.momentum = momentum
        self.gamma = gamma
        self.epsilon = epsilon
        self.mask_type = mask_type

        # Embeddings are instantiated at the base model
        self.embed_out_dim = self.cat_out_dim + self.cont_out_dim

        # TabNet
        self.encoder = TabNetEncoder(
            self.embed_out_dim,
            n_steps,
            step_dim,
            attn_dim,
            dropout,
            n_glu_step_dependent,
            n_glu_shared,
            ghost_bn,
            virtual_batch_size,
            momentum,
            gamma,
            epsilon,
            mask_type,
        )

    def forward(
        self, X: Tensor, prior: Optional[Tensor] = None
    ) -> Tuple[Tensor, Tensor]:
        x = self._get_embeddings(X)
        steps_output, M_loss = self.encoder(x, prior)
        res = torch.sum(torch.stack(steps_output, dim=0), dim=0)
        return (res, M_loss)

    def forward_masks(self, X: Tensor) -> Tuple[Tensor, Dict[int, Tensor]]:
        x = self._get_embeddings(X)
        return self.encoder.forward_masks(x)

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return self.step_dim

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

TabNetDecoder

Bases: Module

Companion decoder model for the TabNet model (which can be considered an encoder itself)

This class is designed to be used with the EncoderDecoderTrainer when using self-supervised pre-training (see the corresponding section in the docs). This class will receive the output from the TabNet encoder (i.e. the output from the so called 'steps') and 'reconstruct' the embeddings.

Parameters:

Name Type Description Default
embed_dim int

Size of the embeddings tensor to be reconstructed.

required
n_steps int

number of decision steps. For a better understanding of the function of n_steps and the upcoming parameters, please see the paper.

3
step_dim int

Step's output dimension. This is the output dimension that WideDeep will collect and connect to the output neuron(s).

8
dropout float

GLU block's internal dropout

0.0
n_glu_step_dependent int

number of GLU Blocks ([FC -> BN -> GLU]) that are step dependent

2
n_glu_shared int

number of GLU Blocks ([FC -> BN -> GLU]) that will be shared across decision steps

2
ghost_bn bool

Boolean indicating if Ghost Batch Normalization will be used.

True
virtual_batch_size int

Batch size when using Ghost Batch Normalization

128
momentum float

Ghost Batch Normalization's momentum. The dreamquark-ai team advises very low values; however, high values are used in the original publication. In our tests, higher values led to better results

0.02

Attributes:

Name Type Description
decoder Module

decoder that will receive the output from the encoder's steps and will reconstruct the embeddings

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabNetDecoder
>>> x_inp = [torch.rand(3, 8), torch.rand(3, 8), torch.rand(3, 8)]
>>> decoder = TabNetDecoder(embed_dim=32, ghost_bn=False)
>>> res = decoder(x_inp)
>>> res.shape
torch.Size([3, 32])
Source code in pytorch_widedeep/models/tabular/tabnet/tab_net.py
class TabNetDecoder(nn.Module):
    r"""Companion decoder model for the `TabNet` model (which can be
    considered an encoder itself)

    This class is designed to be used with the `EncoderDecoderTrainer` when
    using self-supervised pre-training (see the corresponding section in the
    docs). This class will receive the output from the `TabNet` encoder
    (i.e. the output from the so called 'steps') and '_reconstruct_' the
    embeddings.

    Parameters
    ----------
    embed_dim: int
        Size of the embeddings tensor to be reconstructed.
    n_steps: int, default = 3
        number of decision steps. For a better understanding of the function
        of `n_steps` and the upcoming parameters, please see the
        [paper](https://arxiv.org/abs/1908.07442).
    step_dim: int, default = 8
        Step's output dimension. This is the output dimension that
        `WideDeep` will collect and connect to the output neuron(s).
    dropout: float, default = 0.0
        GLU block's internal dropout
    n_glu_step_dependent: int, default = 2
        number of GLU Blocks (`[FC -> BN -> GLU]`) that are step dependent
    n_glu_shared: int, default = 2
        number of GLU Blocks (`[FC -> BN -> GLU]`) that will be shared
        across decision steps
    ghost_bn: bool, default=True
        Boolean indicating if [Ghost Batch Normalization](https://arxiv.org/abs/1705.08741)
        will be used.
    virtual_batch_size: int, default = 128
        Batch size when using Ghost Batch Normalization
    momentum: float, default = 0.02
        Ghost Batch Normalization's momentum. The dreamquark-ai team advises
        very low values; however, high values are used in the original
        publication. In our tests, higher values led to better results

    Attributes
    ----------
    decoder: nn.Module
        decoder that will receive the output from the encoder's steps and will
        reconstruct the embeddings

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import TabNetDecoder
    >>> x_inp = [torch.rand(3, 8), torch.rand(3, 8), torch.rand(3, 8)]
    >>> decoder = TabNetDecoder(embed_dim=32, ghost_bn=False)
    >>> res = decoder(x_inp)
    >>> res.shape
    torch.Size([3, 32])
    """

    def __init__(
        self,
        embed_dim: int,
        n_steps: int = 3,
        step_dim: int = 8,
        dropout: float = 0.0,
        n_glu_step_dependent: int = 2,
        n_glu_shared: int = 2,
        ghost_bn: bool = True,
        virtual_batch_size: int = 128,
        momentum: float = 0.02,
    ):
        super(TabNetDecoder, self).__init__()

        self.n_steps = n_steps
        self.step_dim = step_dim
        self.dropout = dropout
        self.n_glu_step_dependent = n_glu_step_dependent
        self.n_glu_shared = n_glu_shared
        self.ghost_bn = ghost_bn
        self.virtual_batch_size = virtual_batch_size
        self.momentum = momentum

        shared_layers = nn.ModuleList()
        for _ in range(n_glu_shared):
            shared_layers.append(nn.Linear(step_dim, 2 * step_dim, bias=False))

        self.decoder = nn.ModuleList()
        for step in range(n_steps):
            transformer = FeatTransformer(
                step_dim,
                step_dim,
                dropout,
                shared_layers,
                n_glu_step_dependent,
                ghost_bn,
                virtual_batch_size,
                momentum=momentum,
            )
            self.decoder.append(transformer)

        self.reconstruction_layer = nn.Linear(step_dim, embed_dim, bias=False)
        initialize_non_glu(self.reconstruction_layer, step_dim, embed_dim)

    def forward(self, X: List[Tensor]) -> Tensor:
        out = torch.tensor(0.0)
        for i, x in enumerate(X):
            x = self.decoder[i](x)
            out = torch.add(out, x)
        out = self.reconstruction_layer(out)
        return out

ContextAttentionMLP

Bases: BaseTabularModelWithAttention

Defines a ContextAttentionMLP model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class combines embedding representations of the categorical features with numerical (aka continuous) features that are also embedded. These are then passed through a series of attention blocks. Each attention block comprises a ContextAttentionEncoder, which is in part inspired by the attention mechanism described in Hierarchical Attention Networks for Document Classification. See pytorch_widedeep.models.tabular.mlp._attention_layers for details.

Most of the parameters for this class are Optional since the use of categorical or continuous features is in fact optional (i.e. one can use categorical features only, continuous features only or both).

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the ContextAttentionMLP model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
cat_embed_input Optional[List[Tuple[str, int]]]

List of Tuples with the column name and number of unique values. e.g. [(education, 11), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
shared_embed Optional[bool]

Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

None
add_shared_embed Optional[bool]

The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings. If 'None' is passed, it will default to 'False'.

None
frac_shared_embed Optional[float]

The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal[batchnorm, layernorm]]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal[standard, piecewise, periodic]]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

'standard'
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the names of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning: the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the aforementioned paper and is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the aforementioned paper but is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
input_dim int

The so-called dimension of the model. It is the number of embeddings used to encode the categorical and/or continuous columns

32
attn_dropout float

Dropout for each attention block

0.2
with_addnorm bool

Boolean indicating if residual connections will be used in the attention blocks

False
attn_activation str

String indicating the activation function to be applied to the dense layer in each attention encoder. 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported.

'leaky_relu'
n_blocks int

Number of attention blocks

3

Attributes:

Name Type Description
encoder Module

Sequence of attention encoders.

Examples:

>>> import torch
>>> from pytorch_widedeep.models import ContextAttentionMLP
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = ContextAttentionMLP(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols = ['e'])
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/tabular/mlp/context_attention_mlp.py
class ContextAttentionMLP(BaseTabularModelWithAttention):
    r"""Defines a `ContextAttentionMLP` model that can be used as the
    `deeptabular` component of a Wide & Deep model or independently by
    itself.

    This class combines embedding representations of the categorical features
    with numerical (aka continuous) features that are also embedded. These
    are then passed through a series of attention blocks. Each attention
    block comprises a `ContextAttentionEncoder`, which is in part inspired
    by the attention mechanism described in
    [Hierarchical Attention Networks for Document
    Classification](https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf).
    See `pytorch_widedeep.models.tabular.mlp._attention_layers` for details.

    Most of the parameters for this class are `Optional` since the use of
    categorical or continuous features is in fact optional (i.e. one can use
    categorical features only, continuous features only or both).

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `ContextAttentionMLP` model. Required to slice the tensors. e.g.
        _{'education': 0, 'relationship': 1, 'workclass': 2, ...}_.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name and number of unique values.
        e.g. _[(education, 11), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    shared_embed: bool, Optional, default = None
        Boolean indicating if the embeddings will be "shared". The idea behind `shared_embed` is
        described in the Appendix A in the [TabTransformer paper](https://arxiv.org/abs/2012.06678):
        _'The goal of having column embedding is to enable the model to
        distinguish the classes in one column from those in the other
        columns'_. In other words, the idea is to let the model learn which
        column is embedded at the time. See: `pytorch_widedeep.models.transformers._layers.SharedEmbeddings`.
    add_shared_embed: bool, Optional, default = None
        The two embedding sharing strategies are: 1) add the shared embeddings
        to the column embeddings or 2) replace the first
        `frac_shared_embed` with the shared embeddings.
        See `pytorch_widedeep.models.embeddings_layers.SharedEmbeddings`.
        If 'None' is passed, it will default to 'False'.
    frac_shared_embed: float, Optional, default = None
        The fraction of embeddings that will be shared (if `add_shared_embed
        = False`) by all the different categories for one particular
        column. If 'None' is passed, it will default to 0.0.
    continuous_cols: List, Optional, default = None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer: str, Optional, default =  None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional, str, default = None,
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout: float, Optional, default = None,
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation: Optional, str, default = None,
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup: Dict[str, List[float]], Optional, default = None,
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If
        the _'piecewise'_ method is used, this parameter is required.
    n_frequencies: int, Optional, default = None,
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma: float, Optional, default = None,
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer: bool, Optional, default = None,
        This parameter is not present in the aforementioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    input_dim: int, default = 32
        The so-called *dimension of the model*. Is the number of embeddings
        used to encode the categorical and/or continuous columns
    attn_dropout: float, default = 0.2
        Dropout for each attention block
    with_addnorm: bool = False,
        Boolean indicating if residual connections will be used in the
        attention blocks
    attn_activation: str, default = "leaky_relu"
        String indicating the activation function to be applied to the dense
        layer in each attention encoder. _'tanh'_, _'relu'_, _'leaky_relu'_
        and _'gelu'_ are supported.
    n_blocks: int, default = 3
        Number of attention blocks

    Attributes
    ----------
    encoder: nn.Module
        Sequence of attention encoders.

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import ContextAttentionMLP
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ['a', 'b', 'c', 'd', 'e']
    >>> cat_embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
    >>> column_idx = {k:v for v,k in enumerate(colnames)}
    >>> model = ContextAttentionMLP(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols = ['e'])
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        column_idx: Dict[str, int],
        *,
        cat_embed_input: Optional[List[Tuple[str, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        shared_embed: Optional[bool] = None,
        add_shared_embed: Optional[bool] = None,
        frac_shared_embed: Optional[float] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = "standard",
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        input_dim: int = 32,
        attn_dropout: float = 0.2,
        with_addnorm: bool = False,
        attn_activation: str = "leaky_relu",
        n_blocks: int = 3,
    ):
        super(ContextAttentionMLP, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=shared_embed,
            add_shared_embed=add_shared_embed,
            frac_shared_embed=frac_shared_embed,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            input_dim=input_dim,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.attn_dropout = attn_dropout
        self.with_addnorm = with_addnorm
        self.attn_activation = attn_activation
        self.n_blocks = n_blocks

        self.with_cls_token = "cls_token" in column_idx
        self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
        self.n_cont = len(continuous_cols) if continuous_cols is not None else 0

        # Embeddings are instantiated at the base model
        # Attention Blocks
        self.encoder = nn.Sequential()
        for i in range(n_blocks):
            self.encoder.add_module(
                "attention_block" + str(i),
                ContextAttentionEncoder(
                    input_dim,
                    attn_dropout,
                    with_addnorm,
                    attn_activation,
                ),
            )

    def forward(self, X: Tensor) -> Tensor:
        x = self._get_embeddings(X)
        x = self.encoder(x)
        if self.with_cls_token:
            out = x[:, 0, :]
        else:
            out = x.flatten(1)
        return out

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return (
            self.input_dim
            if self.with_cls_token
            else ((self.n_cat + self.n_cont) * self.input_dim)
        )

    @property
    def attention_weights(self) -> List[Tensor]:
        r"""List with the attention weights per block

        The shape of the attention weights is $(N, F)$, where $N$ is the batch
        size and $F$ is the number of features/columns in the dataset
        """
        return [blk.attn.attn_weights for blk in self.encoder]

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights per block

The shape of the attention weights is \((N, F)\), where \(N\) is the batch size and \(F\) is the number of features/columns in the dataset
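The \((N, F)\) shape follows from how context attention scores the columns: a single learned context vector produces one weight per feature. A minimal sketch of that idea in plain PyTorch (illustrative only; the layer names and the `tanh` scoring here are assumptions — see `pytorch_widedeep.models.tabular.mlp._attention_layers` for the actual encoder):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAttentionSketch(nn.Module):
    # A learned context vector scores each column, yielding one
    # attention weight per feature: shape (N, F).
    def __init__(self, input_dim: int):
        super().__init__()
        self.proj = nn.Linear(input_dim, input_dim)
        self.context = nn.Parameter(torch.randn(input_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, F, input_dim), one embedding per column
        scores = torch.tanh(self.proj(x)) @ self.context  # (N, F)
        attn_weights = F.softmax(scores, dim=-1)          # sums to 1 over F
        return attn_weights.unsqueeze(-1) * x             # re-weight columns

x = torch.randn(8, 5, 32)            # batch of 8, 5 columns, model dim 32
out = ContextAttentionSketch(32)(x)  # same shape as the input
```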

SelfAttentionMLP

Bases: BaseTabularModelWithAttention

Defines a SelfAttentionMLP model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class combines embedding representations of the categorical features with numerical (aka continuous) features that are also embedded. These are then passed through a series of attention blocks. Each attention block is comprised of what we would refer to as a simplified SelfAttentionEncoder. See pytorch_widedeep.models.tabular.mlp._attention_layers for details. The reason to use a simplified version of self-attention is that we observed that the 'standard' attention mechanism used in the TabTransformer has a notable tendency to overfit.

In more detail, this model only uses Q and K (and not V). If we think about it in terms of text (and intuitively), the Softmax(QK^T) is the attention mechanism that tells us how much, at each position in the input sentence, each word is represented or 'expressed'. We refer to that as 'attention weights'. These attention weights are normally multiplied by a Value matrix to further strengthen the focus on the words that each word should be attending to (again, intuitively).

In this implementation we skip this last multiplication and instead we multiply the attention weights directly by the input tensor. This is a simplification that we expect to be beneficial in terms of avoiding overfitting on tabular data.
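The Q/K-only mechanism described above can be sketched in plain PyTorch (an illustrative sketch, not the library's actual `SelfAttentionEncoder`; the class name, projections and scaling choice here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKOnlyAttention(nn.Module):
    # Simplified self-attention: no Value projection; the attention
    # weights are applied directly to the input tensor.
    def __init__(self, input_dim: int, use_bias: bool = False):
        super().__init__()
        self.q_proj = nn.Linear(input_dim, input_dim, bias=use_bias)
        self.k_proj = nn.Linear(input_dim, input_dim, bias=use_bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, F, input_dim), one embedding per column
        q, k = self.q_proj(x), self.k_proj(x)
        scores = q @ k.transpose(1, 2) / (x.shape[-1] ** 0.5)
        attn_weights = F.softmax(scores, dim=-1)  # (N, F, F)
        return attn_weights @ x  # skip V: weigh the input itself

x = torch.randn(8, 5, 32)       # batch of 8, 5 columns, model dim 32
out = QKOnlyAttention(32)(x)    # same shape as the input
```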

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the TabMlp model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
cat_embed_input Optional[List[Tuple[str, int]]]

List of Tuples with the column name and number of unique values and embedding dimension. e.g. [(education, 11), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
shared_embed Optional[bool]

Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

None
add_shared_embed Optional[bool]

The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) to replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings If 'None' is passed, it will default to 'False'.

None
frac_shared_embed Optional[float]

The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal[batchnorm, layernorm]]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal[standard, piecewise, periodic]]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

'standard'
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the aforementioned paper but it is implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False, a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
input_dim int

The so-called dimension of the model. Is the number of embeddings used to encode the categorical and/or continuous columns

32
attn_dropout float

Dropout for each attention block

0.2
n_heads int

Number of attention heads per attention block.

8
use_bias bool

Boolean indicating whether or not to use bias in the Q, K projection layers.

False
with_addnorm bool

Boolean indicating if residual connections will be used in the attention blocks

False
attn_activation str

String indicating the activation function to be applied to the dense layer in each attention encoder. 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported.

'leaky_relu'
n_blocks int

Number of attention blocks

3

Attributes:

Name Type Description
cat_and_cont_embed Module

This is the module that processes the categorical and continuous columns

encoder Module

Sequence of attention encoders.

Examples:

>>> import torch
>>> from pytorch_widedeep.models import SelfAttentionMLP
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = SelfAttentionMLP(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols = ['e'])
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/tabular/mlp/self_attention_mlp.py
class SelfAttentionMLP(BaseTabularModelWithAttention):
    r"""Defines a `SelfAttentionMLP` model that can be used as the
    deeptabular component of a Wide & Deep model or independently by
    itself.

    This class combines embedding representations of the categorical features
    with numerical (aka continuous) features that are also embedded. These
    are then passed through a series of attention blocks. Each attention
    block is comprised of what we would refer to as a simplified
    `SelfAttentionEncoder`. See
    `pytorch_widedeep.models.tabular.mlp._attention_layers` for details. The
    reason to use a simplified version of self-attention is that we
    observed that the '_standard_' attention mechanism used in the
    TabTransformer has a notable tendency to overfit.

    In more detail, this model only uses Q and K (and not V). If we think
    about it in terms of text (and intuitively), the Softmax(QK^T) is the
    attention mechanism that tells us how much, at each position in the input
    sentence, each word is represented or 'expressed'. We refer to that
    as 'attention weights'. These attention weights are normally multiplied
    by a Value matrix to further strengthen the focus on the words that each
    word should be attending to (again, intuitively).

    In this implementation we skip this last multiplication and instead we
    multiply the attention weights directly by the input tensor. This is a
    simplification that we expect to be beneficial in terms of avoiding
    overfitting for tabular data.

    Most of the parameters for this class are `Optional` since the use of
    categorical or continuous is in fact optional (i.e. one can use
    categorical features only, continuous features only or both).

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `TabMlp` model. Required to slice the tensors. e.g. _{'education':
        0, 'relationship': 1, 'workclass': 2, ...}_.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name and number of unique values and
        embedding dimension. e.g. _[(education, 11), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    shared_embed: bool, Optional, default = None
        Boolean indicating if the embeddings will be "shared". The idea behind `shared_embed` is
        described in the Appendix A in the [TabTransformer paper](https://arxiv.org/abs/2012.06678):
        _'The goal of having column embedding is to enable the model to
        distinguish the classes in one column from those in the other
        columns'_. In other words, the idea is to let the model learn which
        column is embedded at the time. See: `pytorch_widedeep.models.transformers._layers.SharedEmbeddings`.
    add_shared_embed: bool, Optional, default = None
        The two embedding sharing strategies are: 1) add the shared embeddings
        to the column embeddings or 2) to replace the first
        `frac_shared_embed` with the shared embeddings.
        See `pytorch_widedeep.models.embeddings_layers.SharedEmbeddings`
        If 'None' is passed, it will default to 'False'.
    frac_shared_embed: float, Optional, default = None
        The fraction of embeddings that will be shared (if `add_shared_embed
        = False`) by all the different categories for one particular
        column. If 'None' is passed, it will default to 0.0.
    continuous_cols: List, Optional, default = None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer: str, Optional, default =  None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. If `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional, str, default = 'standard',
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout: float, Optional, default = None,
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation: Optional, str, default = None,
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup: Dict[str, List[float]], Optional, default = None,
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If
        the _'piecewise'_ method is used, this parameter is required.
    n_frequencies: int, Optional, default = None,
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma: float, Optional, default = None,
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer: bool, Optional, default = None,
        This parameter is not present in the aforementioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    input_dim: int, default = 32
        The so-called *dimension of the model*. Is the number of
        embeddings used to encode the categorical and/or continuous columns
    attn_dropout: float, default = 0.2
        Dropout for each attention block
    n_heads: int, default = 8
        Number of attention heads per attention block.
    use_bias: bool, default = False
        Boolean indicating whether or not to use bias in the Q, K projection
        layers.
    with_addnorm: bool = False,
        Boolean indicating if residual connections will be used in the attention blocks
    attn_activation: str, default = "leaky_relu"
        String indicating the activation function to be applied to the dense
        layer in each attention encoder. _'tanh'_, _'relu'_, _'leaky_relu'_
        and _'gelu'_ are supported.
    n_blocks: int, default = 3
        Number of attention blocks

    Attributes
    ----------
    cat_and_cont_embed: nn.Module
        This is the module that processes the categorical and continuous columns
    encoder: nn.Module
        Sequence of attention encoders.

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import SelfAttentionMLP
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ['a', 'b', 'c', 'd', 'e']
    >>> cat_embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
    >>> column_idx = {k:v for v,k in enumerate(colnames)}
    >>> model = SelfAttentionMLP(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols = ['e'])
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        column_idx: Dict[str, int],
        *,
        cat_embed_input: Optional[List[Tuple[str, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        shared_embed: Optional[bool] = None,
        add_shared_embed: Optional[bool] = None,
        frac_shared_embed: Optional[float] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = "standard",
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        input_dim: int = 32,
        attn_dropout: float = 0.2,
        n_heads: int = 8,
        use_bias: bool = False,
        with_addnorm: bool = False,
        attn_activation: str = "leaky_relu",
        n_blocks: int = 3,
    ):
        super(SelfAttentionMLP, self).__init__(
            column_idx=column_idx,
            input_dim=input_dim,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=shared_embed,
            add_shared_embed=add_shared_embed,
            frac_shared_embed=frac_shared_embed,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.attn_dropout = attn_dropout
        self.n_heads = n_heads
        self.use_bias = use_bias
        self.with_addnorm = with_addnorm
        self.attn_activation = attn_activation
        self.n_blocks = n_blocks

        self.with_cls_token = "cls_token" in column_idx
        self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
        self.n_cont = len(continuous_cols) if continuous_cols is not None else 0

        # Embeddings are instantiated at the base model
        # Attention Blocks
        self.encoder = nn.Sequential()
        for i in range(n_blocks):
            self.encoder.add_module(
                "attention_block" + str(i),
                SelfAttentionEncoder(
                    input_dim,
                    attn_dropout,
                    use_bias,
                    n_heads,
                    with_addnorm,
                    attn_activation,
                ),
            )

    def forward(self, X: Tensor) -> Tensor:
        x = self._get_embeddings(X)
        x = self.encoder(x)
        if self.with_cls_token:
            out = x[:, 0, :]
        else:
            out = x.flatten(1)
        return out

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the WideDeep class
        """
        return (
            self.input_dim
            if self.with_cls_token
            else ((self.n_cat + self.n_cont) * self.input_dim)
        )

    @property
    def attention_weights(self) -> List[Tensor]:
        r"""List with the attention weights per block

        The shape of the attention weights is $(N, H, F, F)$, where $N$ is the
        batch size, $H$ is the number of attention heads and $F$ is the
        number of features/columns in the dataset
        """
        return [blk.attn.attn_weights for blk in self.encoder]

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights per block

The shape of the attention weights is \((N, H, F, F)\), where \(N\) is the batch size, \(H\) is the number of attention heads and \(F\) is the number of features/columns in the dataset
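The `output_dim` rule described above can be checked with a quick shape calculation (a sketch; the random tensor simply stands in for the encoder output):

```python
import torch

n_cat, n_cont, input_dim = 4, 1, 32
# Encoder output: one input_dim-sized embedding per column
x = torch.randn(8, n_cat + n_cont, input_dim)

# Without a cls_token the output is flattened, so
# output_dim = (n_cat + n_cont) * input_dim
flat = x.flatten(1)
assert flat.shape == (8, (n_cat + n_cont) * input_dim)

# With a cls_token only that token's embedding is kept,
# so output_dim = input_dim
cls_out = x[:, 0, :]
assert cls_out.shape == (8, input_dim)
```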

TabTransformer

Bases: BaseTabularModelWithAttention

Defines our adaptation of the TabTransformer model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

ℹ️ NOTE: This is an enhanced adaptation of the model described in the paper. It can be considered the flagship of our transformer family of models for tabular data and offers multiple additional features relative to the original publication (and to some other models in the library)

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the TabMlp model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
cat_embed_input Optional[List[Tuple[str, int]]]

List of Tuples with the column name and number of unique values and embedding dimension. e.g. [(education, 11), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
shared_embed Optional[bool]

Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

None
add_shared_embed Optional[bool]

The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) to replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings If 'None' is passed, it will default to 'False'.

None
frac_shared_embed Optional[float]

The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal[batchnorm, layernorm]]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal[standard, piecewise, periodic]]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

None
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
input_dim int

The so-called dimension of the model. Is the number of embeddings used to encode the categorical and/or continuous columns

32
n_heads int

Number of attention heads per Transformer block

8
use_qkv_bias bool

Boolean indicating whether or not to use bias in the Q, K, and V projection layers.

False
n_blocks int

Number of Transformer blocks

4
attn_dropout float

Dropout that will be applied to the Multi-Head Attention layers

0.2
ff_dropout float

Dropout that will be applied to the FeedForward network

0.1
ff_factor int

Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4.

4
transformer_activation str

Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

'gelu'
use_linear_attention bool

Boolean indicating if Linear Attention (from Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention) will be used. The inclusion of this mode of attention is inspired by this post, where the Uber team finds that this attention mechanism leads to the best results for their tabular data.

False
use_flash_attention bool

Boolean indicating if Flash Attention will be used.

False
mlp_hidden_dims Optional[List[int]]

List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If not provided no MLP on top of the final Transformer block will be used.

None
mlp_activation str

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

'relu'
mlp_dropout float

float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

0.1
mlp_batchnorm bool

Boolean indicating whether or not batch normalization will be applied to the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

False
mlp_batchnorm_last bool

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

False
mlp_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT] If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

True

Attributes:

Name Type Description
encoder Module

Sequence of Transformer blocks

mlp Module

MLP component in the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabTransformer
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabTransformer(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/tabular/transformers/tab_transformer.py
class TabTransformer(BaseTabularModelWithAttention):
    r"""Defines our adaptation of the
    [TabTransformer model](https://arxiv.org/abs/2012.06678)
    that can be used as the `deeptabular` component of a
    Wide & Deep model or independently by itself.

    Most of the parameters for this class are `Optional` since the use of
    categorical or continuous is in fact optional (i.e. one can use
    categorical features only, continuous features only or both).

    :information_source: **NOTE**:
    This is an enhanced adaptation of the model described in the paper. It can
    be considered as the flagship of our transformer family of models for
    tabular data and offers multiple additional features relative to the
    original publication (and some other models in the library).

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `TabMlp` model. Required to slice the tensors. e.g. _{'education':
        0, 'relationship': 1, 'workclass': 2, ...}_.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name and number of unique values,
        e.g. _[(education, 11), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    shared_embed: bool, Optional, default = None
        Boolean indicating if the embeddings will be "shared". The idea behind `shared_embed` is
        described in the Appendix A in the [TabTransformer paper](https://arxiv.org/abs/2012.06678):
        _'The goal of having column embedding is to enable the model to
        distinguish the classes in one column from those in the other
        columns'_. In other words, the idea is to let the model learn which
        column is embedded at the time. See: `pytorch_widedeep.models.transformers._layers.SharedEmbeddings`.
    add_shared_embed: bool, Optional, default = None
        The two embedding sharing strategies are: 1) add the shared embeddings
        to the column embeddings or 2) to replace the first
        `frac_shared_embed` with the shared embeddings.
        See `pytorch_widedeep.models.embeddings_layers.SharedEmbeddings`
        If 'None' is passed, it will default to 'False'.
    frac_shared_embed: float, Optional, default = None
        The fraction of embeddings that will be shared (if `add_shared_embed
        = False`) by all the different categories for one particular
        column. If 'None' is passed, it will default to 0.0.
    continuous_cols: List, Optional, default = None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer: str, Optional, default =  None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional, str, default = None,
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout: float, Optional, default = None,
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation: Optional, str, default = None,
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup: Dict[str, List[float]], Optional, default = None,
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details.
        If the _'piecewise'_ method is used, this parameter is required.
    n_frequencies: int, Optional, default = None,
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma: float, Optional, default = None,
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer: bool, Optional, default = None,
        This parameter is not present in the before mentioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    input_dim: int, default = 32
        The so-called *dimension of the model*. Is the number of
        embeddings used to encode the categorical and/or continuous columns
    n_heads: int, default = 8
        Number of attention heads per Transformer block
    use_qkv_bias: bool, default = False
        Boolean indicating whether or not to use bias in the Q, K, and V
        projection layers.
    n_blocks: int, default = 4
        Number of Transformer blocks
    attn_dropout: float, default = 0.2
        Dropout that will be applied to the Multi-Head Attention layers
    ff_dropout: float, default = 0.1
        Dropout that will be applied to the FeedForward network
    ff_factor: float, default = 4
        Multiplicative factor applied to the first layer of the FF network in
        each Transformer block. This is normally set to 4.
    transformer_activation: str, default = "gelu"
        Transformer Encoder activation function. _'tanh'_, _'relu'_,
        _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported
    use_linear_attention: Boolean, default = False,
        Boolean indicating if Linear Attention (from [Transformers are RNNs:
        Fast Autoregressive Transformers with Linear Attention](https://arxiv.org/abs/2006.16236))
        will be used. The inclusion of this mode of attention is inspired by
        [this post](https://www.uber.com/en-GB/blog/deepeta-how-uber-predicts-arrival-times/),
        where the Uber team finds that this attention mechanism leads to the
        best results for their tabular data.
    use_flash_attention: Boolean, default = False,
        Boolean indicating if
        [Flash Attention](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
        will be used. <br/>
    mlp_hidden_dims: List, Optional, default = None
        List with the number of neurons per dense layer in the MLP. e.g:
        _[64, 32]_. If not provided no MLP on top of the final
        Transformer block will be used.
    mlp_activation: str, Optional, default = None
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to _'relu'_.
    mlp_dropout: float, Optional, default = None
        float with the dropout between the dense layers of the MLP.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to 0.0.
    mlp_batchnorm: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_batchnorm_last: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_linear_first: bool, Optional, default = None
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to `True`.

    Attributes
    ----------
    encoder: nn.Module
        Sequence of Transformer blocks
    mlp: nn.Module
        MLP component in the model

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import TabTransformer
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ['a', 'b', 'c', 'd', 'e']
    >>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
    >>> continuous_cols = ['e']
    >>> column_idx = {k:v for v,k in enumerate(colnames)}
    >>> model = TabTransformer(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        column_idx: Dict[str, int],
        *,
        cat_embed_input: Optional[List[Tuple[str, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        shared_embed: Optional[bool] = None,
        add_shared_embed: Optional[bool] = None,
        frac_shared_embed: Optional[float] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous: Optional[bool] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = None,
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        input_dim: int = 32,
        n_heads: int = 8,
        use_qkv_bias: bool = False,
        n_blocks: int = 4,
        attn_dropout: float = 0.2,
        ff_dropout: float = 0.1,
        ff_factor: int = 4,
        transformer_activation: str = "gelu",
        use_linear_attention: bool = False,
        use_flash_attention: bool = False,
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: str = "relu",
        mlp_dropout: float = 0.1,
        mlp_batchnorm: bool = False,
        mlp_batchnorm_last: bool = False,
        mlp_linear_first: bool = True,
    ):
        super(TabTransformer, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=shared_embed,
            add_shared_embed=add_shared_embed,
            frac_shared_embed=frac_shared_embed,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            # UGLY hack to be able to use the base class __init__ method
            embed_continuous_method=(
                "standard"
                if embed_continuous_method is None
                else embed_continuous_method
            ),
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            input_dim=input_dim,
            full_embed_dropout=full_embed_dropout,
        )

        # we overwrite the embed_continuous_method parameter that was set
        # to 'standard' if embed_continuous_method is None. In other words,
        # this will already have the right value if the user provided a value
        # for embed_continuous_method, otherwise will be "standard" when it
        # should be None
        self.embed_continuous_method = embed_continuous_method

        if embed_continuous is not None:
            self.embed_continuous = embed_continuous
            if embed_continuous and embed_continuous_method is None:
                raise ValueError(
                    "If 'embed_continuous' is True, 'embed_continuous_method' must be "
                    "one of 'standard', 'piecewise' or 'periodic'."
                )
        elif self.embed_continuous_method is not None:
            self.embed_continuous = True
        else:
            self.embed_continuous = False

        self.n_heads = n_heads
        self.use_qkv_bias = use_qkv_bias
        self.n_blocks = n_blocks
        self.attn_dropout = attn_dropout
        self.ff_dropout = ff_dropout
        self.transformer_activation = transformer_activation
        self.use_linear_attention = use_linear_attention
        self.use_flash_attention = use_flash_attention
        self.ff_factor = ff_factor

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        self.with_cls_token = "cls_token" in column_idx
        self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
        self.n_cont = len(continuous_cols) if continuous_cols is not None else 0

        if self.n_cont and not self.n_cat and not self.embed_continuous:
            raise ValueError(
                "If only continuous features are used 'embed_continuous' must be set to 'True'"
            )

        # Embeddings are instantiated at the base model
        # Transformer blocks
        self.encoder = nn.Sequential()
        for i in range(n_blocks):
            self.encoder.add_module(
                "transformer_block" + str(i),
                TransformerEncoder(
                    input_dim,
                    n_heads,
                    use_qkv_bias,
                    attn_dropout,
                    ff_dropout,
                    ff_factor,
                    transformer_activation,
                    use_linear_attention,
                    use_flash_attention,
                ),
            )

        self.mlp_first_hidden_dim = self._mlp_first_hidden_dim()

        if self.mlp_hidden_dims is not None:
            self.mlp = MLP(
                d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    False if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
        else:
            self.mlp = None

    def forward(self, X: Tensor) -> Tensor:
        x_embed, x_cont_not_embed = self._get_embeddings_tt(X)
        x = self.encoder(x_embed)

        if self.with_cls_token:
            x = x[:, 0, :]
        else:
            x = x.flatten(1)

        if x_cont_not_embed is not None:
            x = torch.cat([x, x_cont_not_embed], 1)

        if self.mlp is not None:
            x = self.mlp(x)
        return x

    def _mlp_first_hidden_dim(self) -> int:
        if self.with_cls_token:
            if self.embed_continuous:
                attn_output_dim = self.input_dim
            else:
                attn_output_dim = self.input_dim + self.n_cont
        elif self.embed_continuous:
            attn_output_dim = (self.n_cat + self.n_cont) * self.input_dim
        else:
            attn_output_dim = self.n_cat * self.input_dim + self.n_cont

        return attn_output_dim

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return (
            self.mlp_hidden_dims[-1]
            if self.mlp_hidden_dims is not None
            else self.mlp_first_hidden_dim
        )

    @property
    def attention_weights(self) -> List[Tensor]:
        r"""List with the attention weights per block

        The shape of the attention weights is $(N, H, F, F)$, where $N$ is the
        batch size, $H$ is the number of attention heads and $F$ is the
        number of features/columns in the dataset

        :information_source: **NOTE**:
        if flash attention or linear attention
        are used, no attention weights are saved during the training process
        and calling this property will throw a ValueError
        """
        if self.use_flash_attention or self.use_linear_attention:
            raise ValueError(
                "Extraction of the attention weights is not supported for "
                "linear or flash attention"
            )

        return [blk.attn.attn_weights for blk in self.encoder]

    def _get_embeddings_tt(self, X: Tensor) -> Tuple[Tensor, Optional[Tensor]]:  # type: ignore[override]
        if self.n_cont and self.embed_continuous and self.n_cat:
            # there are continuous and categorical features and the continuous
            # features are embedded
            x_embed = self._get_embeddings(X)
            x_cont_not_embed = None
        elif self.n_cont and self.embed_continuous:
            # there are only continuous features and they are embedded
            x_embed = self.cont_norm((X[:, self.cont_idx].float()))
            x_embed = self.cont_embed(x_embed)
            x_cont_not_embed = None
        elif self.n_cont and self.n_cat:
            # there are continuous and categorical features but the continuous
            # features are not embedded
            x_embed = self.cat_embed(X)
            x_cont_not_embed = self.cont_norm((X[:, self.cont_idx].float()))
        elif self.n_cat:
            # there are only categorical features
            x_embed = self.cat_embed(X)
            x_cont_not_embed = None
        else:
            raise ValueError(
                "Using only continuous features that are not embedded is "
                "not supported in this implementation. Please set 'embed_continuous_method' "
                "to one of ['standard', 'piecewise', 'periodic']"
            )
        return x_embed, x_cont_not_embed

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights per block

The shape of the attention weights is \((N, H, F, F)\), where \(N\) is the batch size, \(H\) is the number of attention heads and \(F\) is the number of features/columns in the dataset

ℹ️ NOTE: if flash attention or linear attention are used, no attention weights are saved during the training process and calling this property will throw a ValueError

SAINT

Bases: BaseTabularModelWithAttention

Defines a SAINT model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

ℹ️ NOTE: This is a slightly modified and enhanced version of the model described in the paper.

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the TabMlp model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
cat_embed_input Optional[List[Tuple[str, int]]]

List of Tuples with the column name and number of unique values, e.g. [(education, 11), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
shared_embed Optional[bool]

Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

None
add_shared_embed Optional[bool]

The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) to replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings If 'None' is passed, it will default to 'False'.

None
frac_shared_embed Optional[float]

The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal[batchnorm, layernorm]]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal[standard, piecewise, periodic]]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

'standard'
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
input_dim int

The so-called dimension of the model. Is the number of embeddings used to encode the categorical and/or continuous columns

32
n_heads int

Number of attention heads per Transformer block

8
use_qkv_bias bool

Boolean indicating whether or not to use bias in the Q, K, and V projection layers

False
n_blocks int

Number of SAINT-Transformer blocks.

2
attn_dropout float

Dropout that will be applied to the Multi-Head Attention column and row layers

0.1
ff_dropout float

Dropout that will be applied to the FeedForward network

0.2
ff_factor int

Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4.

4
transformer_activation str

Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

'gelu'
mlp_hidden_dims Optional[List[int]]

List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If not provided no MLP on top of the final Transformer block will be used.

None
mlp_activation Optional[str]

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

None
mlp_dropout Optional[float]

float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

None
mlp_batchnorm Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the dense layers If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_batchnorm_last Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_linear_first Optional[bool]

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT] If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

None

Attributes:

Name Type Description
encoder Module

Sequence of SAINT-Transformer blocks

mlp Module

MLP component in the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import SAINT
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = SAINT(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/tabular/transformers/saint.py
class SAINT(BaseTabularModelWithAttention):
    r"""Defines a [SAINT model](https://arxiv.org/abs/2106.01342) that
    can be used as the `deeptabular` component of a Wide & Deep model or
    independently by itself.

    Most of the parameters for this class are `Optional` since the use of
    categorical or continuous is in fact optional (i.e. one can use
    categorical features only, continuous features only or both).

    :information_source: **NOTE**: This is a slightly modified and enhanced
     version of the model described in the paper.

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `TabMlp` model. Required to slice the tensors. e.g. _{'education':
        0, 'relationship': 1, 'workclass': 2, ...}_.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name and number of unique values and
        embedding dimension. e.g. _[(education, 11), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    shared_embed: bool, Optional, default = None
        Boolean indicating if the embeddings will be "shared". The idea behind `shared_embed` is
        described in the Appendix A in the [TabTransformer paper](https://arxiv.org/abs/2012.06678):
        _'The goal of having column embedding is to enable the model to
        distinguish the classes in one column from those in the other
        columns'_. In other words, the idea is to let the model learn which
        column is embedded at the time. See: `pytorch_widedeep.models.transformers._layers.SharedEmbeddings`.
    add_shared_embed: bool, Optional, default = None
        The two embedding sharing strategies are: 1) add the shared embeddings
        to the column embeddings or 2) to replace the first
        `frac_shared_embed` with the shared embeddings.
        See `pytorch_widedeep.models.embeddings_layers.SharedEmbeddings`
        If 'None' is passed, it will default to 'False'.
    frac_shared_embed: float, Optional, default = None
        The fraction of embeddings that will be shared (if `add_shared_embed
        = False`) by all the different categories for one particular
        column. If 'None' is passed, it will default to 0.0.
    continuous_cols: List, Optional, default = None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer: str, Optional, default =  None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional, str, default = None,
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout: float, Optional, default = None,
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation: Optional, str, default = None,
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup: Dict[str, List[float]], Optional, default = None,
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If the
        _'piecewise'_ method is used, this parameter is required.
    n_frequencies: int, Optional, default = None,
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma: float, Optional, default = None,
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer: bool, Optional, default = None,
        This parameter is not present in the before mentioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    input_dim: int, default = 32
        The so-called *dimension of the model*. Is the number of
        embeddings used to encode the categorical and/or continuous columns
    n_heads: int, default = 8
        Number of attention heads per Transformer block
    use_qkv_bias: bool, default = False
        Boolean indicating whether or not to use bias in the Q, K, and V
        projection layers
    n_blocks: int, default = 2
        Number of SAINT-Transformer blocks.
    attn_dropout: float, default = 0.1
        Dropout that will be applied to the Multi-Head Attention column and
        row layers
    ff_dropout: float, default = 0.2
        Dropout that will be applied to the FeedForward network
    ff_factor: int, default = 4
        Multiplicative factor applied to the first layer of the FF network in
        each Transformer block. This is normally set to 4.
    transformer_activation: str, default = "gelu"
        Transformer Encoder activation function. _'tanh'_, _'relu'_,
        _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported
    mlp_hidden_dims: List, Optional, default = None
        List with the number of neurons per dense layer in the MLP. e.g:
        _[64, 32]_. If not provided no MLP on top of the final
        Transformer block will be used.
    mlp_activation: str, Optional, default = None
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to _'relu'_.
    mlp_dropout: float, Optional, default = None
        float with the dropout between the dense layers of the MLP.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to 0.0.
    mlp_batchnorm: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_batchnorm_last: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_linear_first: bool, Optional, default = None
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to `True`.

    Attributes
    ----------
    encoder: nn.Module
        Sequence of SAINT-Transformer blocks
    mlp: nn.Module
        MLP component in the model

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import SAINT
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ['a', 'b', 'c', 'd', 'e']
    >>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
    >>> continuous_cols = ['e']
    >>> column_idx = {k:v for v,k in enumerate(colnames)}
    >>> model = SAINT(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        column_idx: Dict[str, int],
        *,
        cat_embed_input: Optional[List[Tuple[str, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        shared_embed: Optional[bool] = None,
        add_shared_embed: Optional[bool] = None,
        frac_shared_embed: Optional[float] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = "standard",
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        input_dim: int = 32,
        use_qkv_bias: bool = False,
        n_heads: int = 8,
        n_blocks: int = 2,
        attn_dropout: float = 0.1,
        ff_dropout: float = 0.2,
        ff_factor: int = 4,
        transformer_activation: str = "gelu",
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: Optional[str] = None,
        mlp_dropout: Optional[float] = None,
        mlp_batchnorm: Optional[bool] = None,
        mlp_batchnorm_last: Optional[bool] = None,
        mlp_linear_first: Optional[bool] = None,
    ):
        super(SAINT, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=shared_embed,
            add_shared_embed=add_shared_embed,
            frac_shared_embed=frac_shared_embed,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            input_dim=input_dim,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.use_qkv_bias = use_qkv_bias
        self.n_heads = n_heads
        self.n_blocks = n_blocks
        self.attn_dropout = attn_dropout
        self.ff_dropout = ff_dropout
        self.ff_factor = ff_factor
        self.transformer_activation = transformer_activation

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        self.with_cls_token = "cls_token" in column_idx
        self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
        self.n_cont = len(continuous_cols) if continuous_cols is not None else 0
        self.n_feats = self.n_cat + self.n_cont

        # Embeddings are instantiated at the base model
        # Transformer blocks
        self.encoder = nn.Sequential()
        for i in range(n_blocks):
            self.encoder.add_module(
                "saint_block" + str(i),
                SaintEncoder(
                    input_dim,
                    n_heads,
                    use_qkv_bias,
                    attn_dropout,
                    ff_dropout,
                    ff_factor,
                    transformer_activation,
                    self.n_feats,
                ),
            )

        self.mlp_first_hidden_dim = (
            self.input_dim if self.with_cls_token else (self.n_feats * self.input_dim)
        )

        # Mlp: adding an MLP on top of the Resnet blocks is optional and
        # therefore all related params are optional
        if self.mlp_hidden_dims is not None:
            self.mlp = MLP(
                d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    False if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
        else:
            self.mlp = None

    def forward(self, X: Tensor) -> Tensor:
        x = self._get_embeddings(X)
        x = self.encoder(x)
        if self.with_cls_token:
            x = x[:, 0, :]
        else:
            x = x.flatten(1)
        if self.mlp is not None:
            x = self.mlp(x)
        return x

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return (
            self.mlp_hidden_dims[-1]
            if self.mlp_hidden_dims is not None
            else self.mlp_first_hidden_dim
        )

    @property
    def attention_weights(self) -> List[Tuple[Tensor, Tensor]]:
        r"""List with the attention weights. Each element of the list is a tuple
        where the first and the second elements are the column and row
        attention weights respectively

        The shape of the attention weights is:

        - column attention: $(N, H, F, F)$

        - row attention: $(1, H, N, N)$

        where $N$ is the batch size, $H$ is the number of heads and $F$ is the
        number of features/columns in the dataset
        """
        attention_weights = []
        for blk in self.encoder:
            attention_weights.append(
                (blk.col_attn.attn_weights, blk.row_attn.attn_weights)
            )
        return attention_weights

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights. Each element of the list is a tuple where the first and the second elements are the column and row attention weights respectively

The shape of the attention weights is:

  • column attention: \((N, H, F, F)\)

  • row attention: \((1, H, N, N)\)

where \(N\) is the batch size, \(H\) is the number of heads and \(F\) is the number of features/columns in the dataset

FTTransformer

Bases: BaseTabularModelWithAttention

Defines a FTTransformer model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the TabMlp model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
cat_embed_input Optional[List[Tuple[str, int]]]

List of Tuples with the column name and number of unique values and embedding dimension. e.g. [(education, 11), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
shared_embed Optional[bool]

Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

None
add_shared_embed Optional[bool]

The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) to replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings If 'None' is passed, it will default to 'False'.

None
frac_shared_embed Optional[float]

The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal[batchnorm, layernorm]]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal[standard, piecewise, periodic]]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

'standard'
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
input_dim int

The so-called dimension of the model. Is the number of embeddings used to encode the categorical and/or continuous columns.

64
kv_compression_factor float

By default, the FTTransformer uses Linear Attention (see Linformer: Self-Attention with Linear Complexity). This is the compression factor used to reduce the input sequence length: the resulting sequence length is \(k = int(kv_{compression \space factor} \times s)\), where \(s\) is the input sequence length.

0.5
kv_sharing bool

Boolean indicating if the \(E\) and \(F\) projection matrices will share weights. See Linformer: Self-Attention with Linear Complexity for details

False
n_heads int

Number of attention heads per FTTransformer block

8
use_qkv_bias bool

Boolean indicating whether or not to use bias in the Q, K, and V projection layers

False
n_blocks int

Number of FTTransformer blocks

4
attn_dropout float

Dropout that will be applied to the Linear-Attention layers

0.2
ff_dropout float

Dropout that will be applied to the FeedForward network

0.1
ff_factor float

Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4, but the FT-Transformer paper uses 4/3.

1.33
transformer_activation str

Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

'reglu'
mlp_hidden_dims Optional[List[int]]

List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If not provided no MLP on top of the final FTTransformer block will be used.

None
mlp_activation Optional[str]

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

None
mlp_dropout Optional[float]

float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

None
mlp_batchnorm Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the dense layers If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_batchnorm_last Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_linear_first Optional[bool]

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT] If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

None

Attributes:

Name Type Description
encoder Module

Sequence of FTTransformer blocks

mlp Module

MLP component in the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import FTTransformer
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = FTTransformer(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/tabular/transformers/ft_transformer.py
class FTTransformer(BaseTabularModelWithAttention):
    r"""Defines a [FTTransformer model](https://arxiv.org/abs/2106.11959) that
    can be used as the `deeptabular` component of a Wide & Deep model or
    independently by itself.

    Most of the parameters for this class are `Optional` since the use of
    categorical or continuous features is in fact optional (i.e. one can use
    categorical features only, continuous features only or both).

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the model. Required to slice the tensors. e.g. _{'education':
        0, 'relationship': 1, 'workclass': 2, ...}_.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name and number of unique values and
        embedding dimension. e.g. _[(education, 11), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    shared_embed: bool, Optional, default = None
        Boolean indicating if the embeddings will be "shared". The idea behind `shared_embed` is
        described in the Appendix A in the [TabTransformer paper](https://arxiv.org/abs/2012.06678):
        _'The goal of having column embedding is to enable the model to
        distinguish the classes in one column from those in the other
        columns'_. In other words, the idea is to let the model learn which
        column is embedded at the time. See: `pytorch_widedeep.models.transformers._layers.SharedEmbeddings`.
    add_shared_embed: bool, Optional, default = None
        The two embedding sharing strategies are: 1) add the shared embeddings
        to the column embeddings or 2) to replace the first
        `frac_shared_embed` with the shared embeddings.
        See `pytorch_widedeep.models.embeddings_layers.SharedEmbeddings`
        If 'None' is passed, it will default to 'False'.
    frac_shared_embed: float, Optional, default = None
        The fraction of embeddings that will be shared (if `add_shared_embed
        = False`) by all the different categories for one particular
        column. If 'None' is passed, it will default to 0.0.
    continuous_cols: List, Optional, default = None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer: str, Optional, default =  None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional, str, default = None,
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout: float, Optional, default = None,
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation: Optional, str, default = None,
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup: Dict[str, List[float]], Optional, default = None,
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If the
        _'piecewise'_ method is used, this parameter is required.
    n_frequencies: int, Optional, default = None,
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma: float, Optional, default = None,
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer: bool, Optional, default = None,
        This parameter is not present in the aforementioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    input_dim: int, default = 64
        The so-called *dimension of the model*. It is the number of embeddings
        used to encode the categorical and/or continuous columns.
    kv_compression_factor: float, default = 0.5
        By default, the FTTransformer uses Linear Attention
        (See [Linformer: Self-Attention with Linear Complexity](https://arxiv.org/abs/2006.04768)).
        This is the compression factor used to reduce the input sequence
        length: the resulting sequence length is
        $k = int(kv_{compression \space factor} \times s)$,
        where $s$ is the input sequence length.
    kv_sharing: bool, default = False
        Boolean indicating if the $E$ and $F$ projection matrices
        will share weights.  See [Linformer: Self-Attention with Linear
        Complexity](https://arxiv.org/abs/2006.04768) for details
    n_heads: int, default = 8
        Number of attention heads per FTTransformer block
    use_qkv_bias: bool, default = False
        Boolean indicating whether or not to use bias in the Q, K, and V
        projection layers
    n_blocks: int, default = 4
        Number of FTTransformer blocks
    attn_dropout: float, default = 0.2
        Dropout that will be applied to the Linear-Attention layers
    ff_dropout: float, default = 0.1
        Dropout that will be applied to the FeedForward network
    ff_factor: float, default = 4 / 3
        Multiplicative factor applied to the first layer of the FF network in
        each Transformer block. This is normally set to 4, but the authors use
        4/3 in the paper.
    transformer_activation: str, default = "reglu"
        Transformer Encoder activation function. _'tanh'_, _'relu'_,
        _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported
    mlp_hidden_dims: List, Optional, default = None
        List with the number of neurons per dense layer in the MLP. e.g:
        _[64, 32]_. If not provided no MLP on top of the final
        FTTransformer block will be used.
    mlp_activation: str, Optional, default = None
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to _'relu'_.
    mlp_dropout: float, Optional, default = None
        float with the dropout between the dense layers of the MLP.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to 0.0.
    mlp_batchnorm: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_batchnorm_last: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_linear_first: bool, Optional, default = None
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to `False`.

    Attributes
    ----------
    encoder: nn.Module
        Sequence of FTTransformer blocks
    mlp: nn.Module
        MLP component in the model

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import FTTransformer
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ['a', 'b', 'c', 'd', 'e']
    >>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
    >>> continuous_cols = ['e']
    >>> column_idx = {k:v for v,k in enumerate(colnames)}
    >>> model = FTTransformer(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        column_idx: Dict[str, int],
        *,
        cat_embed_input: Optional[List[Tuple[str, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        shared_embed: Optional[bool] = None,
        add_shared_embed: Optional[bool] = None,
        frac_shared_embed: Optional[float] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = "standard",
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        input_dim: int = 64,
        kv_compression_factor: float = 0.5,
        kv_sharing: bool = False,
        use_qkv_bias: bool = False,
        n_heads: int = 8,
        n_blocks: int = 4,
        attn_dropout: float = 0.2,
        ff_dropout: float = 0.1,
        ff_factor: float = 1.33,
        transformer_activation: str = "reglu",
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: Optional[str] = None,
        mlp_dropout: Optional[float] = None,
        mlp_batchnorm: Optional[bool] = None,
        mlp_batchnorm_last: Optional[bool] = None,
        mlp_linear_first: Optional[bool] = None,
    ):
        super(FTTransformer, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=shared_embed,
            add_shared_embed=add_shared_embed,
            frac_shared_embed=frac_shared_embed,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            input_dim=input_dim,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.kv_compression_factor = kv_compression_factor
        self.kv_sharing = kv_sharing
        self.use_qkv_bias = use_qkv_bias
        self.n_heads = n_heads
        self.n_blocks = n_blocks
        self.attn_dropout = attn_dropout
        self.ff_dropout = ff_dropout
        self.ff_factor = ff_factor
        self.transformer_activation = transformer_activation

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        self.with_cls_token = "cls_token" in column_idx
        self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
        self.n_cont = len(continuous_cols) if continuous_cols is not None else 0
        self.n_feats = self.n_cat + self.n_cont

        # Embeddings are instantiated at the base model
        # Transformer blocks
        is_first = True
        self.encoder = nn.Sequential()
        for i in range(n_blocks):
            self.encoder.add_module(
                "fttransformer_block" + str(i),
                FTTransformerEncoder(
                    input_dim,
                    self.n_feats,
                    n_heads,
                    use_qkv_bias,
                    attn_dropout,
                    ff_dropout,
                    ff_factor,
                    kv_compression_factor,
                    kv_sharing,
                    transformer_activation,
                    is_first,
                ),
            )
            is_first = False

        self.mlp_first_hidden_dim = (
            self.input_dim if self.with_cls_token else (self.n_feats * self.input_dim)
        )

        # Mlp: adding an MLP on top of the FTTransformer blocks is optional and
        # therefore all related params are optional
        if self.mlp_hidden_dims is not None:
            self.mlp = MLP(
                d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    False if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
        else:
            self.mlp = None

    def forward(self, X: Tensor) -> Tensor:
        x = self._get_embeddings(X)
        x = self.encoder(x)
        if self.with_cls_token:
            x = x[:, 0, :]
        else:
            x = x.flatten(1)
        if self.mlp is not None:
            x = self.mlp(x)
        return x

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return (
            self.mlp_hidden_dims[-1]
            if self.mlp_hidden_dims is not None
            else self.mlp_first_hidden_dim
        )

    @property
    def attention_weights(self) -> List[Tensor]:
        r"""List with the attention weights per block

        The shape of the attention weights is: $(N, H, F, k)$, where $N$ is
        the batch size, $H$ is the number of attention heads, $F$ is the
        number of features/columns and $k$ is the reduced sequence length or
        dimension, i.e. $k = int(kv_{compression \space factor} \times s)$
        """
        return [blk.attn.attn_weights for blk in self.encoder]

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class
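As an illustrative sketch of how that dimension is resolved, the logic of the property can be written in plain Python (the function name and `with_cls_token` flag here are illustrative, not library API):

```python
# Illustrative sketch (not library API): how FTTransformer's output_dim
# is resolved. With an MLP on top, it is the last MLP layer size;
# otherwise it is the flattened encoder output (n_feats * input_dim),
# or just input_dim when a CLS token is used.
def output_dim(mlp_hidden_dims, n_feats, input_dim, with_cls_token=False):
    first_hidden = input_dim if with_cls_token else n_feats * input_dim
    return mlp_hidden_dims[-1] if mlp_hidden_dims is not None else first_hidden

print(output_dim(None, n_feats=5, input_dim=64))      # 320 (no MLP, no CLS token)
print(output_dim([64, 32], n_feats=5, input_dim=64))  # 32 (last MLP layer)
```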

attention_weights property

attention_weights

List with the attention weights per block

The shape of the attention weights is: \((N, H, F, k)\), where \(N\) is the batch size, \(H\) is the number of attention heads, \(F\) is the number of features/columns and \(k\) is the reduced sequence length or dimension, i.e. \(k = int(kv_{compression \space factor} \times s)\)
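The compressed length \(k\) can be sketched with a small helper (illustrative arithmetic only; the name is not part of the library):

```python
# Linformer-style compression used by FTTransformer's linear attention:
# the key/value sequence length s is projected down to
# k = int(kv_compression_factor * s)  -- illustrative helper, not library API
def compressed_len(s: int, kv_compression_factor: float = 0.5) -> int:
    return int(kv_compression_factor * s)

# e.g. 5 features (4 categorical + 1 continuous) with the default factor
print(compressed_len(5))        # 2
print(compressed_len(5, 0.25))  # 1
```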

TabPerceiver

Bases: BaseTabularModelWithAttention

Defines an adaptation of a Perceiver that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

Most of the parameters for this class are Optional since the use of categorical or continuous features is in fact optional (i.e. one can use categorical features only, continuous features only or both).

ℹ️ NOTE: while there are scientific publications for the TabTransformer, SAINT and FTTransformer, the TabPerceiver and the TabFastFormer are our own adaptations of the Perceiver and the FastFormer for tabular data.

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
cat_embed_input Optional[List[Tuple[str, int]]]

List of Tuples with the column name and number of unique values and embedding dimension. e.g. [(education, 11), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
shared_embed Optional[bool]

Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

None
add_shared_embed Optional[bool]

The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) to replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings If 'None' is passed, it will default to 'False'.

None
frac_shared_embed Optional[float]

The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal[batchnorm, layernorm]]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal[standard, piecewise, periodic]]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

'standard'
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the aforementioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
input_dim int

The so-called dimension of the model. It is the number of embeddings used to encode the categorical and/or continuous columns.

32
n_cross_attns int

Number of times each perceiver block will cross attend to the input data (i.e. number of cross attention components per perceiver block). This should normally be 1. However, in the paper they describe some architectures (normally computer vision-related problems) where the Perceiver attends multiple times to the input array. Therefore, maybe multiple cross attention to the input array is also useful in some cases for tabular data 🤷 .

1
n_cross_attn_heads int

Number of attention heads for the cross attention component

4
n_latents int

Number of latents. This is the \(N\) parameter in the paper. As indicated in the paper, this number should be significantly lower than \(M\) (the number of columns in the dataset). Setting \(N\) closer to \(M\) defies the main purpose of the Perceiver, which is to overcome the transformer quadratic bottleneck

16
latent_dim int

Latent dimension.

128
n_latent_heads int

Number of attention heads per Latent Transformer

4
n_latent_blocks int

Number of transformer encoder blocks (normalised MHA + normalised FF) per Latent Transformer

4
n_perceiver_blocks int

Number of Perceiver blocks defined as [Cross Attention + Latent Transformer]

4
share_weights bool

Boolean indicating if the weights will be shared between Perceiver blocks

False
attn_dropout float

Dropout that will be applied to the Multi-Head Attention layers

0.1
ff_dropout float

Dropout that will be applied to the FeedForward network

0.1
ff_factor int

Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4.

4
transformer_activation str

Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

'geglu'
mlp_hidden_dims Optional[List[int]]

List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If not provided no MLP on top of the final Transformer block will be used.

None
mlp_activation Optional[str]

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

None
mlp_dropout Optional[float]

float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

None
mlp_batchnorm Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the dense layers If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_batchnorm_last Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_linear_first Optional[bool]

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

None

Attributes:

Name Type Description
encoder ModuleDict

ModuleDict with the Perceiver blocks

latents Parameter

Latents that will be used for prediction

mlp Module

MLP component in the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabPerceiver
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabPerceiver(column_idx=column_idx, cat_embed_input=cat_embed_input,
... continuous_cols=continuous_cols, n_latents=2, latent_dim=16,
... n_perceiver_blocks=2)
>>> out = model(X_tab)
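The advice above to keep the number of latents well below the number of columns can be sketched with a quick attention-cost comparison (illustrative arithmetic only; the helper names are not library API):

```python
# Perceiver motivation: cross-attention between M columns and N latents
# scales with M * N, while full self-attention over the columns scales
# with M ** 2. Keeping N << M avoids the quadratic bottleneck.
def self_attn_cost(m_cols: int) -> int:
    return m_cols ** 2

def cross_attn_cost(m_cols: int, n_latents: int) -> int:
    return m_cols * n_latents

print(self_attn_cost(100))       # 10000
print(cross_attn_cost(100, 16))  # 1600
```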
Source code in pytorch_widedeep/models/tabular/transformers/tab_perceiver.py
class TabPerceiver(BaseTabularModelWithAttention):
    r"""Defines an adaptation of a [Perceiver](https://arxiv.org/abs/2103.03206)
     that can be used as the `deeptabular` component of a Wide & Deep model
     or independently by itself.

    Most of the parameters for this class are `Optional` since the use of
    categorical or continuous features is in fact optional (i.e. one can use
    categorical features only, continuous features only or both).

    :information_source: **NOTE**: while there are scientific publications for
     the `TabTransformer`, `SAINT` and `FTTransformer`, the `TabPerceiver`
     and the `TabFastFormer` are our own adaptations of the
     [Perceiver](https://arxiv.org/abs/2103.03206) and the
     [FastFormer](https://arxiv.org/abs/2108.09084) for tabular data.

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the model. Required to slice the tensors. e.g. _{'education':
        0, 'relationship': 1, 'workclass': 2, ...}_.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name and number of unique values and
        embedding dimension. e.g. _[(education, 11), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    shared_embed: bool, Optional, default = None
        Boolean indicating if the embeddings will be "shared". The idea behind `shared_embed` is
        described in the Appendix A in the [TabTransformer paper](https://arxiv.org/abs/2012.06678):
        _'The goal of having column embedding is to enable the model to
        distinguish the classes in one column from those in the other
        columns'_. In other words, the idea is to let the model learn which
        column is embedded at the time. See: `pytorch_widedeep.models.transformers._layers.SharedEmbeddings`.
    add_shared_embed: bool, Optional, default = None
        The two embedding sharing strategies are: 1) add the shared embeddings
        to the column embeddings or 2) to replace the first
        `frac_shared_embed` with the shared embeddings.
        See `pytorch_widedeep.models.embeddings_layers.SharedEmbeddings`
        If 'None' is passed, it will default to 'False'.
    frac_shared_embed: float, Optional, default = None
        The fraction of embeddings that will be shared (if `add_shared_embed
        = False`) by all the different categories for one particular
        column. If 'None' is passed, it will default to 0.0.
    continuous_cols: List, Optional, default = None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer: str, Optional, default =  None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional, str, default = None,
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout: float, Optional, default = None,
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation: Optional, str, default = None,
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup: Dict[str, List[float]], Optional, default = None,
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details.
        If the _'piecewise'_ method is used, this parameter is required.
    n_frequencies: int, Optional, default = None,
        This is the so-called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma: float, Optional, default = None,
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer: bool, Optional, default = None,
        This parameter is not present in the aforementioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    input_dim: int, default = 32
        The so-called *dimension of the model*: the dimension of the
        embeddings used to encode the categorical and/or continuous columns.
    n_cross_attns: int, default = 1
        Number of times each perceiver block will cross attend to the input
        data (i.e. number of cross attention components per perceiver block).
        This should normally be 1. However, in the paper they describe some
        architectures (normally computer vision-related problems) where the
        Perceiver attends multiple times to the input array. Therefore, maybe
        multiple cross attention to the input array is also useful in some
        cases for tabular data :shrug: .
    n_cross_attn_heads: int, default = 4
        Number of attention heads for the cross attention component
    n_latents: int, default = 16
        Number of latents. This is the $N$ parameter in the paper. As
        indicated in the paper, this number should be significantly lower
        than $M$ (the number of columns in the dataset). Setting $N$ closer
        to $M$ defies the main purpose of the Perceiver, which is to overcome
        the transformer quadratic bottleneck
    latent_dim: int, default = 128
        Latent dimension.
    n_latent_heads: int, default = 4
        Number of attention heads per Latent Transformer
    n_latent_blocks: int, default = 4
        Number of transformer encoder blocks (normalised MHA + normalised FF)
        per Latent Transformer
    n_perceiver_blocks: int, default = 4
        Number of Perceiver blocks defined as [Cross Attention + Latent
        Transformer]
    share_weights: Boolean, default = False
        Boolean indicating if the weights will be shared between Perceiver
        blocks
    attn_dropout: float, default = 0.1
        Dropout that will be applied to the Multi-Head Attention layers
    ff_dropout: float, default = 0.1
        Dropout that will be applied to the FeedForward network
    ff_factor: int, default = 4
        Multiplicative factor applied to the first layer of the FF network in
        each Transformer block. This is normally set to 4.
    transformer_activation: str, default = "geglu"
        Transformer Encoder activation function. _'tanh'_, _'relu'_,
        _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported
    mlp_hidden_dims: List, Optional, default = None
        List with the number of neurons per dense layer in the MLP. e.g:
        _[64, 32]_. If not provided no MLP on top of the final
        Transformer block will be used.
    mlp_activation: str, Optional, default = None
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to _'relu'_.
    mlp_dropout: float, Optional, default = None
        float with the dropout between the dense layers of the MLP.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to 0.0.
    mlp_batchnorm: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_batchnorm_last: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_linear_first: bool, Optional, default = None
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to `True`.

    Attributes
    ----------
    encoder: nn.ModuleDict
        ModuleDict with the Perceiver blocks
    latents: nn.Parameter
        Latents that will be used for prediction
    mlp: nn.Module
        MLP component in the model

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import TabPerceiver
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ['a', 'b', 'c', 'd', 'e']
    >>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
    >>> continuous_cols = ['e']
    >>> column_idx = {k:v for v,k in enumerate(colnames)}
    >>> model = TabPerceiver(column_idx=column_idx, cat_embed_input=cat_embed_input,
    ... continuous_cols=continuous_cols, n_latents=2, latent_dim=16,
    ... n_perceiver_blocks=2)
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        column_idx: Dict[str, int],
        *,
        cat_embed_input: Optional[List[Tuple[str, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        shared_embed: Optional[bool] = None,
        add_shared_embed: Optional[bool] = None,
        frac_shared_embed: Optional[float] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = "standard",
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        input_dim: int = 32,
        n_cross_attns: int = 1,
        n_cross_attn_heads: int = 4,
        n_latents: int = 16,
        latent_dim: int = 128,
        n_latent_heads: int = 4,
        n_latent_blocks: int = 4,
        n_perceiver_blocks: int = 4,
        share_weights: bool = False,
        attn_dropout: float = 0.1,
        ff_dropout: float = 0.1,
        ff_factor: int = 4,
        transformer_activation: str = "geglu",
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: Optional[str] = None,
        mlp_dropout: Optional[float] = None,
        mlp_batchnorm: Optional[bool] = None,
        mlp_batchnorm_last: Optional[bool] = None,
        mlp_linear_first: Optional[bool] = None,
    ):
        super(TabPerceiver, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=shared_embed,
            add_shared_embed=add_shared_embed,
            frac_shared_embed=frac_shared_embed,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            input_dim=input_dim,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.n_cross_attns = n_cross_attns
        self.n_cross_attn_heads = n_cross_attn_heads
        self.n_latents = n_latents
        self.latent_dim = latent_dim
        self.n_latent_heads = n_latent_heads
        self.n_latent_blocks = n_latent_blocks
        self.n_perceiver_blocks = n_perceiver_blocks
        self.share_weights = share_weights
        self.attn_dropout = attn_dropout
        self.ff_dropout = ff_dropout
        self.ff_factor = ff_factor
        self.transformer_activation = transformer_activation

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        # Embeddings are instantiated at the base model
        # Transformer blocks
        self.latents = nn.init.trunc_normal_(
            nn.Parameter(torch.empty(n_latents, latent_dim))
        )

        self.encoder = nn.ModuleDict()
        first_perceiver_block = self._build_perceiver_block()
        self.encoder["perceiver_block0"] = first_perceiver_block

        if share_weights:
            for n in range(1, n_perceiver_blocks):
                self.encoder["perceiver_block" + str(n)] = first_perceiver_block
        else:
            for n in range(1, n_perceiver_blocks):
                self.encoder["perceiver_block" + str(n)] = self._build_perceiver_block()

        self.mlp_first_hidden_dim = self.latent_dim

        # Mlp
        if self.mlp_hidden_dims is not None:
            self.mlp = MLP(
                d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    False if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
        else:
            self.mlp = None

    def forward(self, X: Tensor) -> Tensor:
        x_emb = self._get_embeddings(X)

        x = einops.repeat(self.latents, "n d -> b n d", b=X.shape[0])

        for n in range(self.n_perceiver_blocks):
            cross_attns = self.encoder["perceiver_block" + str(n)]["cross_attns"]
            latent_transformer = self.encoder["perceiver_block" + str(n)][
                "latent_transformer"
            ]
            for cross_attn in cross_attns:
                x = cross_attn(x, x_emb)
            x = latent_transformer(x)

        # average along the latent index axis
        x = x.mean(dim=1)

        if self.mlp is not None:
            x = self.mlp(x)

        return x

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return (
            self.mlp_hidden_dims[-1]
            if self.mlp_hidden_dims is not None
            else self.mlp_first_hidden_dim
        )

    @property
    def attention_weights(self) -> List:
        r"""List with the attention weights. If the weights are not shared
        between perceiver blocks, each element of the list will be a list
        itself containing the Cross Attention and Latent Transformer
        attention weights respectively

        The shape of the attention weights is:

        - Cross Attention: $(N, C, L, F)$

        - Latent Attention: $(N, T, L, L)$

        Where $N$ is the batch size, $C$ is the number of Cross Attention
        heads, $L$ is the number of Latents, $F$ is the number of
        features/columns in the dataset and $T$ is the number of Latent
        Attention heads
        """
        if self.share_weights:
            cross_attns = self.encoder["perceiver_block0"]["cross_attns"]
            latent_transformer = self.encoder["perceiver_block0"]["latent_transformer"]
            attention_weights = self._extract_attn_weights(
                cross_attns, latent_transformer
            )
        else:
            attention_weights = []
            for n in range(self.n_perceiver_blocks):
                cross_attns = self.encoder["perceiver_block" + str(n)]["cross_attns"]
                latent_transformer = self.encoder["perceiver_block" + str(n)][
                    "latent_transformer"
                ]
                attention_weights.append(
                    self._extract_attn_weights(cross_attns, latent_transformer)
                )
        return attention_weights

    def _build_perceiver_block(self) -> nn.ModuleDict:
        perceiver_block = nn.ModuleDict()

        # Cross Attention
        cross_attns = nn.ModuleList()
        for _ in range(self.n_cross_attns):
            cross_attns.append(
                PerceiverEncoder(
                    self.input_dim,
                    self.n_cross_attn_heads,
                    False,  # use_bias
                    self.attn_dropout,
                    self.ff_dropout,
                    self.ff_factor,
                    self.transformer_activation,
                    self.latent_dim,  # q_dim,
                ),
            )
        perceiver_block["cross_attns"] = cross_attns

        # Latent Transformer
        latent_transformer = nn.Sequential()
        for i in range(self.n_latent_blocks):
            latent_transformer.add_module(
                "latent_block" + str(i),
                PerceiverEncoder(
                    self.latent_dim,  # input_dim
                    self.n_latent_heads,
                    False,  # use_bias
                    self.attn_dropout,
                    self.ff_dropout,
                    self.ff_factor,
                    self.transformer_activation,
                ),
            )
        perceiver_block["latent_transformer"] = latent_transformer

        return perceiver_block

    @staticmethod
    def _extract_attn_weights(cross_attns, latent_transformer) -> List:
        attention_weights = []
        for cross_attn in cross_attns:
            attention_weights.append(cross_attn.attn.attn_weights)
        for latent_block in latent_transformer:
            attention_weights.append(latent_block.attn.attn_weights)
        return attention_weights

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights. If the weights are not shared between perceiver blocks, each element of the list will be a list itself containing the Cross Attention and Latent Transformer attention weights respectively

The shape of the attention weights is:

  • Cross Attention: \((N, C, L, F)\)

  • Latent Attention: \((N, T, L, L)\)

Where \(N\) is the batch size, \(C\) is the number of Cross Attention heads, \(L\) is the number of Latents, \(F\) is the number of features/columns in the dataset and \(T\) is the number of Latent Attention heads
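The nesting and shapes described above can be reproduced with plain Python. The sketch below is ours for illustration only (the function name and arguments are not part of the library); it mirrors how `attention_weights` nests cross attention and latent transformer maps:

```python
def perceiver_attn_weight_shapes(
    batch_size, n_cross_attns, n_cross_attn_heads, n_latents,
    n_features, n_latent_heads, n_latent_blocks,
    n_perceiver_blocks, share_weights,
):
    """Shapes of the tensors that ``attention_weights`` would return."""
    # Each cross attention yields an (N, C, L, F) map, each latent
    # transformer block an (N, T, L, L) map
    cross = [(batch_size, n_cross_attn_heads, n_latents, n_features)] * n_cross_attns
    latent = [(batch_size, n_latent_heads, n_latents, n_latents)] * n_latent_blocks
    block = cross + latent
    # With shared weights a single block of weights is returned;
    # otherwise there is one list of weights per perceiver block
    return block if share_weights else [block for _ in range(n_perceiver_blocks)]
```

For the `TabPerceiver` example above (batch of 5 rows, 5 columns, 2 latents, 2 perceiver blocks, and the default 1 cross attention, 4 heads and 4 latent blocks per block), this yields 2 lists of 5 shape tuples each.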

TabFastFormer

Bases: BaseTabularModelWithAttention

Defines an adaptation of a FastFormer that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

ℹ️ NOTE: while there are scientific publications for the TabTransformer, SAINT and FTTransformer, the TabPerceiver and the TabFastFormer are our own adaptations of the Perceiver and the FastFormer for tabular data.

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the TabMlp model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
cat_embed_input Optional[List[Tuple[str, int]]]

List of Tuples with the column name and number of unique values and embedding dimension. e.g. [(education, 11), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
shared_embed Optional[bool]

Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

None
add_shared_embed Optional[bool]

The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings If 'None' is passed, it will default to 'False'.

None
frac_shared_embed Optional[float]

The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal[batchnorm, layernorm]]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal[standard, piecewise, periodic]]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.

'standard'
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so-called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the aforementioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
input_dim int

The so-called dimension of the model: the dimension of the embeddings used to encode the categorical and/or continuous columns

32
n_heads int

Number of attention heads per FastFormer block

8
use_bias bool

Boolean indicating whether or not to use bias in the Q, K, and V projection layers

False
n_blocks int

Number of FastFormer blocks

4
attn_dropout float

Dropout that will be applied to the Additive Attention layers

0.1
ff_dropout float

Dropout that will be applied to the FeedForward network

0.2
ff_factor int

Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4.

4
share_qv_weights bool

Following the paper, this is a boolean indicating if the Value (\(V\)) and the Query (\(Q\)) transformation parameters will be shared.

False
share_weights bool

In addition to sharing the \(V\) and \(Q\) transformation parameters, the parameters across different Fastformer layers can also be shared. Please, see pytorch_widedeep/models/tabular/transformers/tab_fastformer.py for details

False
transformer_activation str

Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

'relu'
mlp_hidden_dims Optional[List[int]]

List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If not provided no MLP on top of the final Transformer block will be used.

None
mlp_activation Optional[str]

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

None
mlp_dropout Optional[float]

float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

None
mlp_batchnorm Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the dense layers If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_batchnorm_last Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

None
mlp_linear_first Optional[bool]

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT] If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

None
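As an illustration of the `quantization_setup` parameter described above, the 'piecewise' method first maps each continuous value to the bucket defined by the column's boundary list (a learned embedding per bucket is then looked up). A minimal stdlib sketch of that bucketing step, with a hypothetical 'age' column (the helper below is ours, not library code):

```python
import bisect

# Hypothetical boundaries for one continuous column
quantization_setup = {"age": [18.0, 30.0, 45.0, 60.0]}

def bucket_index(col: str, value: float, setup: dict) -> int:
    """Index of the quantization bucket that `value` falls into."""
    # bisect_right counts how many boundaries lie at or below `value`,
    # which is exactly the bucket index (0 .. len(boundaries))
    return bisect.bisect_right(setup[col], value)

print(bucket_index("age", 25.0, quantization_setup))  # between 18 and 30 -> 1
```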

Attributes:

Name Type Description
encoder Module

Sequence of FastFormer blocks.

mlp Module

MLP component in the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabFastFormer
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabFastFormer(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)
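To make the additive attention idea concrete: instead of pairwise query-key scores, the FastFormer summarises a sequence into a single global vector via a softmax-weighted sum over per-position scalar scores, which is linear (rather than quadratic) in the number of columns. A plain-Python sketch of that pooling step (illustrative only; this is not the library's implementation):

```python
import math

def additive_pool(vectors, scores):
    """Softmax-weighted sum of `vectors` using one scalar score per position."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(vectors[0])
    # Global vector: weighted sum over positions, one entry per dimension
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Two positions with equal scores contribute equally to the global vector
print(additive_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))  # [0.5, 0.5]
```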
Source code in pytorch_widedeep/models/tabular/transformers/tab_fastformer.py
class TabFastFormer(BaseTabularModelWithAttention):
    r"""Defines an adaptation of a [FastFormer](https://arxiv.org/abs/2108.09084)
    that can be used as the `deeptabular` component of a Wide & Deep model
    or independently by itself.

    Most of the parameters for this class are `Optional` since the use of
    categorical or continuous is in fact optional (i.e. one can use
    categorical features only, continuous features only or both).

    :information_source: **NOTE**: while there are scientific publications for
     the `TabTransformer`, `SAINT` and `FTTransformer`, the `TabPerceiver`
     and the `TabFastFormer` are our own adaptations of the
     [Perceiver](https://arxiv.org/abs/2103.03206) and the
     [FastFormer](https://arxiv.org/abs/2108.09084) for tabular data.

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `TabMlp` model. Required to slice the tensors. e.g. _{'education':
        0, 'relationship': 1, 'workclass': 2, ...}_.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name and number of unique values and
        embedding dimension. e.g. _[(education, 11), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    shared_embed: bool, Optional, default = None
        Boolean indicating if the embeddings will be "shared". The idea behind `shared_embed` is
        described in the Appendix A in the [TabTransformer paper](https://arxiv.org/abs/2012.06678):
        _'The goal of having column embedding is to enable the model to
        distinguish the classes in one column from those in the other
        columns'_. In other words, the idea is to let the model learn which
        column is embedded at the time. See: `pytorch_widedeep.models.transformers._layers.SharedEmbeddings`.
    add_shared_embed: bool, Optional, default = None
        The two embedding sharing strategies are: 1) add the shared embeddings
        to the column embeddings or 2) replace the first
        `frac_shared_embed` with the shared embeddings.
        See `pytorch_widedeep.models.embeddings_layers.SharedEmbeddings`
        If 'None' is passed, it will default to 'False'.
    frac_shared_embed: float, Optional, default = None
        The fraction of embeddings that will be shared (if `add_shared_embed
        = False`) by all the different categories for one particular
        column. If 'None' is passed, it will default to 0.0.
    continuous_cols: List, Optional, default = None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer: str, Optional, default = None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. If `None`, no
        normalization layer will be used.
    embed_continuous_method: str, Optional, default = "standard",
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please read the papers for details.
    cont_embed_dropout: float, Optional, default = None,
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation: Optional, str, default = None,
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup: Dict[str, List[float]], Optional, default = None,
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details.
        If the _'piecewise'_ method is used, this parameter is required.
    n_frequencies: int, Optional, default = None,
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma: float, Optional, default = None,
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer: bool, Optional, default = None,
        This parameter is not present in the aforementioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    input_dim: int, default = 32
        The so-called *dimension of the model*. Is the number of
        embeddings used to encode the categorical and/or continuous columns
    n_heads: int, default = 8
        Number of attention heads per FastFormer block
    use_bias: bool, default = False
        Boolean indicating whether or not to use bias in the Q, K, and V
        projection layers
    n_blocks: int, default = 4
        Number of FastFormer blocks
    attn_dropout: float, default = 0.1
        Dropout that will be applied to the Additive Attention layers
    ff_dropout: float, default = 0.2
        Dropout that will be applied to the FeedForward network
    ff_factor: int, default = 4
        Multiplicative factor applied to the first layer of the FF network in
        each Transformer block. This is normally set to 4.
    share_qv_weights: bool, default = False
        Following the paper, this is a boolean indicating if the Value ($V$) and
        the Query ($Q$) transformation parameters will be shared.
    share_weights: bool, default = False
        In addition to sharing the $V$ and $Q$ transformation parameters, the
        parameters across different Fastformer layers can also be shared.
        Please, see
        `pytorch_widedeep/models/tabular/transformers/tab_fastformer.py` for
        details
    transformer_activation: str, default = "relu"
        Transformer Encoder activation function. _'tanh'_, _'relu'_,
        _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported
    mlp_hidden_dims: List, Optional, default = None
        List with the number of neurons per dense layer in the MLP, e.g.:
        _[64, 32]_. If not provided no MLP on top of the final
        FastFormer block will be used.
    mlp_activation: str, Optional, default = None
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to _'relu'_.
    mlp_dropout: float, Optional, default = None
        float with the dropout between the dense layers of the MLP.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to 0.0.
    mlp_batchnorm: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_batchnorm_last: bool, Optional, default = None
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers.
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to False.
    mlp_linear_first: bool, Optional, default = None
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`
        If 'mlp_hidden_dims' is not `None` and this parameter is `None`, it
        will default to `False`.

    Attributes
    ----------
    encoder: nn.Module
        Sequence of FastFormer blocks.
    mlp: nn.Module
        MLP component in the model

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import TabFastFormer
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ['a', 'b', 'c', 'd', 'e']
    >>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
    >>> continuous_cols = ['e']
    >>> column_idx = {k:v for v,k in enumerate(colnames)}
    >>> model = TabFastFormer(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        column_idx: Dict[str, int],
        *,
        cat_embed_input: Optional[List[Tuple[str, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        shared_embed: Optional[bool] = None,
        add_shared_embed: Optional[bool] = None,
        frac_shared_embed: Optional[float] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = "standard",
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        input_dim: int = 32,
        n_heads: int = 8,
        use_bias: bool = False,
        n_blocks: int = 4,
        attn_dropout: float = 0.1,
        ff_dropout: float = 0.2,
        ff_factor: int = 4,
        share_qv_weights: bool = False,
        share_weights: bool = False,
        transformer_activation: str = "relu",
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: Optional[str] = None,
        mlp_dropout: Optional[float] = None,
        mlp_batchnorm: Optional[bool] = None,
        mlp_batchnorm_last: Optional[bool] = None,
        mlp_linear_first: Optional[bool] = None,
    ):
        super(TabFastFormer, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=shared_embed,
            add_shared_embed=add_shared_embed,
            frac_shared_embed=frac_shared_embed,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            input_dim=input_dim,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.n_heads = n_heads
        self.use_bias = use_bias
        self.n_blocks = n_blocks
        self.attn_dropout = attn_dropout
        self.ff_dropout = ff_dropout
        self.ff_factor = ff_factor
        self.share_qv_weights = share_qv_weights
        self.share_weights = share_weights
        self.transformer_activation = transformer_activation

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        self.with_cls_token = "cls_token" in column_idx
        self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
        self.n_cont = len(continuous_cols) if continuous_cols is not None else 0
        self.n_feats = self.n_cat + self.n_cont

        # Embeddings are instantiated at the base model
        # Transformer blocks
        self.encoder = nn.Sequential()
        first_fastformer_block = FastFormerEncoder(
            input_dim,
            n_heads,
            use_bias,
            attn_dropout,
            ff_dropout,
            ff_factor,
            share_qv_weights,
            transformer_activation,
        )
        self.encoder.add_module("fastformer_block0", first_fastformer_block)
        for i in range(1, n_blocks):
            if share_weights:
                self.encoder.add_module(
                    "fastformer_block" + str(i), first_fastformer_block
                )
            else:
                self.encoder.add_module(
                    "fastformer_block" + str(i),
                    FastFormerEncoder(
                        input_dim,
                        n_heads,
                        use_bias,
                        attn_dropout,
                        ff_dropout,
                        ff_factor,
                        share_qv_weights,
                        transformer_activation,
                    ),
                )

        self.mlp_first_hidden_dim = (
            self.input_dim if self.with_cls_token else (self.n_feats * self.input_dim)
        )

        # MLP: adding an MLP on top of the FastFormer blocks is optional and
        # therefore all related params are optional
        if self.mlp_hidden_dims is not None:
            self.mlp = MLP(
                d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    False if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
        else:
            self.mlp = None

    def forward(self, X: Tensor) -> Tensor:
        x = self._get_embeddings(X)
        x = self.encoder(x)
        if self.with_cls_token:
            x = x[:, 0, :]
        else:
            x = x.flatten(1)
        if self.mlp is not None:
            x = self.mlp(x)
        return x

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return (
            self.mlp_hidden_dims[-1]
            if self.mlp_hidden_dims is not None
            else self.mlp_first_hidden_dim
        )

    @property
    def attention_weights(self) -> List[Tuple[Tensor, Tensor]]:
        r"""List with the attention weights. Each element of the list is a
        tuple where the first and second elements are the $\alpha$
        and $\beta$ attention weights in the paper.

        The shape of the attention weights is $(N, H, F)$ where $N$ is the
        batch size, $H$ is the number of attention heads and $F$ is the
        number of features/columns in the dataset
        """
        if self.share_weights:
            attention_weights = [self.encoder[0].attn.attn_weights]
        else:
            attention_weights = [blk.attn.attn_weights for blk in self.encoder]
        return attention_weights

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights. Each element of the list is a tuple where the first and second elements are the \(\alpha\) and \(\beta\) attention weights in the paper.

The shape of the attention weights is \((N, H, F)\) where \(N\) is the batch size, \(H\) is the number of attention heads and \(F\) is the number of features/columns in the dataset
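
As a rough sketch of what these weights represent: FastFormer's additive attention collapses the \(F\) feature tokens into a single global vector using one learned scalar weight per token (the \(\alpha\) weights; the \(\beta\) weights are produced analogously for the keys). The snippet below is an illustrative NumPy reimplementation of that pooling step under simplifying assumptions (a single head, made-up names `softmax` and `additive_pool`), not the library's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_pool(tokens, w):
    # tokens: (N, F, D) feature embeddings, w: (D,) learned scoring vector
    # alpha: (N, F) -- one attention weight per feature/column, matching the
    # (N, H, F) shape above with a single head
    alpha = softmax(tokens @ w / np.sqrt(tokens.shape[-1]), axis=-1)
    # global vector: alpha-weighted sum over the feature axis -> (N, D)
    return np.einsum("nf,nfd->nd", alpha, tokens), alpha

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 5, 8))  # batch of 4, 5 columns, embedding dim 8
w = rng.normal(size=8)
global_q, alpha = additive_pool(tokens, w)
```

Each row of `alpha` sums to 1, so the global vector is a convex combination of the column embeddings; this linear pooling is what makes the attention cost additive in the number of features.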

BasicRNN

Bases: BaseWDModelComponent

Standard text classifier/regressor composed of a stack of RNNs (LSTMs or GRUs) that can be used as the deeptext component of a Wide & Deep model or independently by itself.

In addition, there is the option to add a Fully Connected (FC) set of dense layers on top of the stack of RNNs

Parameters:

Name Type Description Default
vocab_size int

Number of words in the vocabulary

required
embed_dim Optional[int]

Dimension of the word embeddings if non-pretrained word vectors are used

None
embed_matrix Optional[ndarray]

Pretrained word embeddings

None
embed_trainable bool

Boolean indicating if the pretrained embeddings are trainable

True
rnn_type Literal[lstm, gru]

String indicating the type of RNN to use. One of 'lstm' or 'gru'

'lstm'
hidden_dim int

Hidden dim of the RNN

64
n_layers int

Number of recurrent layers

3
rnn_dropout float

Dropout for each RNN layer except the last layer

0.0
bidirectional bool

Boolean indicating whether the stacked RNNs are bidirectional

False
use_hidden_state bool

Boolean indicating whether to use the final hidden state or the RNN's output as predicting features. Typically the former is used.

True
padding_idx int

index of the padding token in the padded-tokenised sequences. The TextPreprocessor class within this library uses fastai's tokenizer where the token index 0 is reserved for the 'unknown' word token. Therefore, the default value is set to 1.

1
head_hidden_dims Optional[List[int]]

List with the sizes of the dense layers in the head e.g: [128, 64]

None
head_activation str

Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
head_dropout Optional[float]

Dropout of the dense layers in the head

None
head_batchnorm bool

Boolean indicating whether or not to include batch normalization in the dense layers that form the 'rnn_mlp'

False
head_batchnorm_last bool

Boolean indicating whether or not to apply batch normalization to the last of the dense layers in the head

False
head_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

False

Attributes:

Name Type Description
word_embed Module

word embedding matrix

rnn Module

Stack of RNNs

rnn_mlp Module

Stack of dense layers on top of the RNN. This will only exist if head_layers_dim is not None

Examples:

>>> import torch
>>> from pytorch_widedeep.models import BasicRNN
>>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
>>> model = BasicRNN(vocab_size=4, hidden_dim=4, n_layers=2, padding_idx=0, embed_dim=4)
>>> out = model(X_text)
Source code in pytorch_widedeep/models/text/rnns/basic_rnn.py
class BasicRNN(BaseWDModelComponent):
    r"""Standard text classifier/regressor comprised by a stack of RNNs
    (LSTMs or GRUs) that can be used as the `deeptext` component of a Wide &
    Deep model or independently by itself.

    In addition, there is the option to add a Fully Connected (FC) set of
    dense layers on top of the stack of RNNs

    Parameters
    ----------
    vocab_size: int
        Number of words in the vocabulary
    embed_dim: int, Optional, default = None
        Dimension of the word embeddings if non-pretrained word vectors are
        used
    embed_matrix: np.ndarray, Optional, default = None
        Pretrained word embeddings
    embed_trainable: bool, default = True
        Boolean indicating if the pretrained embeddings are trainable
    rnn_type: str, default = 'lstm'
        String indicating the type of RNN to use. One of _'lstm'_ or _'gru'_
    hidden_dim: int, default = 64
        Hidden dim of the RNN
    n_layers: int, default = 3
        Number of recurrent layers
    rnn_dropout: float, default = 0.0
        Dropout for each RNN layer except the last layer
    bidirectional: bool, default = False
        Boolean indicating whether the stacked RNNs are bidirectional
    use_hidden_state: bool, default = True
        Boolean indicating whether to use the final hidden state or the RNN's
        output as predicting features. Typically the former is used.
    padding_idx: int, default = 1
        index of the padding token in the padded-tokenised sequences. The
        `TextPreprocessor` class within this library uses fastai's tokenizer
        where the token index 0 is reserved for the _'unknown'_ word token.
        Therefore, the default value is set to 1.
    head_hidden_dims: List, Optional, default = None
        List with the sizes of the dense layers in the head e.g: _[128, 64]_
    head_activation: str, default = "relu"
        Activation function for the dense layers in the head. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    head_dropout: float, Optional, default = None
        Dropout of the dense layers in the head
    head_batchnorm: bool, default = False
        Boolean indicating whether or not to include batch normalization in
        the dense layers that form the _'rnn_mlp'_
    head_batchnorm_last: bool, default = False
        Boolean indicating whether or not to apply batch normalization to the
        last of the dense layers in the head
    head_linear_first: bool, default = False
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    word_embed: nn.Module
        word embedding matrix
    rnn: nn.Module
        Stack of RNNs
    rnn_mlp: nn.Module
        Stack of dense layers on top of the RNN. This will only exist if
        `head_layers_dim` is not None

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import BasicRNN
    >>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
    >>> model = BasicRNN(vocab_size=4, hidden_dim=4, n_layers=2, padding_idx=0, embed_dim=4)
    >>> out = model(X_text)
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: Optional[int] = None,
        embed_matrix: Optional[np.ndarray] = None,
        embed_trainable: bool = True,
        rnn_type: Literal["lstm", "gru"] = "lstm",
        hidden_dim: int = 64,
        n_layers: int = 3,
        rnn_dropout: float = 0.0,
        bidirectional: bool = False,
        use_hidden_state: bool = True,
        padding_idx: int = 1,
        head_hidden_dims: Optional[List[int]] = None,
        head_activation: str = "relu",
        head_dropout: Optional[float] = None,
        head_batchnorm: bool = False,
        head_batchnorm_last: bool = False,
        head_linear_first: bool = False,
    ):
        super(BasicRNN, self).__init__()

        if embed_dim is None and embed_matrix is None:
            raise ValueError(
                "If no 'embed_matrix' is passed, the embedding dimension must "
                "be specified with 'embed_dim'"
            )

        if rnn_type.lower() not in ["lstm", "gru"]:
            raise ValueError(
                f"'rnn_type' must be 'lstm' or 'gru', got {rnn_type} instead"
            )

        if (
            embed_dim is not None
            and embed_matrix is not None
            and not embed_dim == embed_matrix.shape[1]
        ):
            warnings.warn(
                "the input embedding dimension {} and the dimension of the "
                "pretrained embeddings {} do not match. The pretrained embeddings "
                "dimension ({}) will be used".format(
                    embed_dim, embed_matrix.shape[1], embed_matrix.shape[1]
                ),
                UserWarning,
            )

        self.vocab_size = vocab_size
        self.embed_trainable = embed_trainable
        self.embed_dim = embed_dim

        self.rnn_type = rnn_type
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.rnn_dropout = rnn_dropout
        self.bidirectional = bidirectional
        self.use_hidden_state = use_hidden_state
        self.padding_idx = padding_idx

        self.head_hidden_dims = head_hidden_dims
        self.head_activation = head_activation
        self.head_dropout = head_dropout
        self.head_batchnorm = head_batchnorm
        self.head_batchnorm_last = head_batchnorm_last
        self.head_linear_first = head_linear_first

        # Embeddings
        if embed_matrix is not None:
            self.word_embed, self.embed_dim = self._set_embeddings(embed_matrix)
        else:
            assert self.embed_dim is not None
            self.word_embed = nn.Embedding(
                self.vocab_size, self.embed_dim, padding_idx=self.padding_idx
            )

        # RNN
        rnn_params = {
            "input_size": self.embed_dim,
            "hidden_size": hidden_dim,
            "num_layers": n_layers,
            "bidirectional": bidirectional,
            "dropout": rnn_dropout,
            "batch_first": True,
        }
        if self.rnn_type.lower() == "lstm":
            self.rnn: Union[nn.LSTM, nn.GRU] = nn.LSTM(**rnn_params)
        elif self.rnn_type.lower() == "gru":
            self.rnn = nn.GRU(**rnn_params)
        else:
            raise ValueError(
                f"'rnn_type' must be 'lstm' or 'gru', got {self.rnn_type} instead"
            )

        self.rnn_output_dim = hidden_dim * 2 if bidirectional else hidden_dim

        # FC-Head (Mlp)
        if self.head_hidden_dims is not None:
            head_hidden_dims = [self.rnn_output_dim] + self.head_hidden_dims
            self.rnn_mlp: Union[MLP, nn.Identity] = MLP(
                head_hidden_dims,
                head_activation,
                head_dropout,
                head_batchnorm,
                head_batchnorm_last,
                head_linear_first,
            )
        else:
            # simple hack to add readability in the forward pass
            self.rnn_mlp = nn.Identity()

    def forward(self, X: Tensor) -> Tensor:
        embed = self.word_embed(X.long())

        if self.rnn_type.lower() == "lstm":
            o, (h, c) = self.rnn(embed)
        elif self.rnn_type.lower() == "gru":
            o, h = self.rnn(embed)
        else:
            raise ValueError(
                f"'rnn_type' must be 'lstm' or 'gru', got {self.rnn_type} instead"
            )

        processed_outputs = self._process_rnn_outputs(o, h)

        return self.rnn_mlp(processed_outputs)

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return (
            self.head_hidden_dims[-1]
            if self.head_hidden_dims is not None
            else self.rnn_output_dim
        )

    def _set_embeddings(
        self, embed_matrix: Optional[np.ndarray] = None
    ) -> Tuple[nn.Module, int]:
        if embed_matrix is not None:
            assert (
                embed_matrix.dtype == "float32"
            ), "'embed_matrix' must be of dtype 'float32', got dtype '{}'".format(
                str(embed_matrix.dtype)
            )
            word_embed = nn.Embedding(
                self.vocab_size, embed_matrix.shape[1], padding_idx=self.padding_idx
            )
            if self.embed_trainable:
                word_embed.weight = nn.Parameter(
                    torch.tensor(embed_matrix), requires_grad=True
                )
            else:
                word_embed.weight = nn.Parameter(
                    torch.tensor(embed_matrix), requires_grad=False
                )
            embed_dim = embed_matrix.shape[1]
        else:
            assert self.embed_dim is not None
            word_embed = nn.Embedding(
                self.vocab_size, self.embed_dim, padding_idx=self.padding_idx
            )
            embed_dim = self.embed_dim

        return word_embed, embed_dim

    def _process_rnn_outputs(self, output: Tensor, hidden: Tensor) -> Tensor:
        output = output.permute(1, 0, 2)
        if self.bidirectional:
            processed_outputs = (
                torch.cat((hidden[-2], hidden[-1]), dim=1)
                if self.use_hidden_state
                else output[-1]
            )
        else:
            processed_outputs = hidden[-1] if self.use_hidden_state else output[-1]

        return processed_outputs

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class
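
To illustrate the `use_hidden_state` choice described above: for a unidirectional RNN with `batch_first=True`, the top layer's final hidden state coincides with the last time step of the output sequence, which is why the former is typically used. Below is a minimal sketch with a plain `nn.LSTM`, not the library's `BasicRNN` itself:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(4, 10, 8)        # (batch, seq_len, embed_dim)
output, (h_n, c_n) = rnn(x)      # output: (4, 10, 16); h_n: (2, 4, 16)

# use_hidden_state=True selects the top layer's final hidden state
hidden_feats = h_n[-1]           # (4, 16)
# use_hidden_state=False selects the output at the last time step
output_feats = output[:, -1, :]  # (4, 16)

# For a unidirectional RNN these coincide; with bidirectional=True they
# differ, and the concatenation of h_n[-2] and h_n[-1] is used instead
```

Either choice yields a `(batch, rnn_output_dim)` tensor that the optional head MLP (or the `WideDeep` constructor) consumes.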

AttentiveRNN

Bases: BasicRNN

Text classifier/regressor composed of a stack of RNNs (LSTMs or GRUs) plus an attention layer. This model can be used as the deeptext component of a Wide & Deep model or independently by itself.

In addition, there is the option to add a Fully Connected (FC) set of dense layers on top of the attention layer

Parameters:

Name Type Description Default
vocab_size int

Number of words in the vocabulary

required
embed_dim Optional[int]

Dimension of the word embeddings if non-pretrained word vectors are used

None
embed_matrix Optional[ndarray]

Pretrained word embeddings

None
embed_trainable bool

Boolean indicating if the pretrained embeddings are trainable

True
rnn_type Literal[lstm, gru]

String indicating the type of RNN to use. One of 'lstm' or 'gru'

'lstm'
hidden_dim int

Hidden dim of the RNN

64
n_layers int

Number of recurrent layers

3
rnn_dropout float

Dropout for each RNN layer except the last layer

0.1
bidirectional bool

Boolean indicating whether the stacked RNNs are bidirectional

False
use_hidden_state bool

Boolean indicating whether to use the final hidden state or the RNN's output as predicting features. Typically the former is used.

True
padding_idx int

index of the padding token in the padded-tokenised sequences. The TextPreprocessor class within this library uses fastai's tokenizer where the token index 0 is reserved for the 'unknown' word token. Therefore, the default value is set to 1.

1
attn_concatenate bool

Boolean indicating if the input to the attention mechanism will be the output of the RNN or the output of the RNN concatenated with the last hidden state.

True
attn_dropout float

Internal dropout for the attention mechanism

0.1
head_hidden_dims Optional[List[int]]

List with the sizes of the dense layers in the head e.g: [128, 64]

None
head_activation str

Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
head_dropout Optional[float]

Dropout of the dense layers in the head

None
head_batchnorm bool

Boolean indicating whether or not to include batch normalization in the dense layers that form the 'rnn_mlp'

False
head_batchnorm_last bool

Boolean indicating whether or not to apply batch normalization to the last of the dense layers in the head

False
head_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

False

Attributes:

Name Type Description
word_embed Module

word embedding matrix

rnn Module

Stack of RNNs

rnn_mlp Module

Stack of dense layers on top of the RNN. This will only exist if head_layers_dim is not None

Examples:

>>> import torch
>>> from pytorch_widedeep.models import AttentiveRNN
>>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
>>> model = AttentiveRNN(vocab_size=4, hidden_dim=4, n_layers=2, padding_idx=0, embed_dim=4)
>>> out = model(X_text)
Source code in pytorch_widedeep/models/text/rnns/attentive_rnn.py
class AttentiveRNN(BasicRNN):
    r"""Text classifier/regressor comprised by a stack of RNNs
    (LSTMs or GRUs) plus an attention layer. This model can be used as the
    `deeptext` component of a Wide & Deep model or independently by
    itself.

    In addition, there is the option to add a Fully Connected (FC) set of dense
    layers on top of the attention layer

    Parameters
    ----------
    vocab_size: int
        Number of words in the vocabulary
    embed_dim: int, Optional, default = None
        Dimension of the word embeddings if non-pretrained word vectors are
        used
    embed_matrix: np.ndarray, Optional, default = None
        Pretrained word embeddings
    embed_trainable: bool, default = True
        Boolean indicating if the pretrained embeddings are trainable
    rnn_type: str, default = 'lstm'
        String indicating the type of RNN to use. One of _'lstm'_ or _'gru'_
    hidden_dim: int, default = 64
        Hidden dim of the RNN
    n_layers: int, default = 3
        Number of recurrent layers
    rnn_dropout: float, default = 0.1
        Dropout for each RNN layer except the last layer
    bidirectional: bool, default = False
        Boolean indicating whether the stacked RNNs are bidirectional
    use_hidden_state: bool, default = True
        Boolean indicating whether to use the final hidden state or the RNN's
        output as predicting features. Typically the former is used.
    padding_idx: int, default = 1
        index of the padding token in the padded-tokenised sequences. The
        `TextPreprocessor` class within this library uses fastai's
        tokenizer where the token index 0 is reserved for the _'unknown'_
        word token. Therefore, the default value is set to 1.
    attn_concatenate: bool, default = True
        Boolean indicating if the input to the attention mechanism will be the
        output of the RNN or the output of the RNN concatenated with the last
        hidden state.
    attn_dropout: float, default = 0.1
        Internal dropout for the attention mechanism
    head_hidden_dims: List, Optional, default = None
        List with the sizes of the dense layers in the head e.g: _[128, 64]_
    head_activation: str, default = "relu"
        Activation function for the dense layers in the head. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    head_dropout: float, Optional, default = None
        Dropout of the dense layers in the head
    head_batchnorm: bool, default = False
        Boolean indicating whether or not to include batch normalization in
        the dense layers that form the _'rnn_mlp'_
    head_batchnorm_last: bool, default = False
        Boolean indicating whether or not to apply batch normalization to the
        last of the dense layers in the head
    head_linear_first: bool, default = False
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    word_embed: nn.Module
        word embedding matrix
    rnn: nn.Module
        Stack of RNNs
    rnn_mlp: nn.Module
        Stack of dense layers on top of the RNN. This will only exist if
        `head_layers_dim` is not `None`

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import AttentiveRNN
    >>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
    >>> model = AttentiveRNN(vocab_size=4, hidden_dim=4, n_layers=2, padding_idx=0, embed_dim=4)
    >>> out = model(X_text)
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: Optional[int] = None,
        embed_matrix: Optional[np.ndarray] = None,
        embed_trainable: bool = True,
        rnn_type: Literal["lstm", "gru"] = "lstm",
        hidden_dim: int = 64,
        n_layers: int = 3,
        rnn_dropout: float = 0.1,
        bidirectional: bool = False,
        use_hidden_state: bool = True,
        padding_idx: int = 1,
        attn_concatenate: bool = True,
        attn_dropout: float = 0.1,
        head_hidden_dims: Optional[List[int]] = None,
        head_activation: str = "relu",
        head_dropout: Optional[float] = None,
        head_batchnorm: bool = False,
        head_batchnorm_last: bool = False,
        head_linear_first: bool = False,
    ):
        super(AttentiveRNN, self).__init__(
            vocab_size=vocab_size,
            embed_dim=embed_dim,
            embed_matrix=embed_matrix,
            embed_trainable=embed_trainable,
            rnn_type=rnn_type,
            hidden_dim=hidden_dim,
            n_layers=n_layers,
            rnn_dropout=rnn_dropout,
            bidirectional=bidirectional,
            use_hidden_state=use_hidden_state,
            padding_idx=padding_idx,
            head_hidden_dims=head_hidden_dims,
            head_activation=head_activation,
            head_dropout=head_dropout,
            head_batchnorm=head_batchnorm,
            head_batchnorm_last=head_batchnorm_last,
            head_linear_first=head_linear_first,
        )

        # Embeddings and RNN defined in the BasicRNN inherited class

        # Attention
        self.attn_concatenate = attn_concatenate
        self.attn_dropout = attn_dropout

        if bidirectional and attn_concatenate:
            self.rnn_output_dim = hidden_dim * 4
        elif bidirectional or attn_concatenate:
            self.rnn_output_dim = hidden_dim * 2
        else:
            self.rnn_output_dim = hidden_dim
        self.attn = ContextAttention(
            self.rnn_output_dim, attn_dropout, sum_along_seq=True
        )

        # FC-Head (Mlp)
        if self.head_hidden_dims is not None:
            head_hidden_dims = [self.rnn_output_dim] + self.head_hidden_dims
            self.rnn_mlp = MLP(
                head_hidden_dims,
                head_activation,
                head_dropout,
                head_batchnorm,
                head_batchnorm_last,
                head_linear_first,
            )

    def _process_rnn_outputs(self, output: Tensor, hidden: Tensor) -> Tensor:
        if self.attn_concatenate:
            if self.bidirectional:
                bi_hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
                attn_inp = torch.cat(
                    [output, bi_hidden.unsqueeze(1).expand_as(output)], dim=2
                )
            else:
                attn_inp = torch.cat(
                    [output, hidden[-1].unsqueeze(1).expand_as(output)], dim=2
                )
        else:
            attn_inp = output

        return self.attn(attn_inp)

    @property
    def attention_weights(self) -> List:
        r"""List with the attention weights

        The shape of the attention weights is $(N, S)$, where $N$ is the batch
        size and $S$ is the length of the sequence
        """
        return self.attn.attn_weights

attention_weights property

attention_weights

List with the attention weights

The shape of the attention weights is \((N, S)\), where \(N\) is the batch size and \(S\) is the length of the sequence
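The weights returned by this property come from a softmax over the sequence, so every row of the \((N, S)\) matrix sums to 1 and can be read as an importance distribution over time steps. Below is a minimal pure-Python sketch of that normalisation; the helper name `context_attention` is illustrative, not part of the library's API:

```python
import math

def context_attention(scores):
    # Softmax over the sequence dimension: given raw scores of shape (N, S),
    # each row becomes a probability distribution over the S time steps.
    weights = []
    for row in scores:
        m = max(row)                             # subtract max for stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

w = context_attention([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
```

Each row of `w` sums to 1, and higher raw scores translate into larger attention weights.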

StackedAttentiveRNN

Bases: BaseWDModelComponent

Text classifier/regressor comprised by a stack of blocks: [RNN + Attention]. This can be used as the deeptext component of a Wide & Deep model or independently by itself.

In addition, there is the option to add a Fully Connected (FC) set of dense layers on top of the attention blocks

Parameters:

Name Type Description Default
vocab_size int

Number of words in the vocabulary

required
embed_dim Optional[int]

Dimension of the word embeddings if non-pretrained word vectors are used

None
embed_matrix Optional[ndarray]

Pretrained word embeddings

None
embed_trainable bool

Boolean indicating if the pretrained embeddings are trainable

True
rnn_type Literal[lstm, gru]

String indicating the type of RNN to use. One of 'lstm' or 'gru'

'lstm'
hidden_dim int

Hidden dim of the RNN

64
bidirectional bool

Boolean indicating whether the stacked RNNs are bidirectional

False
padding_idx int

index of the padding token in the padded-tokenised sequences. The TextPreprocessor class within this library uses fastai's tokenizer where the token index 0 is reserved for the 'unknown' word token. Therefore, the default value is set to 1.

1
n_blocks int

Number of attention blocks. Each block is comprised by an RNN and a Context Attention Encoder

3
attn_concatenate bool

Boolean indicating if the input to the attention mechanism will be the output of the RNN or the output of the RNN concatenated with the last hidden state

False
attn_dropout float

Internal dropout for the attention mechanism

0.1
with_addnorm bool

Boolean indicating if the output of each block will be added to the input and normalised

False
head_hidden_dims Optional[List[int]]

List with the sizes of the dense layers in the head e.g: [128, 64]

None
head_activation str

Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
head_dropout Optional[float]

Dropout of the dense layers in the head

None
head_batchnorm bool

Boolean indicating whether or not to include batch normalization in the dense layers that form the 'rnn_mlp'

False
head_batchnorm_last bool

Boolean indicating whether or not to apply batch normalization to the last of the dense layers in the head

False
head_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

False
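The feature width seen by each attention block follows directly from hidden_dim and the flags above: bidirectional and attn_concatenate each double it. A quick sketch of that rule, mirroring the constructor's logic (the helper name is hypothetical):

```python
def rnn_output_dim(hidden_dim, bidirectional, attn_concatenate):
    # Each flag doubles the feature width the attention layer sees:
    # bidirectional concatenates forward/backward states, and
    # attn_concatenate appends the last hidden state to the RNN output.
    multiplier = (2 if bidirectional else 1) * (2 if attn_concatenate else 1)
    return hidden_dim * multiplier

print(rnn_output_dim(64, bidirectional=True, attn_concatenate=True))    # 256
print(rnn_output_dim(64, bidirectional=False, attn_concatenate=False))  # 64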

Attributes:

Name Type Description
word_embed Module

word embedding matrix

rnn Module

Stack of RNNs

rnn_mlp Module

Stack of dense layers on top of the RNN. This will only exist if head_layers_dim is not None

Examples:

>>> import torch
>>> from pytorch_widedeep.models import StackedAttentiveRNN
>>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
>>> model = StackedAttentiveRNN(vocab_size=4, hidden_dim=4, padding_idx=0, embed_dim=4)
>>> out = model(X_text)
Source code in pytorch_widedeep/models/text/rnns/stacked_attentive_rnn.py
class StackedAttentiveRNN(BaseWDModelComponent):
    r"""Text classifier/regressor comprised by a stack of blocks:
    `[RNN + Attention]`. This can be used as the `deeptext` component of a
    Wide & Deep model or independently by itself.

    In addition, there is the option to add a Fully Connected (FC) set of
    dense layers on top of the attention blocks

    Parameters
    ----------
    vocab_size: int
        Number of words in the vocabulary
    embed_dim: int, Optional, default = None
        Dimension of the word embeddings if non-pretrained word vectors are
        used
    embed_matrix: np.ndarray, Optional, default = None
        Pretrained word embeddings
    embed_trainable: bool, default = True
        Boolean indicating if the pretrained embeddings are trainable
    rnn_type: str, default = 'lstm'
        String indicating the type of RNN to use. One of 'lstm' or 'gru'
    hidden_dim: int, default = 64
        Hidden dim of the RNN
    bidirectional: bool, default = False
        Boolean indicating whether the stacked RNNs are bidirectional
    padding_idx: int, default = 1
        index of the padding token in the padded-tokenised sequences. The
        `TextPreprocessor` class within this library uses fastai's
        tokenizer where the token index 0 is reserved for the _'unknown'_
        word token. Therefore, the default value is set to 1.
    n_blocks: int, default = 3
        Number of attention blocks. Each block is comprised by an RNN and a
        Context Attention Encoder
    attn_concatenate: bool, default = False
        Boolean indicating if the input to the attention mechanism will be the
        output of the RNN or the output of the RNN concatenated with the last
        hidden state
    attn_dropout: float, default = 0.1
        Internal dropout for the attention mechanism
    with_addnorm: bool, default = False
        Boolean indicating if the output of each block will be added to the
        input and normalised
    head_hidden_dims: List, Optional, default = None
        List with the sizes of the dense layers in the head e.g: [128, 64]
    head_activation: str, default = "relu"
        Activation function for the dense layers in the head. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    head_dropout: float, Optional, default = None
        Dropout of the dense layers in the head
    head_batchnorm: bool, default = False
        Boolean indicating whether or not to include batch normalization in
        the dense layers that form the _'rnn_mlp'_
    head_batchnorm_last: bool, default = False
        Boolean indicating whether or not to apply batch normalization to the
        last of the dense layers in the head
    head_linear_first: bool, default = False
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    word_embed: nn.Module
        word embedding matrix
    rnn: nn.Module
        Stack of RNNs
    rnn_mlp: nn.Module
        Stack of dense layers on top of the RNN. This will only exist if
        `head_layers_dim` is not `None`

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import StackedAttentiveRNN
    >>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
    >>> model = StackedAttentiveRNN(vocab_size=4, hidden_dim=4, padding_idx=0, embed_dim=4)
    >>> out = model(X_text)
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: Optional[int] = None,
        embed_matrix: Optional[np.ndarray] = None,
        embed_trainable: bool = True,
        rnn_type: Literal["lstm", "gru"] = "lstm",
        hidden_dim: int = 64,
        bidirectional: bool = False,
        padding_idx: int = 1,
        n_blocks: int = 3,
        attn_concatenate: bool = False,
        attn_dropout: float = 0.1,
        with_addnorm: bool = False,
        head_hidden_dims: Optional[List[int]] = None,
        head_activation: str = "relu",
        head_dropout: Optional[float] = None,
        head_batchnorm: bool = False,
        head_batchnorm_last: bool = False,
        head_linear_first: bool = False,
    ):
        super(StackedAttentiveRNN, self).__init__()

        if (
            embed_dim is not None
            and embed_matrix is not None
            and not embed_dim == embed_matrix.shape[1]
        ):
            warnings.warn(
                "the input embedding dimension {} and the dimension of the "
                "pretrained embeddings {} do not match. The pretrained embeddings "
                "dimension ({}) will be used".format(
                    embed_dim, embed_matrix.shape[1], embed_matrix.shape[1]
                ),
                UserWarning,
            )

        if rnn_type.lower() not in ["lstm", "gru"]:
            raise ValueError(
                f"'rnn_type' must be 'lstm' or 'gru', got {rnn_type} instead"
            )

        self.vocab_size = vocab_size
        self.embed_trainable = embed_trainable
        self.embed_dim = embed_dim

        self.rnn_type = rnn_type
        self.hidden_dim = hidden_dim
        self.bidirectional = bidirectional
        self.padding_idx = padding_idx

        self.n_blocks = n_blocks
        self.attn_concatenate = attn_concatenate
        self.attn_dropout = attn_dropout
        self.with_addnorm = with_addnorm

        self.head_hidden_dims = head_hidden_dims
        self.head_activation = head_activation
        self.head_dropout = head_dropout
        self.head_batchnorm = head_batchnorm
        self.head_batchnorm_last = head_batchnorm_last
        self.head_linear_first = head_linear_first

        # Embeddings
        self.word_embed, self.embed_dim = self._set_embeddings(embed_matrix)

        # Linear Projection: if embed_dim is different from the input of the
        # attention blocks we add a linear projection
        if bidirectional and attn_concatenate:
            self.rnn_output_dim = hidden_dim * 4
        elif bidirectional or attn_concatenate:
            self.rnn_output_dim = hidden_dim * 2
        else:
            self.rnn_output_dim = hidden_dim

        if self.rnn_output_dim != self.embed_dim:
            self.embed_proj: Union[nn.Linear, nn.Identity] = nn.Linear(
                self.embed_dim, self.rnn_output_dim
            )
        else:
            self.embed_proj = nn.Identity()

        # RNN
        rnn_params = {
            "input_size": self.rnn_output_dim,
            "hidden_size": hidden_dim,
            "bidirectional": bidirectional,
            "batch_first": True,
        }
        if self.rnn_type.lower() == "lstm":
            self.rnn: Union[nn.LSTM, nn.GRU] = nn.LSTM(**rnn_params)
        elif self.rnn_type.lower() == "gru":
            self.rnn = nn.GRU(**rnn_params)

        # FC-Head (Mlp)
        self.attention_blks = nn.ModuleList()
        for i in range(n_blocks):
            self.attention_blks.append(
                ContextAttentionEncoder(
                    self.rnn,
                    self.rnn_output_dim,
                    attn_dropout,
                    attn_concatenate,
                    with_addnorm=with_addnorm if i != n_blocks - 1 else False,
                    sum_along_seq=i == n_blocks - 1,
                )
            )

        # Mlp
        if self.head_hidden_dims is not None:
            head_hidden_dims = [self.rnn_output_dim] + self.head_hidden_dims
            self.rnn_mlp: Union[MLP, nn.Identity] = MLP(
                head_hidden_dims,
                head_activation,
                head_dropout,
                head_batchnorm,
                head_batchnorm_last,
                head_linear_first,
            )
        else:
            # simple hack to add readability in the forward pass
            self.rnn_mlp = nn.Identity()

    def forward(self, X: Tensor) -> Tensor:  # type: ignore
        x = self.embed_proj(self.word_embed(X.long()))

        h = nn.init.zeros_(
            torch.Tensor(2 if self.bidirectional else 1, X.shape[0], self.hidden_dim)
        ).to(x.device)
        if self.rnn_type == "lstm":
            c = nn.init.zeros_(
                torch.Tensor(
                    2 if self.bidirectional else 1, X.shape[0], self.hidden_dim
                )
            ).to(x.device)
        else:
            c = None

        for blk in self.attention_blks:
            x, h, c = blk(x, h, c)

        return self.rnn_mlp(x)

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return (
            self.head_hidden_dims[-1]
            if self.head_hidden_dims is not None
            else self.rnn_output_dim
        )

    @property
    def attention_weights(self) -> List:
        r"""List with the attention weights per block

        The shape of the attention weights is $(N, S)$, where $N$ is the batch
        size and $S$ is the length of the sequence
        """
        return [blk.attn.attn_weights for blk in self.attention_blks]

    def _set_embeddings(
        self, embed_matrix: Union[Any, np.ndarray]
    ) -> Tuple[nn.Module, int]:
        if isinstance(embed_matrix, np.ndarray):
            assert (
                embed_matrix.dtype == "float32"
            ), "'embed_matrix' must be of dtype 'float32', got dtype '{}'".format(
                str(embed_matrix.dtype)
            )
            word_embed = nn.Embedding(
                self.vocab_size, embed_matrix.shape[1], padding_idx=self.padding_idx
            )
            if self.embed_trainable:
                word_embed.weight = nn.Parameter(
                    torch.tensor(embed_matrix), requires_grad=True
                )
            else:
                word_embed.weight = nn.Parameter(
                    torch.tensor(embed_matrix), requires_grad=False
                )
            embed_dim = embed_matrix.shape[1]
        else:
            assert self.embed_dim is not None
            word_embed = nn.Embedding(
                self.vocab_size, self.embed_dim, padding_idx=self.padding_idx
            )
            embed_dim = self.embed_dim

        return word_embed, embed_dim

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class
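In other words, the component's output width is taken from the last dense layer of the head when one exists, and from the attention block width otherwise. A sketch of that selection (the helper name is illustrative):

```python
def resolve_output_dim(head_hidden_dims, rnn_output_dim):
    # If an FC head is present, its last layer defines the component's
    # output dimension; otherwise the attention block width does.
    return head_hidden_dims[-1] if head_hidden_dims is not None else rnn_output_dim

print(resolve_output_dim([128, 64], 256))  # 64
print(resolve_output_dim(None, 256))       # 256
```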

attention_weights property

attention_weights

List with the attention weights per block

The shape of the attention weights is \((N, S)\), where \(N\) is the batch size and \(S\) is the length of the sequence

HFModel

Bases: BaseWDModelComponent

This class is a wrapper around the Hugging Face transformers library. It can be used as the text component of a Wide & Deep model or independently by itself.

At the moment only models from the families BERT, RoBERTa, DistilBERT, ALBERT and ELECTRA are supported. This is because this library is designed to address classification and regression tasks and these are the most 'popular' encoder-only models, which have proved to be those that work best for these tasks.

Parameters:

Name Type Description Default
model_name str

The model name from the transformers library e.g. 'bert-base-uncased'. Currently supported models are those from the families: BERT, RoBERTa, DistilBERT, ALBERT and ELECTRA.

required
use_cls_token bool

Boolean indicating whether to use the [CLS] token or the mean of the sequence of hidden states as the sentence embedding

True
trainable_parameters Optional[List[str]]

List with the names of the model parameters that will be trained. If None, none of the parameters will be trainable

None
head_hidden_dims Optional[List[int]]

List with the sizes of the dense layers in the head e.g: [128, 64]

None
head_activation str

Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
head_dropout Optional[float]

Dropout of the dense layers in the head

None
head_batchnorm bool

Boolean indicating whether or not to include batch normalization in the dense layers that form the head

False
head_batchnorm_last bool

Boolean indicating whether or not to apply batch normalization to the last of the dense layers in the head

False
head_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

False
verbose bool

If True, it will print information about the model

False
**kwargs

Additional kwargs to be passed to the model

{}

Attributes:

Name Type Description
head Module

Stack of dense layers on top of the transformer. This will only exist if head_layers_dim is not None

Examples:

>>> import torch
>>> from pytorch_widedeep.models import HFModel
>>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1).long()
>>> model = HFModel(model_name='bert-base-uncased')
>>> out = model(X_text)
Source code in pytorch_widedeep/models/text/huggingface_transformers/hf_model.py
class HFModel(BaseWDModelComponent):
    """This class is a wrapper around the Hugging Face transformers library. It
    can be used as the text component of a Wide & Deep model or independently
    by itself.

    At the moment only models from the families BERT, RoBERTa, DistilBERT,
    ALBERT and ELECTRA are supported. This is because this library is
    designed to address classification and regression tasks and these are the
    most 'popular' encoder-only models, which have proved to be those that
    work best for these tasks.

    Parameters
    ----------
    model_name: str
        The model name from the transformers library e.g. 'bert-base-uncased'.
        Currently supported models are those from the families: BERT, RoBERTa,
        DistilBERT, ALBERT and ELECTRA.
    use_cls_token: bool, default = True
        Boolean indicating whether to use the [CLS] token or the mean of the
        sequence of hidden states as the sentence embedding
    trainable_parameters: List, Optional, default = None
        List with the names of the model parameters that will be trained. If
        None, none of the parameters will be trainable
    head_hidden_dims: List, Optional, default = None
        List with the sizes of the dense layers in the head e.g: _[128, 64]_
    head_activation: str, default = "relu"
        Activation function for the dense layers in the head. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    head_dropout: float, Optional, default = None
        Dropout of the dense layers in the head
    head_batchnorm: bool, default = False
        Boolean indicating whether or not to include batch normalization in the
        dense layers that form the head
    head_batchnorm_last: bool, default = False
        Boolean indicating whether or not to apply batch normalization to the
        last of the dense layers in the head
    head_linear_first: bool, default = False
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`
    verbose: bool, default = False
        If True, it will print information about the model
    **kwargs
        Additional kwargs to be passed to the model

    Attributes
    ----------
    head: nn.Module
        Stack of dense layers on top of the transformer. This will only exist
        if `head_layers_dim` is not None

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import HFModel
    >>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1).long()
    >>> model = HFModel(model_name='bert-base-uncased')
    >>> out = model(X_text)
    """

    @alias("use_cls_token", ["use_special_token"])
    def __init__(
        self,
        model_name: str,
        use_cls_token: bool = True,
        trainable_parameters: Optional[List[str]] = None,
        head_hidden_dims: Optional[List[int]] = None,
        head_activation: str = "relu",
        head_dropout: Optional[float] = None,
        head_batchnorm: bool = False,
        head_batchnorm_last: bool = False,
        head_linear_first: bool = False,
        verbose: bool = False,
        **kwargs,
    ):
        super().__init__()

        # TO DO: add warning regarding ELECTRA as ELECTRA does not have a cls
        # token.  Research what happens with ELECTRA
        self.model_name = model_name
        self.use_cls_token = use_cls_token
        self.trainable_parameters = trainable_parameters
        self.head_hidden_dims = head_hidden_dims
        self.head_activation = head_activation
        self.head_dropout = head_dropout
        self.head_batchnorm = head_batchnorm
        self.head_batchnorm_last = head_batchnorm_last
        self.head_linear_first = head_linear_first
        self.verbose = verbose
        self.kwargs = kwargs

        if self.verbose and self.use_cls_token:
            warnings.warn(
                "The model will use the [CLS] token. Make sure the tokenizer "
                "was run with add_special_tokens=True",
                UserWarning,
            )

        self.model_class = get_model_class(model_name)

        self.config, self.model = get_config_and_model(self.model_name)

        self.output_attention_weights = kwargs.get("output_attentions", False)

        if self.trainable_parameters is not None:
            for n, p in self.model.named_parameters():
                p.requires_grad = any([tl in n for tl in self.trainable_parameters])

        # FC-Head (Mlp). Note that the FC head will always be trainable
        if self.head_hidden_dims is not None:
            head_hidden_dims = [self.config.hidden_size] + self.head_hidden_dims
            self.head = MLP(
                head_hidden_dims,
                head_activation,
                head_dropout,
                head_batchnorm,
                head_batchnorm_last,
                head_linear_first,
            )

    def forward(self, X: Tensor) -> Tensor:

        # this is inefficient since the attention mask is returned by the
        # tokenizer, but all models in this library use a forward pass that
        # takes ONLY an input tensor. A fix will be addressed in a future release
        attn_mask = (X != 0).type(torch.int8)

        output = self.model(input_ids=X, attention_mask=attn_mask, **self.kwargs)

        if self.output_attention_weights:
            # TO CONSIDER: attention weights as a returned object and not an
            # attribute
            self.attn_weights = output["attentions"]

        if self.use_cls_token:
            output = output[0][:, 0, :]
        else:
            # Here one can choose to flatten, but unless the sequence length
            # is very small, flatten will result in a very large output
            # tensor.
            output = output[0].mean(dim=1)

        if self.head_hidden_dims is not None:
            output = self.head(output)

        return output

    @property
    def output_dim(self) -> int:
        return (
            self.head_hidden_dims[-1]
            if self.head_hidden_dims is not None
            else self.config.hidden_size
        )

    @property
    def attention_weight(self) -> Tensor:
        r"""Returns the attention weights if the model was created with the
        output_attention_weights=True argument. If not, it will raise an
        AttributeError.

        The shape of the attention weights is $(N, H, F, F)$, where $N$ is the
        batch size, $H$ is the number of attention heads and $F$ is the
        sequence length.
        """
        if not self.output_attention_weights:
            raise AttributeError(
                "The output_attention_weights attribute was not set to True when creating the model object "
                "Please pass an output_attention_weights=True argument when creating the HFModel object"
            )
        return self.attn_weights

attention_weight property

attention_weight

Returns the attention weights if the model was created with the output_attention_weights=True argument. If not, it will raise an AttributeError.

The shape of the attention weights is \((N, H, F, F)\), where \(N\) is the batch size, \(H\) is the number of attention heads and \(F\) is the sequence length.
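Internally the forward pass derives the attention mask from the padded input ids, treating token index 0 as padding, so the transformer ignores padded positions. A minimal sketch of that masking step (the helper name is hypothetical, assuming pad index 0):

```python
def build_attention_mask(input_ids, pad_id=0):
    # 1 for real tokens, 0 for padding, per sequence in the batch
    return [[int(tok != pad_id) for tok in seq] for seq in input_ids]

print(build_attention_mask([[101, 7592, 2088, 102, 0, 0]]))  # [[1, 1, 1, 1, 0, 0]]
```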

Vision

Bases: BaseWDModelComponent

Defines a standard image classifier/regressor using a pretrained network or a sequence of convolution layers that can be used as the deepimage component of a Wide & Deep model or independently by itself.

ℹ️ NOTE: this class represents the integration between pytorch-widedeep and torchvision. New architectures will be available as they are added to torchvision. In a distant future we aim to bring transformer-based architectures as well. However, simple CNN-based architectures (and even MLP-based) seem to produce SoTA results. For the time being, we describe below the options available through this class

Parameters:

Name Type Description Default
pretrained_model_setup Union[str, Dict[str, Union[str, WeightsEnum]]]

Name of the pretrained model. Should be a variant of the following architectures: 'resnet', 'shufflenet', 'resnext', 'wide_resnet', 'regnet', 'densenet', 'mobilenetv3', 'mobilenetv2', 'mnasnet', 'efficientnet' and 'squeezenet'. If pretrained_model_setup = None a basic, fully trainable CNN will be used. Alternatively, since Torchvision 0.13 one can use pretrained models with different weights. Therefore, pretrained_model_setup can also be a dictionary with the name of the model and the weights (e.g. {'resnet50': ResNet50_Weights.DEFAULT} or {'resnet50': "IMAGENET1K_V2"}).
Aliased as pretrained_model_name.

None
n_trainable Optional[int]

Number of trainable layers starting from the layer closer to the output neuron(s). Note that this number DOES NOT take into account the so-called 'head' which is ALWAYS trainable. If trainable_params is not None this parameter will be ignored

None
trainable_params Optional[List[str]]

List of strings containing the names (or substring within the name) of the parameters that will be trained. For example, if we use a 'resnet18' pretrained model and we set trainable_params = ['layer4'] only the parameters of 'layer4' of the network (and the head, as mentioned before) will be trained. Note that setting this or the previous parameter involves some knowledge of the architecture used.

None
channel_sizes List[int]

List of integers with the channel sizes of a CNN in case we choose not to use a pretrained model

[64, 128, 256, 512]
kernel_sizes Union[int, List[int]]

List of integers with the kernel sizes of a CNN in case we choose not to use a pretrained model. Must be of length equal to len(channel_sizes) - 1.

[7, 3, 3, 3]
strides Union[int, List[int]]

List of integers with the stride sizes of a CNN in case we choose not to use a pretrained model. Must be of length equal to len(channel_sizes) - 1.

[2, 1, 1, 1]
head_hidden_dims Optional[List[int]]

List with the number of neurons per dense layer in the head. e.g: [64,32]

None
head_activation str

Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
head_dropout Union[float, List[float]]

float indicating the dropout between the dense layers.

0.1
head_batchnorm bool

Boolean indicating whether or not batch normalization will be applied to the dense layers

False
head_batchnorm_last bool

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

False
head_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

False

Attributes:

Name Type Description
features Module

The pretrained model or Standard CNN plus the optional head

Examples:

>>> import torch
>>> from pytorch_widedeep.models import Vision
>>> X_img = torch.rand((2,3,224,224))
>>> model = Vision(channel_sizes=[64, 128], kernel_sizes = [3, 3], strides=[1, 1], head_hidden_dims=[32, 8])
>>> out = model(X_img)
Source code in pytorch_widedeep/models/image/vision.py
class Vision(BaseWDModelComponent):
    r"""Defines a standard image classifier/regressor using a pretrained
    network or a sequence of convolution layers that can be used as the
    `deepimage` component of a Wide & Deep model or independently by
    itself.

    :information_source: **NOTE**: this class represents the integration
     between `pytorch-widedeep` and `torchvision`. New architectures will be
     available as they are added to `torchvision`. In a distant future we aim
     to bring transformer-based architectures as well. However, simple
     CNN-based architectures (and even MLP-based) seem to produce SoTA
     results. For the time being, we describe below the options available
     through this class.

    Parameters
    ----------
    pretrained_model_setup: Optional, str or dict, default = None
        Name of the pretrained model. Should be a variant of the following
        architectures: _'resnet'_, _'shufflenet'_, _'resnext'_,
        _'wide_resnet'_, _'regnet'_, _'densenet'_, _'mobilenetv3'_,
        _'mobilenetv2'_, _'mnasnet'_, _'efficientnet'_ and _'squeezenet'_. If
        `pretrained_model_setup = None` a basic, fully trainable CNN will be
        used. Alternatively, since Torchvision 0.13 one can use pretrained
        models with different weights. Therefore, `pretrained_model_setup` can
        also be a dictionary with the name of the model and the weights (e.g.
        `{'resnet50': ResNet50_Weights.DEFAULT}` or
        `{'resnet50': "IMAGENET1K_V2"}`). <br/> Aliased as `pretrained_model_name`.
    n_trainable: Optional, int, default = None
        Number of trainable layers starting from the layer closer to the
        output neuron(s). Note that this number DOES NOT take into account
        the so-called _'head'_ which is ALWAYS trainable. If
        `trainable_params` is not None this parameter will be ignored
    trainable_params: Optional, list, default = None
        List of strings containing the names (or substring within the name) of
        the parameters that will be trained. For example, if we use a
        _'resnet18'_ pretrained model and we set `trainable_params =
        ['layer4']` only the parameters of _'layer4'_ of the network
        (and the head, as mentioned before) will be trained. Note that
        setting this or the previous parameter involves some knowledge of
        the architecture used.
    channel_sizes: list, default = [64, 128, 256, 512]
        List of integers with the channel sizes of a CNN in case we choose not
        to use a pretrained model
    kernel_sizes: list or int, default = [7, 3, 3, 3]
        List of integers with the kernel sizes of a CNN in case we choose not
        to use a pretrained model. Must be of length equal to `len(channel_sizes) - 1`.
    strides: list or int, default = [2, 1, 1, 1]
        List of integers with the stride sizes of a CNN in case we choose not
        to use a pretrained model. Must be of length equal to `len(channel_sizes) - 1`.
    head_hidden_dims: Optional, list, default = None
        List with the number of neurons per dense layer in the head. e.g: _[64,32]_
    head_activation: str, default = "relu"
        Activation function for the dense layers in the head. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    head_dropout: float, default = 0.1
        float indicating the dropout between the dense layers.
    head_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    head_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    head_linear_first: bool, default = False
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    features: nn.Module
        The pretrained model or Standard CNN plus the optional head

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models import Vision
    >>> X_img = torch.rand((2,3,224,224))
    >>> model = Vision(channel_sizes=[64, 128], kernel_sizes = [3, 3], strides=[1, 1], head_hidden_dims=[32, 8])
    >>> out = model(X_img)
    """

    @alias("pretrained_model_setup", ["pretrained_model_name"])
    def __init__(
        self,
        pretrained_model_setup: Union[str, Dict[str, Union[str, WeightsEnum]]] = None,
        n_trainable: Optional[int] = None,
        trainable_params: Optional[List[str]] = None,
        channel_sizes: List[int] = [64, 128, 256, 512],
        kernel_sizes: Union[int, List[int]] = [7, 3, 3, 3],
        strides: Union[int, List[int]] = [2, 1, 1, 1],
        head_hidden_dims: Optional[List[int]] = None,
        head_activation: str = "relu",
        head_dropout: Union[float, List[float]] = 0.1,
        head_batchnorm: bool = False,
        head_batchnorm_last: bool = False,
        head_linear_first: bool = False,
    ):
        super(Vision, self).__init__()

        self._check_pretrained_model_setup(
            pretrained_model_setup, n_trainable, trainable_params
        )

        self.pretrained_model_setup = pretrained_model_setup
        self.n_trainable = n_trainable
        self.trainable_params = trainable_params
        self.channel_sizes = channel_sizes
        self.kernel_sizes = kernel_sizes
        self.strides = strides
        self.head_hidden_dims = head_hidden_dims
        self.head_activation = head_activation
        self.head_dropout = head_dropout
        self.head_batchnorm = head_batchnorm
        self.head_batchnorm_last = head_batchnorm_last
        self.head_linear_first = head_linear_first

        self.features, self.backbone_output_dim = self._get_features()

        if pretrained_model_setup is not None:
            self._freeze(self.features)

        if self.head_hidden_dims is not None:
            head_hidden_dims = [self.backbone_output_dim] + self.head_hidden_dims
            self.vision_mlp = MLP(
                head_hidden_dims,
                self.head_activation,
                self.head_dropout,
                self.head_batchnorm,
                self.head_batchnorm_last,
                self.head_linear_first,
            )

    def forward(self, X: Tensor) -> Tensor:
        x = self.features(X)

        if len(x.shape) > 2:
            if x.shape[2] > 1:
                x = nn.functional.adaptive_avg_pool2d(x, (1, 1))
            x = torch.flatten(x, 1)

        if self.head_hidden_dims is not None:
            x = self.vision_mlp(x)

        return x

    @property
    def output_dim(self) -> int:
        r"""The output dimension of the model. This is a required property
        necessary to build the `WideDeep` class
        """
        return (
            self.head_hidden_dims[-1]
            if self.head_hidden_dims is not None
            else self.backbone_output_dim
        )

    def _get_features(self) -> Tuple[nn.Module, int]:
        if self.pretrained_model_setup is not None:
            if isinstance(self.pretrained_model_setup, str):
                if self.pretrained_model_setup in allowed_pretrained_models.keys():
                    model = allowed_pretrained_models[self.pretrained_model_setup]
                    pretrained_model = torchvision.models.__dict__[model](
                        weights=torchvision.models.get_model_weights(model).DEFAULT
                    )
                    warnings.warn(
                        f"{self.pretrained_model_setup} defaulting to {model}",
                        UserWarning,
                    )
                else:
                    pretrained_model = torchvision.models.__dict__[
                        self.pretrained_model_setup
                    ](weights="IMAGENET1K_V1")

            elif isinstance(self.pretrained_model_setup, Dict):
                model_name = next(iter(self.pretrained_model_setup))
                model_weights = self.pretrained_model_setup[model_name]

                if model_name in allowed_pretrained_models.keys():
                    model_name = allowed_pretrained_models[model_name]

                pretrained_model = torchvision.models.__dict__[model_name](
                    weights=model_weights
                )
            output_dim: int = self.get_backbone_output_dim(pretrained_model)
            features = nn.Sequential(*(list(pretrained_model.children())[:-1]))

        else:
            features = self._basic_cnn()
            output_dim = self.channel_sizes[-1]

        return features, output_dim

    def _basic_cnn(self):
        channel_sizes = [3] + self.channel_sizes
        kernel_sizes = (
            [self.kernel_sizes] * len(self.channel_sizes)
            if isinstance(self.kernel_sizes, int)
            else self.kernel_sizes
        )
        strides = (
            [self.strides] * len(self.channel_sizes)
            if isinstance(self.strides, int)
            else self.strides
        )

        BasicCNN = nn.Sequential()
        for i in range(1, len(channel_sizes)):
            BasicCNN.add_module(
                "conv_layer_{}".format(i - 1),
                conv_layer(
                    channel_sizes[i - 1],
                    channel_sizes[i],
                    kernel_sizes[i - 1],
                    strides[i - 1],
                    maxpool=i == 1,
                    adaptiveavgpool=i == len(channel_sizes) - 1,
                ),
            )
        return BasicCNN

    def _freeze(self, features):
        if self.trainable_params is not None:
            for name, param in features.named_parameters():
                # trainable if ANY of the given (sub)strings matches the name
                param.requires_grad = any(tl in name for tl in self.trainable_params)
        elif self.n_trainable is not None:
            for i, (name, param) in enumerate(
                reversed(list(features.named_parameters()))
            ):
                param.requires_grad = i < self.n_trainable
        else:
            warnings.warn(
                "Both 'trainable_params' and 'n_trainable' are 'None' and the entire network will be trained",
                UserWarning,
            )

    @staticmethod
    def get_backbone_output_dim(features):
        try:
            return features.fc.in_features
        except AttributeError:
            try:
                features.classifier.__dict__["_modules"]["0"].in_features
            except AttributeError:
                try:
                    return features.classifier.__dict__["_modules"]["1"].in_features
                except AttributeError:
                    return features.classifier.__dict__["_modules"]["1"].in_channels

    @staticmethod
    def _check_pretrained_model_setup(
        pretrained_model_setup, n_trainable, trainable_params
    ):
        if pretrained_model_setup is not None:
            if isinstance(pretrained_model_setup, str):
                pretrained_model_name = pretrained_model_setup
            elif isinstance(pretrained_model_setup, Dict):
                pretrained_model_name = list(pretrained_model_setup.keys())[0]
        else:
            pretrained_model_name = None

        if pretrained_model_name is not None:
            valid_pretrained_model_name = any(
                [name in pretrained_model_name for name in allowed_pretrained_models]
            )
            if not valid_pretrained_model_name:
                raise ValueError(
                    f"{pretrained_model_setup} is not among the allowed pretrained models."
                    f" These are {allowed_pretrained_models.keys()}. Please choose a variant of these architectures"
                )
            if n_trainable is not None and trainable_params is not None:
                warnings.warn(
                    "Both 'n_trainable' and 'trainable_params' are not None. 'trainable_params' will be used",
                    UserWarning,
                )

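The two freezing strategies in `_freeze` can be sketched without `torch`: either keep a parameter trainable when its name contains any of the `trainable_params` substrings, or unfreeze only the `n_trainable` parameters closest to the output. The helper names below are illustrative, not part of the library:

```python
def freeze_by_names(param_names, trainable_params):
    # a parameter stays trainable if ANY substring in trainable_params matches its name
    return {name: any(tl in name for tl in trainable_params) for name in param_names}


def freeze_by_count(param_names, n_trainable):
    # walk from the output end: only the last n_trainable parameters remain trainable
    flags = {}
    for i, name in enumerate(reversed(param_names)):
        flags[name] = i < n_trainable
    return flags


names = ["layer1.weight", "layer3.weight", "layer4.weight", "fc.weight"]
print(freeze_by_names(names, ["layer4"]))  # only 'layer4.weight' is True
print(freeze_by_count(names, 2))  # 'fc.weight' and 'layer4.weight' are True
```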
output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class.

ModelFuser

Bases: BaseWDModelComponent

This class is a wrapper around a list of models that are associated to the different text and/or image columns (and datasets). The class is designed to 'fuse' the models using a variety of methods.

Parameters:

Name Type Description Default
models List[BaseWDModelComponent]

List of models whose outputs will be fused

required
fusion_method Union[Literal[concatenate, mean, max, sum, mult, dot, head], List[Literal[concatenate, mean, max, sum, mult, head]]]

Method to fuse the output of the models. It can be one of ['concatenate', 'mean', 'max', 'sum', 'mult', 'dot', 'head'] or a list of any of those except 'dot'. If a list is provided, the outputs of the models will be fused using all the methods in the list and the final output will be the concatenation of the outputs of each method

required
projection_method Optional[Literal[min, max, mean]]

If the fusion_method is not 'concatenate', this parameter will determine how to project the output of the models to a common dimension. It can be one of ['min', 'max', 'mean']. Default is None

None
custom_head Optional[Union[BaseWDModelComponent, Module]]

Custom head to be used to fuse the output of the models. If provided, this will take precedence over head_hidden_dims. Also, if provided, 'projection_method' will be ignored.

None
head_hidden_dims Optional[List[int]]

List with the number of neurons per layer in the custom head. If custom_head is provided, this parameter will be ignored

None
head_activation Optional[str]

Activation function to be used in the custom head. Default is None

None
head_dropout Optional[float]

Dropout to be used in the custom head. Default is None

None
head_batchnorm Optional[bool]

Whether to use batchnorm in the custom head. Default is None

None
head_batchnorm_last Optional[bool]

Whether or not batch normalization will be applied to the last of the dense layers

None
head_linear_first Optional[bool]

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

None

Attributes:

Name Type Description
head Module or BaseWDModelComponent

Custom head to be used to fuse the output of the models. If custom_head is provided, this will take precedence over head_hidden_dims

Examples:

>>> from pytorch_widedeep.preprocessing import TextPreprocessor
>>> from pytorch_widedeep.models import BasicRNN, ModelFuser
>>> import torch
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'text_col1': ['hello world', 'this is a test'],
... 'text_col2': ['goodbye world', 'this is another test']})
>>> text_preprocessor_1 = TextPreprocessor(
...     text_col="text_col1",
...     max_vocab=10,
...     min_freq=1,
...     maxlen=5,
...     n_cpus=1,
...     verbose=0)
>>> text_preprocessor_2 = TextPreprocessor(
...     text_col="text_col2",
...     max_vocab=10,
...     min_freq=1,
...     maxlen=5,
...     n_cpus=1,
...     verbose=0)
>>> X_text1 = text_preprocessor_1.fit_transform(df)
>>> X_text2 = text_preprocessor_2.fit_transform(df)
>>> X_text1_tnsr = torch.from_numpy(X_text1)
>>> X_text2_tnsr = torch.from_numpy(X_text2)
>>> rnn1 = BasicRNN(
...     vocab_size=len(text_preprocessor_1.vocab.itos),
...     embed_dim=4,
...     hidden_dim=4,
...     n_layers=1,
...     bidirectional=False)
>>> rnn2 = BasicRNN(
...     vocab_size=len(text_preprocessor_2.vocab.itos),
...     embed_dim=4,
...     hidden_dim=4,
...     n_layers=1,
...     bidirectional=False)
>>> fused_model = ModelFuser(models=[rnn1, rnn2], fusion_method='concatenate')
>>> out = fused_model([X_text1_tnsr, X_text2_tnsr])
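The resulting output dimension follows directly from the chosen fusion method: 'concatenate' sums the component output dimensions, 'dot' collapses to 1, and the element-wise fusions ('mean', 'max', 'sum', 'mult') require a common dimension, obtained via `projection_method` when the components differ. A pure-Python sketch of that arithmetic (the function name is illustrative, not the library's API):

```python
def fused_output_dim(output_dims, fusion_method, projection_method=None):
    """Sketch of how ModelFuser's output_dim is derived (illustrative only)."""
    if fusion_method == "concatenate":
        return sum(output_dims)  # outputs are stacked side by side
    if fusion_method == "dot":
        return 1  # dot product of two model outputs is a scalar per sample
    # element-wise fusions need all components projected to a common dimension
    if projection_method == "mean":
        return sum(output_dims) // len(output_dims)
    if projection_method == "min":
        return min(output_dims)
    if projection_method == "max":
        return max(output_dims)
    if len(set(output_dims)) == 1:  # already equal, no projection needed
        return output_dims[0]
    raise ValueError("projection_method must be one of ['min', 'max', 'mean']")


print(fused_output_dim([4, 4], "concatenate"))  # 8
print(fused_output_dim([4, 8], "mean", projection_method="max"))  # 8
```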
Source code in pytorch_widedeep/models/model_fusion.py
class ModelFuser(BaseWDModelComponent):
    """
    This class is a wrapper around a list of models that are associated to the
    different text and/or image columns (and datasets). The class is designed
    to 'fuse' the models using a variety of methods.

    Parameters
    ----------
    models: List[BaseWDModelComponent]
        List of models whose outputs will be fused
    fusion_method: Union[str, List[str]]
        Method to fuse the output of the models. It can be one of
        ['concatenate', 'mean', 'max', 'sum', 'mult', 'dot', 'head'] or a
        list of any of those except 'dot'. If a list is provided, the output
        of the models will be fused using all the methods in the list and the
        final output will be the concatenation of the outputs of each method
    projection_method: Optional[str]
        If the fusion_method is not 'concatenate', this parameter will
        determine how to project the output of the models to a common
        dimension. It can be one of ['min', 'max', 'mean']. Default is None
    custom_head: Optional[BaseWDModelComponent | nn.Module]
        Custom head to be used to fuse the output of the models. If provided,
        this will take precedence over head_hidden_dims. Also, if
        provided, 'projection_method' will be ignored.
    head_hidden_dims: Optional[List[int]]
        List with the number of neurons per layer in the custom head. If
        custom_head is provided, this parameter will be ignored
    head_activation: Optional[str]
        Activation function to be used in the custom head. Default is None
    head_dropout: Optional[float]
        Dropout to be used in the custom head. Default is None
    head_batchnorm: Optional[bool]
        Whether to use batchnorm in the custom head. Default is None
    head_batchnorm_last: Optional[bool]
        Whether or not batch normalization will be applied to the last of the
        dense layers
    head_linear_first: Optional[bool]
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    head: nn.Module or BaseWDModelComponent
        Custom head to be used to fuse the output of the models. If
        custom_head is provided, this will take precedence over
        head_hidden_dims

    Examples
    --------
    >>> from pytorch_widedeep.preprocessing import TextPreprocessor
    >>> from pytorch_widedeep.models import BasicRNN, ModelFuser
    >>> import torch
    >>> import pandas as pd
    >>>
    >>> df = pd.DataFrame({'text_col1': ['hello world', 'this is a test'],
    ... 'text_col2': ['goodbye world', 'this is another test']})
    >>> text_preprocessor_1 = TextPreprocessor(
    ...     text_col="text_col1",
    ...     max_vocab=10,
    ...     min_freq=1,
    ...     maxlen=5,
    ...     n_cpus=1,
    ...     verbose=0)
    >>> text_preprocessor_2 = TextPreprocessor(
    ...     text_col="text_col2",
    ...     max_vocab=10,
    ...     min_freq=1,
    ...     maxlen=5,
    ...     n_cpus=1,
    ...     verbose=0)
    >>> X_text1 = text_preprocessor_1.fit_transform(df)
    >>> X_text2 = text_preprocessor_2.fit_transform(df)
    >>> X_text1_tnsr = torch.from_numpy(X_text1)
    >>> X_text2_tnsr = torch.from_numpy(X_text2)
    >>> rnn1 = BasicRNN(
    ...     vocab_size=len(text_preprocessor_1.vocab.itos),
    ...     embed_dim=4,
    ...     hidden_dim=4,
    ...     n_layers=1,
    ...     bidirectional=False)
    >>> rnn2 = BasicRNN(
    ...     vocab_size=len(text_preprocessor_2.vocab.itos),
    ...     embed_dim=4,
    ...     hidden_dim=4,
    ...     n_layers=1,
    ...     bidirectional=False)
    >>> fused_model = ModelFuser(models=[rnn1, rnn2], fusion_method='concatenate')
    >>> out = fused_model([X_text1_tnsr, X_text2_tnsr])
    """

    def __init__(
        self,
        models: List[BaseWDModelComponent],
        *,
        fusion_method: Union[
            Literal[
                "concatenate",
                "mean",
                "max",
                "sum",
                "mult",
                "dot",
                "head",
            ],
            List[Literal["concatenate", "mean", "max", "sum", "mult", "head"]],
        ],
        projection_method: Optional[Literal["min", "max", "mean"]] = None,
        custom_head: Optional[Union[BaseWDModelComponent, nn.Module]] = None,
        head_hidden_dims: Optional[List[int]] = None,
        head_activation: Optional[str] = None,
        head_dropout: Optional[float] = None,
        head_batchnorm: Optional[bool] = None,
        head_batchnorm_last: Optional[bool] = None,
        head_linear_first: Optional[bool] = None,
    ) -> None:
        super(ModelFuser, self).__init__()

        self.models = nn.ModuleList(models)
        self.fusion_method = fusion_method
        self.projection_method = projection_method

        self.all_output_dim_equal = all(
            model.output_dim == self.models[0].output_dim for model in self.models
        )

        self.check_input_parameters()

        if self.fusion_method == "head":
            assert (
                head_hidden_dims is not None or custom_head is not None
            ), "When using 'head' as fusion_method, either head_hidden_dims or custom_head must be provided"
            if custom_head is not None:
                # custom_head takes precedence over head_hidden_dims (in case
                # both are provided)
                assert hasattr(
                    custom_head, "output_dim"
                ), "custom_head must have an 'output_dim' property"
                self.head: Union[BaseWDModelComponent, nn.Module] = custom_head
            else:
                assert head_hidden_dims is not None
                self.head_hidden_dims = head_hidden_dims
                self.head_activation = head_activation
                self.head_dropout = head_dropout
                self.head_batchnorm = head_batchnorm
                self.head_batchnorm_last = head_batchnorm_last
                self.head_linear_first = head_linear_first

                self.head = MLP(
                    d_hidden=[sum([model.output_dim for model in self.models])]
                    + self.head_hidden_dims,
                    activation=(
                        "relu" if self.head_activation is None else self.head_activation
                    ),
                    dropout=0.0 if self.head_dropout is None else self.head_dropout,
                    batchnorm=(
                        False if self.head_batchnorm is None else self.head_batchnorm
                    ),
                    batchnorm_last=(
                        False
                        if self.head_batchnorm_last is None
                        else self.head_batchnorm_last
                    ),
                    linear_first=(
                        True
                        if self.head_linear_first is None
                        else self.head_linear_first
                    ),
                )

    def forward(self, X: List[Tensor]) -> Tensor:  # noqa: C901
        if self.fusion_method == "head":
            return self._head_fusion(X)
        elif self.fusion_method == "dot":
            return self._dot_fusion(X)
        else:
            return self._other_fusions(X)

    def _head_fusion(self, X: List[Tensor]) -> Tensor:
        return self.head(torch.cat([model(x) for model, x in zip(self.models, X)], -1))

    def _dot_fusion(self, X: List[Tensor]) -> Tensor:
        assert (
            len(X) == 2
        ), "When using 'dot' as fusion_method, only two models can be fused"
        outputs = [model(x) for model, x in zip(self.models, X)]
        return torch.bmm(outputs[1].unsqueeze(1), outputs[0].unsqueeze(2)).view(-1, 1)

    def _other_fusions(self, X: List[Tensor]) -> Tensor:
        fusion_methods = (
            [self.fusion_method]
            if isinstance(self.fusion_method, str)
            else self.fusion_method
        )
        fused_outputs = [self._apply_fusion_method(fm, X) for fm in fusion_methods]  # type: ignore[attr-defined]
        return (
            fused_outputs[0] if len(fused_outputs) == 1 else torch.cat(fused_outputs, 1)
        )

    def _apply_fusion_method(self, fusion_method: str, X: List[Tensor]) -> Tensor:
        if fusion_method == "concatenate":
            return torch.cat([model(x) for model, x in zip(self.models, X)], 1)

        model_outputs = [model(x) for model, x in zip(self.models, X)]
        projections = self._project(model_outputs)
        stacked = torch.stack(projections, -1)

        if fusion_method == "mean":
            return torch.mean(stacked, -1)
        elif fusion_method == "max":
            return torch.max(stacked, -1)[0]
        elif fusion_method == "min":
            return torch.min(stacked, -1)[0]
        elif fusion_method == "sum":
            return torch.sum(stacked, -1)
        elif fusion_method == "mult":
            return torch.prod(stacked, -1)
        else:
            raise ValueError(f"Unsupported fusion method: {fusion_method}")

    def _project(self, X: List[Tensor]) -> List[Tensor]:
        r"""Projects the output of the models to a common dimension."""

        if self.all_output_dim_equal and self.projection_method is None:
            return X

        output_dims = [model.output_dim for model in self.models]

        if self.projection_method == "min":
            proj_dim = min(output_dims)
            idx = output_dims.index(proj_dim)
        elif self.projection_method == "max":
            proj_dim = max(output_dims)
            idx = output_dims.index(proj_dim)
        elif self.projection_method == "mean":
            proj_dim = int(sum(output_dims) / len(output_dims))
            idx = None
        else:
            raise ValueError("projection_method must be one of ['min', 'max', 'mean']")

        x_proj: List[Tensor] = []
        for i, x in enumerate(X):
            if i == idx:
                x_proj.append(x)
            else:
                x_proj.append(
                    nn.Linear(output_dims[i], proj_dim, bias=False, device=x.device)(x)
                )

        return x_proj

    @property
    def output_dim(self) -> int:
        r"""Returns the output dimension of the model."""
        if self.fusion_method == "head":
            output_dim = (
                self.head_hidden_dims[-1]
                if hasattr(self, "head_hidden_dims")
                else self.head.output_dim
            )
        elif self.fusion_method == "dot":
            output_dim = 1
        else:
            output_dim = 0
            if isinstance(self.fusion_method, str):
                fusion_methods = [self.fusion_method]
            else:
                fusion_methods = self.fusion_method  # type: ignore
            for fm in fusion_methods:
                if fm == "concatenate":
                    output_dim += sum([model.output_dim for model in self.models])
                elif self.projection_method == "mean":
                    output_dim += int(
                        sum([model.output_dim for model in self.models])
                        / len(self.models)
                    )
                elif self.projection_method == "min":
                    output_dim += min([model.output_dim for model in self.models])
                elif self.projection_method == "max":
                    output_dim += max([model.output_dim for model in self.models])
                elif self.all_output_dim_equal:
                    output_dim += self.models[0].output_dim
                else:
                    raise ValueError(
                        "projection_method must be one of ['min', 'max', 'mean']"
                    )

        return output_dim

    def check_input_parameters(self):
        self._validate_tabnet()
        self._validate_fusion_methods()
        self._validate_projection_requirements()
        self._validate_head_dot_exclusivity()

    def _validate_tabnet(self):
        if any(isinstance(model, TabNet) for model in self.models):
            raise ValueError(
                "TabNet is not supported in ModelFuser. "
                "Please, use another model for tabular data"
            )

    def _validate_fusion_methods(self):
        valid_methods = [
            "concatenate",
            "min",
            "max",
            "mean",
            "sum",
            "mult",
            "dot",
            "head",
        ]

        if isinstance(self.fusion_method, str):
            if self.fusion_method not in valid_methods:
                raise ValueError(
                    "fusion_method must be one of ['concatenate', 'mean', 'max', 'sum', 'mult', 'dot', 'head'] "
                    "or a list of any those but 'dot'"
                )
        else:
            if not all(fm in valid_methods for fm in self.fusion_method):
                raise ValueError(
                    "fusion_method must be one of ['concatenate', 'mean', 'max', 'sum', 'mult', 'dot', 'head'] "
                    "or a list of those but 'dot'"
                )

    def _validate_projection_requirements(self):
        needs_projection = not self.all_output_dim_equal
        has_size_dependent_method = False

        if isinstance(self.fusion_method, str):
            has_size_dependent_method = self.fusion_method in ["min", "max", "mean"]
        else:
            has_size_dependent_method = all(
                fm in ["min", "max", "mean"] for fm in self.fusion_method
            )

        if has_size_dependent_method and needs_projection:
            if self.projection_method is None:
                raise ValueError(
                    "If 'fusion_method' is (or includes) 'min', 'max' or 'mean' "
                    "and the output dimensions of the models are not equal, "
                    "'projection_method' must be provided"
                )
            elif self.projection_method not in ["min", "max", "mean"]:
                raise ValueError(
                    "projection_method must be one of ['min', 'max', 'mean']"
                )

    def _validate_head_dot_exclusivity(self):
        if isinstance(self.fusion_method, list):
            if any(method in ["head", "dot"] for method in self.fusion_method):
                raise ValueError(
                    "When using 'head' or 'dot' as fusion_method, no other method should be provided"
                )

    def __repr__(self):
        if self.projection_method is not None:
            proj = f"{self.projection_method}"
        else:
            proj = ""

        return f"Fusion method: {self.fusion_method}. Projection method: {proj}\nFused Models:\n{self.models}"

output_dim property

output_dim

Returns the output dimension of the model.

WideDeep

Bases: Module

Main collector class that combines the wide, deeptabular, deeptext and deepimage models.

Note that all models described so far in this library must be passed to the WideDeep class once constructed. This is because these models output their last layer of activations rather than predictions; the prediction layer is added by the WideDeep class as it collects the components for every data mode.

There are two options to combine these models that correspond to the two main architectures that pytorch-widedeep can build.

  • Directly connecting the output of the model components to the output neuron(s).

  • Adding a Fully-Connected Head (FC-Head) on top of the deep models. This FC-Head will combine the output from the deeptabular, deeptext and deepimage components and will then be connected to the output neuron(s).
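The FC-Head option amounts to simple dimension bookkeeping: the head's first layer receives the concatenated last-layer activations of the deep components. A hedged sketch of that arithmetic (the component dimensions below are hypothetical; the library's `_build_deephead` performs the equivalent sum):

```python
# Hypothetical last-layer sizes for the three deep components
component_output_dims = {"deeptabular": 32, "deeptext": 64, "deepimage": 128}

# The FC-Head input is the concatenation of all deep component outputs
deep_dim = sum(component_output_dims.values())

# With head_hidden_dims=[128, 64], the head MLP layer sizes become:
head_layer_sizes = [deep_dim] + [128, 64]
print(head_layer_sizes)  # [224, 128, 64]
```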

Parameters:

Name Type Description Default
wide Optional[Module]

Wide model. This is a linear model where the non-linearities are captured via crossed-columns.

None
deeptabular Optional[Union[BaseWDModelComponent, List[BaseWDModelComponent]]]

Currently this library implements a number of possible architectures for the deeptabular component. See the documentation of the package. Note that deeptabular can be a list of models. This is useful when using multiple tabular inputs (e.g. in the context of a two-tower model for recommendation systems)

None
deeptext Optional[Union[BaseWDModelComponent, List[BaseWDModelComponent]]]

Currently this library implements a number of possible architectures for the deeptext component. See the documentation of the package. Note that deeptext can be a list of models. This is useful when using multiple text inputs.

None
deepimage Optional[Union[BaseWDModelComponent, List[BaseWDModelComponent]]]

Currently this library uses torchvision and implements a number of possible architectures for the deepimage component. See the documentation of the package. Note that deepimage can be a list of models. This is useful when using multiple image inputs.

None
deephead Optional[BaseWDModelComponent]

Alternatively, the user can pass a custom model that will receive the output of the deep component. If deephead is not None, all the head_* parameters will be ignored

None
head_hidden_dims Optional[List[int]]

List with the sizes of the dense layers in the head e.g: [128, 64]

None
head_activation str

Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
head_dropout float

Dropout of the dense layers in the head

0.1
head_batchnorm bool

Boolean indicating whether or not to include batch normalization in the dense layers that form the head

False
head_batchnorm_last bool

Boolean indicating whether or not to apply batch normalization to the last of the dense layers in the head

False
head_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

True
enforce_positive bool

Boolean indicating if the output from the final layer must be positive. This is important if you are using loss functions with non-negative input restrictions, e.g. RMSLE, or if you know your predictions are bounded between 0 and inf

False
enforce_positive_activation str

Activation function to enforce that the final layer has a positive output. 'softplus' or 'relu' are supported.

'softplus'
pred_dim int

Size of the final wide and deep output layer containing the predictions. 1 for regression and binary classification or number of classes for multiclass classification.

1
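As an illustration of enforce_positive, softplus maps any real-valued output to a strictly positive one. A minimal sketch using plain `math` rather than the torch activation (numerically they agree for the default settings):

```python
import math

def softplus(x: float) -> float:
    # softplus(x) = ln(1 + e^x); strictly positive for all real x
    return math.log1p(math.exp(x))

# A raw model output of -3.0 becomes a small but positive prediction,
# safe for losses such as RMSLE
print(softplus(-3.0) > 0)  # True
```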

Examples:

>>> from pytorch_widedeep.models import TabResnet, Vision, BasicRNN, Wide, WideDeep
>>> embed_input = [(u, i, j) for u, i, j in zip(["a", "b", "c"], [4] * 3, [8] * 3)]
>>> column_idx = {k: v for v, k in enumerate(["a", "b", "c"])}
>>> wide = Wide(10, 1)
>>> deeptabular = TabResnet(blocks_dims=[8, 4], column_idx=column_idx, cat_embed_input=embed_input)
>>> deeptext = BasicRNN(vocab_size=10, embed_dim=4, padding_idx=0)
>>> deepimage = Vision()
>>> model = WideDeep(wide=wide, deeptabular=deeptabular, deeptext=deeptext, deepimage=deepimage)

ℹ️ NOTE: It is possible to use custom components to build Wide & Deep models. Simply build them and pass them as the corresponding parameters. Note that the custom models MUST return a last layer of activations (i.e. not the final prediction) so that these activations are collected by WideDeep and combined accordingly. In addition, the models MUST also contain an attribute output_dim with the size of these last layers of activations. See for example pytorch_widedeep.models.tab_mlp.TabMlp
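A minimal sketch of such a custom component (assumes torch is installed; the class name, feature count and layer sizes are illustrative, not part of the library):

```python
import torch
import torch.nn as nn

class MyTabularComponent(nn.Module):
    # A custom component must return last-layer activations (not predictions)
    # and expose their size via an `output_dim` attribute.
    def __init__(self, n_features: int = 8, out_dim: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, 32),
            nn.ReLU(),
            nn.Linear(32, out_dim),
        )
        self.output_dim = out_dim  # required by WideDeep

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return self.mlp(X)

component = MyTabularComponent()
activations = component(torch.randn(4, 8))
print(activations.shape)  # torch.Size([4, 16])
```

When such a component is passed as, e.g., deeptabular, WideDeep itself appends the final prediction layer on top of these activations.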

Source code in pytorch_widedeep/models/wide_deep.py
class WideDeep(nn.Module):
    r"""Main collector class that combines all `wide`, `deeptabular`
    `deeptext` and `deepimage` models.

    Note that all models described so far in this library must be passed to
    the `WideDeep` class once constructed. This is because the models output
    the last layer before the prediction layer. Such prediction layer is
    added by the `WideDeep` class as it collects the components for every
    data mode.

    There are two options to combine these models that correspond to the
    two main architectures that `pytorch-widedeep` can build.

    - Directly connecting the output of the model components to the output neuron(s).

    - Adding a `Fully-Connected Head` (FC-Head) on top of the deep models.
      This FC-Head will combine the output from the `deeptabular`, `deeptext`
      and `deepimage` components and will then be connected to the output
      neuron(s).

    Parameters
    ----------
    wide: nn.Module, Optional, default = None
        `Wide` model. This is a linear model where the non-linearities are
        captured via crossed-columns.
    deeptabular: BaseWDModelComponent | List[BaseWDModelComponent], Optional, default = None
        Currently this library implements a number of possible architectures
        for the `deeptabular` component. See the documentation of the
        package. Note that `deeptabular` can be a list of models. This is
        useful when using multiple tabular inputs (e.g. in the context of a
        two-tower model for recommendation systems)
    deeptext: BaseWDModelComponent | List[BaseWDModelComponent], Optional, default = None
        Currently this library implements a number of possible architectures
        for the `deeptext` component. See the documentation of the
        package. Note that `deeptext` can be a list of models. This is useful
        when using multiple text inputs.
    deepimage: BaseWDModelComponent | List[BaseWDModelComponent], Optional, default = None
        Currently this library uses `torchvision` and implements a number of
        possible architectures for the `deepimage` component. See the
        documentation of the package. Note that `deepimage` can be a list of
        models. This is useful when using multiple image inputs.
    deephead: BaseWDModelComponent, Optional, default = None
        Alternatively, the user can pass a custom model that will receive the
        output of the deep component. If `deephead` is not None, all the
        `head_*` parameters will be ignored
    head_hidden_dims: List, Optional, default = None
        List with the sizes of the dense layers in the head e.g: [128, 64]
    head_activation: str, default = "relu"
        Activation function for the dense layers in the head. Currently
        `'tanh'`, `'relu'`, `'leaky_relu'` and `'gelu'` are supported
    head_dropout: float, default = 0.1
        Dropout of the dense layers in the head
    head_batchnorm: bool, default = False
        Boolean indicating whether or not to include batch normalization in
        the dense layers that form the head
    head_batchnorm_last: bool, default = False
        Boolean indicating whether or not to apply batch normalization to the
        last of the dense layers in the head
    head_linear_first: bool, default = True
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`
    enforce_positive: bool, default = False
        Boolean indicating if the output from the final layer must be
        positive. This is important if you are using loss functions with
        non-negative input restrictions, e.g. RMSLE, or if you know your
        predictions are bounded between 0 and inf
    enforce_positive_activation: str, default = "softplus"
        Activation function to enforce that the final layer has a positive
        output. `'softplus'` or `'relu'` are supported.
    pred_dim: int, default = 1
        Size of the final wide and deep output layer containing the
        predictions. `1` for regression and binary classification or number
        of classes for multiclass classification.


    Examples
    --------

    >>> from pytorch_widedeep.models import TabResnet, Vision, BasicRNN, Wide, WideDeep
    >>> embed_input = [(u, i, j) for u, i, j in zip(["a", "b", "c"], [4] * 3, [8] * 3)]
    >>> column_idx = {k: v for v, k in enumerate(["a", "b", "c"])}
    >>> wide = Wide(10, 1)
    >>> deeptabular = TabResnet(blocks_dims=[8, 4], column_idx=column_idx, cat_embed_input=embed_input)
    >>> deeptext = BasicRNN(vocab_size=10, embed_dim=4, padding_idx=0)
    >>> deepimage = Vision()
    >>> model = WideDeep(wide=wide, deeptabular=deeptabular, deeptext=deeptext, deepimage=deepimage)


    :information_source: **NOTE**: It is possible to use custom components to
     build Wide & Deep models. Simply build them and pass them as the
     corresponding parameters. Note that the custom models MUST return a last
     layer of activations (i.e. not the final prediction) so that these
     activations are collected by `WideDeep` and combined accordingly. In
     addition, the models MUST also contain an attribute `output_dim` with
     the size of these last layers of activations. See for example
     `pytorch_widedeep.models.tab_mlp.TabMlp`
    """

    @alias(  # noqa: C901
        "pred_dim",
        ["num_class", "pred_size"],
    )
    def __init__(
        self,
        wide: Optional[nn.Module] = None,
        deeptabular: Optional[
            Union[BaseWDModelComponent, List[BaseWDModelComponent]]
        ] = None,
        deeptext: Optional[
            Union[BaseWDModelComponent, List[BaseWDModelComponent]]
        ] = None,
        deepimage: Optional[
            Union[BaseWDModelComponent, List[BaseWDModelComponent]]
        ] = None,
        deephead: Optional[BaseWDModelComponent] = None,
        head_hidden_dims: Optional[List[int]] = None,
        head_activation: str = "relu",
        head_dropout: float = 0.1,
        head_batchnorm: bool = False,
        head_batchnorm_last: bool = False,
        head_linear_first: bool = True,
        enforce_positive: bool = False,
        enforce_positive_activation: str = "softplus",
        pred_dim: int = 1,
    ):
        super(WideDeep, self).__init__()

        self._check_inputs(
            wide,
            deeptabular,
            deeptext,
            deepimage,
            deephead,
            head_hidden_dims,
            pred_dim,
        )

        # this attribute will be eventually over-written by the Trainer's
        # device. Acts here as a 'placeholder'.
        self.wd_device: Optional[str] = None

        # required as attribute just in case we pass a deephead
        self.pred_dim = pred_dim

        self.enforce_positive = enforce_positive

        # better to set this attribute already here
        if isinstance(deeptabular, list):
            self.is_tabnet = False
        else:
            self.is_tabnet = deeptabular.__class__.__name__ == "TabNet"

        # The main 5 components of the wide and deep assemble: wide,
        # deeptabular, deeptext, deepimage and deephead
        self.with_deephead = deephead is not None or head_hidden_dims is not None
        if deephead is None and head_hidden_dims is not None:
            self.deephead = self._build_deephead(
                deeptabular,
                deeptext,
                deepimage,
                head_hidden_dims,
                head_activation,
                head_dropout,
                head_batchnorm,
                head_batchnorm_last,
                head_linear_first,
            )
        elif deephead is not None:
            self.deephead = nn.Sequential(
                deephead, nn.Linear(deephead.output_dim, self.pred_dim)
            )
        else:
            # for consistency with other components we default to None
            self.deephead = None

        self.wide = wide
        self.deeptabular, self.deeptext, self.deepimage = self._set_model_components(
            deeptabular, deeptext, deepimage
        )

        if self.enforce_positive:
            self.enf_pos = get_activation_fn(enforce_positive_activation)

    def forward(
        self,
        X: Dict[str, Union[Tensor, List[Tensor]]],
        y: Optional[Tensor] = None,
    ) -> Union[Tensor, Tuple[Tensor, Tensor]]:

        wide_out = self._forward_wide(X)
        if self.with_deephead:
            deep = self._forward_deephead(X, wide_out)
        else:
            deep = self._forward_deep(X, wide_out)

        if self.enforce_positive:
            return self.enf_pos(deep)
        else:
            return deep

    def _build_deephead(
        self,
        deeptabular: Optional[Union[BaseWDModelComponent, List[BaseWDModelComponent]]],
        deeptext: Optional[Union[BaseWDModelComponent, List[BaseWDModelComponent]]],
        deepimage: Optional[Union[BaseWDModelComponent, List[BaseWDModelComponent]]],
        head_hidden_dims: List[int],
        head_activation: str,
        head_dropout: float,
        head_batchnorm: bool,
        head_batchnorm_last: bool,
        head_linear_first: bool,
    ) -> nn.Sequential:
        deep_dim = 0
        if deeptabular is not None:
            if isinstance(deeptabular, list):
                for dt in deeptabular:
                    deep_dim += dt.output_dim
            else:
                deep_dim += deeptabular.output_dim
        if deeptext is not None:
            if isinstance(deeptext, list):
                for dt in deeptext:
                    deep_dim += dt.output_dim
            else:
                deep_dim += deeptext.output_dim
        if deepimage is not None:
            if isinstance(deepimage, list):
                for di in deepimage:
                    deep_dim += di.output_dim
            else:
                deep_dim += deepimage.output_dim

        head_hidden_dims = [deep_dim] + head_hidden_dims
        deephead = nn.Sequential(
            MLP(
                head_hidden_dims,
                head_activation,
                head_dropout,
                head_batchnorm,
                head_batchnorm_last,
                head_linear_first,
            ),
            nn.Linear(head_hidden_dims[-1], self.pred_dim),
        )

        return deephead

    def _set_model_components(  # noqa: C901
        self,
        deeptabular: Optional[Union[BaseWDModelComponent, List[BaseWDModelComponent]]],
        deeptext: Optional[Union[BaseWDModelComponent, List[BaseWDModelComponent]]],
        deepimage: Optional[Union[BaseWDModelComponent, List[BaseWDModelComponent]]],
    ) -> Tuple[
        Optional[Union[nn.ModuleList, WDModel]],
        Optional[Union[nn.ModuleList, WDModel]],
        Optional[Union[nn.ModuleList, WDModel]],
    ]:
        if deeptabular is not None:
            deeptabular_ = self._set_model_component(deeptabular, is_deeptabular=True)
        else:
            deeptabular_ = None

        if deeptext is not None:
            deeptext_ = self._set_model_component(deeptext)
        else:
            deeptext_ = None

        if deepimage is not None:
            deepimage_ = self._set_model_component(deepimage)
        else:
            deepimage_ = None

        return deeptabular_, deeptext_, deepimage_

    def _forward_wide(self, X: Dict[str, Union[Tensor, List[Tensor]]]) -> Tensor:
        if self.wide is not None:
            out = self.wide(X["wide"])
        else:
            first_model_mode = list(X.keys())[0]
            if isinstance(X[first_model_mode], list):
                batch_size = X[first_model_mode][0].size(0)
            else:
                batch_size = X[first_model_mode].size(0)  # type: ignore[union-attr]
            out = torch.zeros(batch_size, self.pred_dim).to(self.wd_device)

        return out

    def _forward_deep(
        self, X: Dict[str, Union[Tensor, List[Tensor]]], wide_out: Tensor
    ) -> Union[Tensor, Tuple[Tensor, Tensor]]:
        if self.deeptabular is not None:
            if self.is_tabnet:
                tab_out, M_loss = self.deeptabular(X["deeptabular"])
                wide_out.add_(tab_out)
            else:
                wide_out = self._forward_component(
                    X, self.deeptabular, "deeptabular", wide_out
                )

        if self.deeptext is not None:
            wide_out = self._forward_component(X, self.deeptext, "deeptext", wide_out)

        if self.deepimage is not None:
            wide_out = self._forward_component(X, self.deepimage, "deepimage", wide_out)

        if self.is_tabnet:
            res: Union[Tensor, Tuple[Tensor, Tensor]] = (wide_out, M_loss)
        else:
            res = wide_out

        return res

    def _forward_deephead(
        self, X: Dict[str, Union[Tensor, List[Tensor]]], wide_out: Tensor
    ) -> Union[Tensor, Tuple[Tensor, Tensor]]:
        deepside = torch.FloatTensor().to(self.wd_device)

        if self.deeptabular is not None:
            if self.is_tabnet:
                deepside, M_loss = self.deeptabular(X["deeptabular"])
            else:
                deepside = self._forward_component_with_head(
                    X, self.deeptabular, "deeptabular", deepside
                )

        if self.deeptext is not None:
            deepside = self._forward_component_with_head(
                X, self.deeptext, "deeptext", deepside
            )

        if self.deepimage is not None:
            deepside = self._forward_component_with_head(
                X, self.deepimage, "deepimage", deepside
            )

        # assertion to avoid type issues
        assert self.deephead is not None
        deepside_out = self.deephead(deepside)

        if self.is_tabnet:
            res: Union[Tensor, Tuple[Tensor, Tensor]] = (
                wide_out.add_(deepside_out),
                M_loss,
            )
        else:
            res = wide_out.add_(deepside_out)

        return res

    def _forward_component(
        self,
        X: Dict[str, Union[Tensor, List[Tensor]]],
        component: Union[nn.ModuleList, WDModel],
        component_type: Literal["deeptabular", "deeptext", "deepimage"],
        wide_out: Tensor,
    ) -> Tensor:
        if isinstance(component, nn.ModuleList):
            component_out = torch.add(  # type: ignore[call-overload]
                *[cp(X[component_type][i]) for i, cp in enumerate(component)]
            )
        else:
            component_out = component(X[component_type])

        return wide_out.add_(component_out)

    def _forward_component_with_head(
        self,
        X: Dict[str, Union[Tensor, List[Tensor]]],
        component: Union[nn.ModuleList, WDModel],
        component_type: Literal["deeptabular", "deeptext", "deepimage"],
        deepside: Tensor,
    ) -> Tensor:
        if isinstance(component, nn.ModuleList):
            component_out = torch.cat(  # type: ignore[call-overload]
                [cp(X[component_type][i]) for i, cp in enumerate(component)], axis=1
            )
        else:
            component_out = component(X[component_type])

        return torch.cat([deepside, component_out], axis=1)  # type: ignore[call-overload]

    def _set_model_component(
        self,
        component: Union[BaseWDModelComponent, List[BaseWDModelComponent]],
        is_deeptabular: bool = False,
    ) -> Union[nn.ModuleList, WDModel]:
        if isinstance(component, list):
            component_: Optional[Union[nn.ModuleList, WDModel]] = nn.ModuleList()
            for cp in component:
                if self.with_deephead or cp.output_dim == 1:
                    component_.append(cp)
                else:
                    component_.append(
                        nn.Sequential(cp, nn.Linear(cp.output_dim, self.pred_dim))
                    )
        elif self.with_deephead or component.output_dim == 1:
            component_ = component
        elif is_deeptabular and self.is_tabnet:
            component_ = nn.Sequential(
                component, TabNetPredLayer(component.output_dim, self.pred_dim)
            )
        else:
            component_ = nn.Sequential(
                component, nn.Linear(component.output_dim, self.pred_dim)
            )

        return component_

    @staticmethod  # noqa: C901
    def _check_inputs(  # noqa: C901
        wide,
        deeptabular,
        deeptext,
        deepimage,
        deephead,
        head_hidden_dims,
        pred_dim,
    ):
        if wide is not None:
            assert wide.wide_linear.weight.size(1) == pred_dim, (
                "the 'pred_dim' of the wide component ({}) must be equal to the 'pred_dim' "
                "of the deep component and the overall model itself ({})".format(
                    wide.wide_linear.weight.size(1), pred_dim
                )
            )

        if deeptabular is not None:
            err_msg = (
                "deeptabular model must have an 'output_dim' attribute or property."
            )
            if isinstance(deeptabular, list):
                all_have_output_dim = all(
                    hasattr(dt, "output_dim") for dt in deeptabular
                )
                if not all_have_output_dim:
                    raise AttributeError(err_msg)
            else:
                if not hasattr(deeptabular, "output_dim"):
                    raise AttributeError(err_msg)
                # the following assertion is thought for those cases where we
                # use fusion with 'dot product' so that the output_dim will
                # be 1 and the pred_dim is not 1
                if deeptabular.output_dim == 1:
                    assert pred_dim == 1, "If 'output_dim' is 1, 'pred_dim' must be 1"

        if deeptabular is not None:
            is_tabnet = False
            if isinstance(deeptabular, list):
                is_any_tabnet = any(
                    dt.__class__.__name__ == "TabNet" for dt in deeptabular
                )
                if is_any_tabnet:
                    raise ValueError(
                        "Currently TabNet is not supported as a component of a multiple "
                        "tabular component model."
                    )
            else:
                is_tabnet = deeptabular.__class__.__name__ == "TabNet"
            has_wide_text_or_image = (
                wide is not None or deeptext is not None or deepimage is not None
            )
            if is_tabnet and has_wide_text_or_image:
                warnings.warn(
                    "'WideDeep' is a model comprised by multiple components and the 'deeptabular'"
                    " component is 'TabNet'. We recommend using 'TabNet' in isolation."
                    " The reasons are: i) 'TabNet' uses sparse regularization which partially loses"
                    " its purpose when used in combination with other components."
                    " If you still want to use a multiple component model with 'TabNet',"
                    " consider setting 'lambda_sparse' to 0 during training. ii) The feature"
                    " importances will be computed only for TabNet but the model will comprise multiple"
                    " components. Therefore, such importances will partially lose their 'meaning'.",
                    UserWarning,
                )

        if deeptext is not None:
            err_msg = "deeptext model must have an 'output_dim' attribute or property."
            if isinstance(deeptext, list):
                all_have_output_dim = all(hasattr(dt, "output_dim") for dt in deeptext)
                if not all_have_output_dim:
                    raise AttributeError(err_msg)
            else:
                if not hasattr(deeptext, "output_dim"):
                    raise AttributeError(err_msg)
                if deeptext.output_dim == 1:
                    assert pred_dim == 1, "If 'output_dim' is 1, 'pred_dim' must be 1"

        if deepimage is not None:
            err_msg = "deepimage model must have an 'output_dim' attribute or property."
            if isinstance(deepimage, list):
                all_have_output_dim = all(hasattr(di, "output_dim") for di in deepimage)
                if not all_have_output_dim:
                    raise AttributeError(err_msg)
            else:
                if not hasattr(deepimage, "output_dim"):
                    raise AttributeError(err_msg)
                if deepimage.output_dim == 1:
                    assert pred_dim == 1, "If 'output_dim' is 1, 'pred_dim' must be 1"

        if deephead is not None and head_hidden_dims is not None:
            raise ValueError(
                "both 'deephead' and 'head_hidden_dims' are not None. Use one or the other, but not both"
            )
        if (
            head_hidden_dims is not None
            and not deeptabular
            and not deeptext
            and not deepimage
        ):
            raise ValueError(
                "if 'head_hidden_dims' is not None, at least one deep component must be used"
            )

        if deephead is not None:
            if not hasattr(deephead, "output_dim"):
                raise AttributeError(
                    "As any other custom model passed to 'WideDeep', 'deephead' must have an "
                    "'output_dim' attribute or property. "
                )
            deephead_inp_feat = next(deephead.parameters()).size(1)
            output_dim = 0
            if deeptabular is not None:
                if isinstance(deeptabular, list):
                    for dt in deeptabular:
                        output_dim += dt.output_dim
                else:
                    output_dim += deeptabular.output_dim
            if deeptext is not None:
                if isinstance(deeptext, list):
                    for dt in deeptext:
                        output_dim += dt.output_dim
                else:
                    output_dim += deeptext.output_dim
            if deepimage is not None:
                if isinstance(deepimage, list):
                    for di in deepimage:
                        output_dim += di.output_dim
                else:
                    output_dim += deepimage.output_dim
            if deephead_inp_feat != output_dim:
                warnings.warn(
                    "A custom 'deephead' is used and it seems that the input features "
                    "do not match the output of the deep components",
                    UserWarning,
                )