The preprocessing module

This module contains the classes used to prepare the data before it is passed to the models. There is one Preprocessor per data mode or model component (wide, deeptabular, deepimage and deeptext), with one exception: for the deeptext component two preprocessors are available, one for the case when no Hugging Face model is used (TextPreprocessor) and another for the case when a Hugging Face model is used (HFPreprocessor).

WidePreprocessor

Bases: BasePreprocessor

Preprocessor to prepare the wide input dataset

This Preprocessor prepares the data for the wide, linear component. This linear model is implemented via an Embedding layer that is connected to the output neuron. WidePreprocessor numerically encodes all the unique values of all categorical columns wide_cols + crossed_cols. See the Example below.

Parameters:

Name Type Description Default
wide_cols List[str]

List of strings with the names of the columns that will be label encoded and passed through the wide component

required
crossed_cols Optional[List[Tuple[str, str]]]

List of Tuples with the name of the columns that will be 'crossed' and then label encoded. e.g. [('education', 'occupation'), ...]. For binary features, a cross-product transformation is 1 if and only if the constituent features are all 1, and 0 otherwise.

None
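The cross-product transformation for binary features described above is simply a logical AND of the constituent features. A minimal illustration (not library code):

```python
# Cross-product of two binary features: the crossed feature is 1
# only when both constituent features are 1 (logical AND)
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
crossed = [a & b for a, b in pairs]
print(crossed)  # [0, 0, 0, 1]
```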

Attributes:

Name Type Description
wide_crossed_cols List

List with the names of all columns that will be label encoded

encoding_dict Dict

Dictionary where the keys are the result of pasting colname + '_' + column value and the values are the corresponding mapped integer.

inverse_encoding_dict Dict

The inverse of encoding_dict, mapping the encoded integers back to the original colname + '_' + value strings.

wide_dim int

Dimension of the wide model (i.e. dim of the linear layer)

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import WidePreprocessor
>>> df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
>>> wide_cols = ['color']
>>> crossed_cols = [('color', 'size')]
>>> wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
>>> X_wide = wide_preprocessor.fit_transform(df)
>>> X_wide
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> wide_preprocessor.encoding_dict
{'color_r': 1, 'color_b': 2, 'color_g': 3, 'color_size_r-s': 4, 'color_size_b-n': 5, 'color_size_g-l': 6}
>>> wide_preprocessor.inverse_transform(X_wide)
  color color_size
0     r        r-s
1     b        b-n
2     g        g-l
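The encoding above can be reproduced with a short, self-contained sketch of the underlying idea (plain pandas, not the library code): crossed columns are built by joining the constituent values with '-', and every colname_value string is mapped to a unique integer starting at 1, reserving 0 for unseen categories.

```python
import pandas as pd

def make_wide_encoding(df, wide_cols, crossed_cols):
    # Build the crossed columns, e.g. ('color', 'size') -> 'color_size'
    df = df.copy()
    for cols in crossed_cols:
        df["_".join(cols)] = df[list(cols)].astype(str).agg("-".join, axis=1)
    all_cols = wide_cols + ["_".join(c) for c in crossed_cols]
    # One global mapping: every 'colname_value' string gets a unique
    # integer, starting at 1 so that 0 is reserved for unseen categories
    features = [f"{c}_{v}" for c in all_cols for v in df[c].unique()]
    return {f: i + 1 for i, f in enumerate(features)}

df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
enc = make_wide_encoding(df, ['color'], [('color', 'size')])
print(enc)
```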
Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
class WidePreprocessor(BasePreprocessor):
    r"""Preprocessor to prepare the wide input dataset

    This Preprocessor prepares the data for the wide, linear component.
    This linear model is implemented via an Embedding layer that is
    connected to the output neuron. `WidePreprocessor` numerically
    encodes all the unique values of all categorical columns `wide_cols +
    crossed_cols`. See the Example below.

    Parameters
    ----------
    wide_cols: List
        List of strings with the names of the columns that will be label
        encoded and passed through the `wide` component
    crossed_cols: List, default = None
        List of Tuples with the name of the columns that will be `'crossed'`
        and then label encoded. e.g. _[('education', 'occupation'), ...]_. For
        binary features, a cross-product transformation is 1 if and only if
        the constituent features are all 1, and 0 otherwise.

    Attributes
    ----------
    wide_crossed_cols: List
        List with the names of all columns that will be label encoded
    encoding_dict: Dict
        Dictionary where the keys are the result of pasting `colname + '_' +
        column value` and the values are the corresponding mapped integer.
    inverse_encoding_dict: Dict
        the inverse encoding dictionary
    wide_dim: int
        Dimension of the wide model (i.e. dim of the linear layer)

    Examples
    --------
    >>> import pandas as pd
    >>> from pytorch_widedeep.preprocessing import WidePreprocessor
    >>> df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
    >>> wide_cols = ['color']
    >>> crossed_cols = [('color', 'size')]
    >>> wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
    >>> X_wide = wide_preprocessor.fit_transform(df)
    >>> X_wide
    array([[1, 4],
           [2, 5],
           [3, 6]])
    >>> wide_preprocessor.encoding_dict
    {'color_r': 1, 'color_b': 2, 'color_g': 3, 'color_size_r-s': 4, 'color_size_b-n': 5, 'color_size_g-l': 6}
    >>> wide_preprocessor.inverse_transform(X_wide)
      color color_size
    0     r        r-s
    1     b        b-n
    2     g        g-l
    """

    def __init__(
        self, wide_cols: List[str], crossed_cols: Optional[List[Tuple[str, str]]] = None
    ):
        super(WidePreprocessor, self).__init__()

        self.wide_cols = wide_cols
        self.crossed_cols = crossed_cols

        self.is_fitted = False

    def fit(self, df: pd.DataFrame) -> "WidePreprocessor":
        r"""Fits the Preprocessor and creates required attributes

        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        WidePreprocessor
            `WidePreprocessor` fitted object
        """
        df_wide = self._prepare_wide(df)
        self.wide_crossed_cols = df_wide.columns.tolist()
        glob_feature_list = self._make_global_feature_list(
            df_wide[self.wide_crossed_cols]
        )
        # leave 0 for padding/"unseen" categories
        self.encoding_dict = {v: i + 1 for i, v in enumerate(glob_feature_list)}
        self.wide_dim = len(self.encoding_dict)
        self.inverse_encoding_dict = {k: v for v, k in self.encoding_dict.items()}
        self.inverse_encoding_dict[0] = "unseen"

        self.is_fitted = True

        return self

    def transform(self, df: pd.DataFrame) -> np.ndarray:
        r"""
        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        np.ndarray
            transformed input dataframe
        """
        check_is_fitted(self, attributes=["encoding_dict"])
        df_wide = self._prepare_wide(df)
        encoded = np.zeros([len(df_wide), len(self.wide_crossed_cols)])
        for col_i, col in enumerate(self.wide_crossed_cols):
            encoded[:, col_i] = df_wide[col].apply(
                lambda x: (
                    self.encoding_dict[col + "_" + str(x)]
                    if col + "_" + str(x) in self.encoding_dict
                    else 0
                )
            )
        return encoded.astype("int64")

    def transform_sample(self, df: pd.DataFrame) -> np.ndarray:
        return self.transform(df)[0]

    def inverse_transform(self, encoded: np.ndarray) -> pd.DataFrame:
        r"""Takes as input the output from the `transform` method and it will
        return the original values.

        Parameters
        ----------
        encoded: np.ndarray
            numpy array with the encoded values that are the output from the
            `transform` method

        Returns
        -------
        pd.DataFrame
            Pandas dataframe with the original values
        """
        decoded = pd.DataFrame(encoded, columns=self.wide_crossed_cols)

        if pd.__version__ >= "2.1.0":
            decoded = decoded.map(lambda x: self.inverse_encoding_dict[x])
        else:
            decoded = decoded.applymap(lambda x: self.inverse_encoding_dict[x])

        for col in decoded.columns:
            rm_str = "".join([col, "_"])
            decoded[col] = decoded[col].apply(lambda x: x.replace(rm_str, ""))
        return decoded

    def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
        """Combines `fit` and `transform`

        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        np.ndarray
            transformed input dataframe
        """
        return self.fit(df).transform(df)

    def _make_global_feature_list(self, df: pd.DataFrame) -> List:
        glob_feature_list = []
        for column in df.columns:
            glob_feature_list += self._make_column_feature_list(df[column])
        return glob_feature_list

    def _make_column_feature_list(self, s: pd.Series) -> List:
        return [s.name + "_" + str(x) for x in s.unique()]

    def _cross_cols(self, df: pd.DataFrame):
        df_cc = df.copy()
        crossed_colnames = []
        for cols in self.crossed_cols:
            for c in cols:
                df_cc[c] = df_cc[c].astype("str")
            colname = "_".join(cols)
            df_cc[colname] = df_cc[list(cols)].apply(lambda x: "-".join(x), axis=1)
            crossed_colnames.append(colname)
        return df_cc[crossed_colnames]

    def _prepare_wide(self, df: pd.DataFrame):
        if self.crossed_cols is not None:
            df_cc = self._cross_cols(df)
            return pd.concat([df[self.wide_cols], df_cc], axis=1)
        else:
            return df.copy()[self.wide_cols]

    def __repr__(self) -> str:
        list_of_params: List[str] = ["wide_cols={wide_cols}"]
        if self.crossed_cols is not None:
            list_of_params.append("crossed_cols={crossed_cols}")
        all_params = ", ".join(list_of_params)
        return f"WidePreprocessor({all_params.format(**self.__dict__)})"

fit

fit(df)

Fits the Preprocessor and creates required attributes

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe

required

Returns:

Type Description
WidePreprocessor

WidePreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def fit(self, df: pd.DataFrame) -> "WidePreprocessor":
    r"""Fits the Preprocessor and creates required attributes

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    WidePreprocessor
        `WidePreprocessor` fitted object
    """
    df_wide = self._prepare_wide(df)
    self.wide_crossed_cols = df_wide.columns.tolist()
    glob_feature_list = self._make_global_feature_list(
        df_wide[self.wide_crossed_cols]
    )
    # leave 0 for padding/"unseen" categories
    self.encoding_dict = {v: i + 1 for i, v in enumerate(glob_feature_list)}
    self.wide_dim = len(self.encoding_dict)
    self.inverse_encoding_dict = {k: v for v, k in self.encoding_dict.items()}
    self.inverse_encoding_dict[0] = "unseen"

    self.is_fitted = True

    return self

transform

transform(df)

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe

required

Returns:

Type Description
ndarray

transformed input dataframe

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def transform(self, df: pd.DataFrame) -> np.ndarray:
    r"""
    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        transformed input dataframe
    """
    check_is_fitted(self, attributes=["encoding_dict"])
    df_wide = self._prepare_wide(df)
    encoded = np.zeros([len(df_wide), len(self.wide_crossed_cols)])
    for col_i, col in enumerate(self.wide_crossed_cols):
        encoded[:, col_i] = df_wide[col].apply(
            lambda x: (
                self.encoding_dict[col + "_" + str(x)]
                if col + "_" + str(x) in self.encoding_dict
                else 0
            )
        )
    return encoded.astype("int64")
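As the source shows, any category not seen at fit time falls back to the reserved index 0. A minimal sketch of that lookup, using a hypothetical encoding dictionary rather than a fitted object:

```python
import numpy as np

# Hypothetical encoding produced at fit time; 0 is reserved for unseen values
encoding_dict = {'color_r': 1, 'color_b': 2, 'color_g': 3}

def encode_column(values, col):
    # dict.get with a default of 0 mirrors the if/else in transform above
    return np.array([encoding_dict.get(f"{col}_{v}", 0) for v in values])

print(encode_column(['r', 'y', 'g'], 'color'))  # [1 0 3]
```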

inverse_transform

inverse_transform(encoded)

Takes the output of the transform method and returns the original values.

Parameters:

Name Type Description Default
encoded ndarray

numpy array with the encoded values that are the output from the transform method

required

Returns:

Type Description
DataFrame

Pandas dataframe with the original values

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def inverse_transform(self, encoded: np.ndarray) -> pd.DataFrame:
    r"""Takes as input the output from the `transform` method and it will
    return the original values.

    Parameters
    ----------
    encoded: np.ndarray
        numpy array with the encoded values that are the output from the
        `transform` method

    Returns
    -------
    pd.DataFrame
        Pandas dataframe with the original values
    """
    decoded = pd.DataFrame(encoded, columns=self.wide_crossed_cols)

    if pd.__version__ >= "2.1.0":
        decoded = decoded.map(lambda x: self.inverse_encoding_dict[x])
    else:
        decoded = decoded.applymap(lambda x: self.inverse_encoding_dict[x])

    for col in decoded.columns:
        rm_str = "".join([col, "_"])
        decoded[col] = decoded[col].apply(lambda x: x.replace(rm_str, ""))
    return decoded
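One caveat in the snippet above: pd.__version__ >= "2.1.0" compares version strings lexicographically, which misorders releases with multi-digit components (e.g. '2.10.0' sorts before '2.2.0'). A more robust check, assuming the packaging package is available:

```python
from packaging.version import Version
import pandas as pd

# Lexicographic string comparison gets multi-digit components wrong...
assert ("2.10.0" >= "2.2.0") is False
# ...while Version objects compare numerically, component by component
assert Version("2.10.0") >= Version("2.2.0")

# Decide between DataFrame.map (pandas >= 2.1) and the older applymap
use_map = Version(pd.__version__) >= Version("2.1.0")
print(use_map)
```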

fit_transform

fit_transform(df)

Combines fit and transform

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe

required

Returns:

Type Description
ndarray

transformed input dataframe

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        transformed input dataframe
    """
    return self.fit(df).transform(df)

TabPreprocessor

Bases: BasePreprocessor

Preprocessor to prepare the deeptabular component input dataset

Parameters:

Name Type Description Default
cat_embed_cols Optional[Union[List[str], List[Tuple[str, int]]]]

List containing the name of the categorical columns that will be represented by embeddings (e.g. ['education', 'relationship', ...]) or a Tuple with the name and the embedding dimension (e.g.: [ ('education',32), ('relationship',16), ...]).
Param alias: embed_cols

None
continuous_cols Optional[List[str]]

List with the name of the continuous cols

None
quantization_setup Optional[Union[int, Dict[str, Union[int, List[float]]]]]

Continuous columns can be turned into categorical via pd.cut. If quantization_setup is an int, all continuous columns will be quantized using this value as the number of bins. Alternatively, one can pass a dictionary where the keys are the column names to quantize and the values are either integers indicating the number of bins or lists of scalars indicating the bin edges.
Param alias: cols_and_bins

None
cols_to_scale Optional[Union[List[str], str]]

List with the names of the columns that will be standardised via sklearn's StandardScaler. It can also be the string 'all', in which case all the continuous cols will be scaled.

None
auto_embed_dim bool

Boolean indicating whether the embedding dimensions will be automatically defined via rule of thumb. See embedding_rule below.

True
embedding_rule Literal[google, fastai_old, fastai_new]

If auto_embed_dim=True, this is the choice of embedding rule of thumb. Choices are:

  • fastai_new: \(min(600, round(1.6 \times n_{cat}^{0.56}))\)

  • fastai_old: \(min(50, (n_{cat}//{2})+1)\)

  • google: \(min(600, round(n_{cat}^{0.24}))\)

'fastai_new'
default_embed_dim int

Dimension for the embeddings if the embedding dimension is not provided in the cat_embed_cols parameter and auto_embed_dim is set to False.

16
with_attention bool

Boolean indicating whether the preprocessed data will be passed to an attention-based model (more precisely, a model where all embeddings must have the same dimension). If True, the param cat_embed_cols must be a list containing just the categorical column names: e.g. ['education', 'relationship', ...]. This is because they will all be encoded using embeddings of the same dim, which will be specified later when the model is defined.
Param alias: for_transformer, for_matrix_factorization, for_mf

False
with_cls_token bool

Boolean indicating if a '[CLS]' token will be added to the dataset when using attention-based models. The final hidden state corresponding to this token is used as the aggregated representation for classification and regression tasks. If not, the categorical and/or continuous embeddings will be concatenated before being passed to the final MLP (if present).

False
shared_embed bool

Boolean indicating if the embeddings will be "shared" when using attention-based models. The idea behind shared_embed is described in Appendix A of the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is being embedded at any given time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

False
verbose int

Verbosity level.

1
scale bool

ℹ️ note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
Bool indicating whether or not to scale/standardise continuous cols. It is important to emphasize that all the DL models for tabular data in the library also include the possibility of normalising the input continuous features via a BatchNorm or a LayerNorm.
Param alias: scale_cont_cols.

False
already_standard Optional[List[str]]

ℹ️ note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
List with the names of the continuous cols that do not need to be scaled/standardised.

None

Other Parameters:

Name Type Description
**kwargs

pd.cut and StandardScaler related args

Attributes:

Name Type Description
cat_cols List[str]

List containing the names of the categorical columns after processing.

cols_and_bins Optional[Dict[str, Union[int, List[float]]]]

Dictionary containing the quantization setup after processing if quantization is requested. This is derived from the quantization_setup parameter.

quant_args Dict

Dictionary containing arguments passed to pandas.cut() function for quantization.

scale_args Dict

Dictionary containing arguments passed to StandardScaler.

label_encoder LabelEncoder

Instance of LabelEncoder used to encode categorical variables. See pytorch_widedeep.utils.dense_utils.LabelEncoder.

cat_embed_input List

List of tuples with the column name, number of individual values for that column and, if with_attention is set to False, the corresponding embeddings dim, e.g. [('education', 16, 10), ('relationship', 6, 8), ...].

standardize_cols List

List of the columns that will be standardized.

scaler StandardScaler

An instance of sklearn.preprocessing.StandardScaler if standardization is requested.

column_idx Dict

Dictionary where keys are column names and values are column indexes. This is necessary to slice tensors.

quantizer Quantizer

An instance of Quantizer if quantization is requested.

is_fitted bool

Boolean indicating if the preprocessor has been fitted.

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from pytorch_widedeep.preprocessing import TabPreprocessor
>>> df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l'], 'age': [25, 40, 55]})
>>> cat_embed_cols = [('color',5), ('size',5)]
>>> cont_cols = ['age']
>>> deep_preprocessor = TabPreprocessor(cat_embed_cols=cat_embed_cols, continuous_cols=cont_cols)
>>> X_tab = deep_preprocessor.fit_transform(df)
>>> deep_preprocessor.cat_embed_cols
[('color', 5), ('size', 5)]
>>> deep_preprocessor.column_idx
{'color': 0, 'size': 1, 'age': 2}
>>> cont_df = pd.DataFrame({"col1": np.random.rand(10), "col2": np.random.rand(10) + 1})
>>> cont_cols = ["col1", "col2"]
>>> tab_preprocessor = TabPreprocessor(continuous_cols=cont_cols, quantization_setup=3)
>>> ft_cont_df = tab_preprocessor.fit_transform(cont_df)
>>> # or...
>>> quantization_setup = {'col1': [0., 0.4, 1.], 'col2': [1., 1.4, 2.]}
>>> tab_preprocessor2 = TabPreprocessor(continuous_cols=cont_cols, quantization_setup=quantization_setup)
>>> ft_cont_df2 = tab_preprocessor2.fit_transform(cont_df)
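The three embedding_rule heuristics listed above map a column's cardinality n_cat to an embedding dimension. A minimal sketch of the formulas (not the library's internal implementation):

```python
def embed_dim(n_cat: int, rule: str = "fastai_new") -> int:
    # Rules of thumb mapping categorical cardinality to embedding dimension
    if rule == "fastai_new":
        return min(600, round(1.6 * n_cat ** 0.56))
    if rule == "fastai_old":
        return min(50, (n_cat // 2) + 1)
    if rule == "google":
        return min(600, round(n_cat ** 0.24))
    raise ValueError(f"unknown rule: {rule}")

for n in (10, 100, 1000):
    print(n, embed_dim(n, "fastai_new"), embed_dim(n, "fastai_old"),
          embed_dim(n, "google"))
```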
Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py
class TabPreprocessor(BasePreprocessor):
    r"""Preprocessor to prepare the `deeptabular` component input dataset

    Parameters
    ----------
    cat_embed_cols: List, default = None
        List containing the name of the categorical columns that will be
        represented by embeddings (e.g. _['education', 'relationship', ...]_) or
        a Tuple with the name and the embedding dimension (e.g.: _[
        ('education',32), ('relationship',16), ...]_).<br/> Param alias: `embed_cols`
    continuous_cols: List, default = None
        List with the name of the continuous cols
    quantization_setup: int or Dict, default = None
        Continuous columns can be turned into categorical via `pd.cut`. If
        `quantization_setup` is an `int`, all continuous columns will be
        quantized using this value as the number of bins. Alternatively, a
        dictionary where the keys are the column names to quantize and the
        values are either integers indicating the number of bins or lists
        of scalars indicating the bin edges.<br/>
        Param alias: `cols_and_bins`
    cols_to_scale: List or str, default = None,
        List with the names of the columns that will be standarised via
        sklearn's `StandardScaler`. It can also be the string `'all'` in
        which case all the continuous cols will be scaled.
    auto_embed_dim: bool, default = True
        Boolean indicating whether the embedding dimensions will be
        automatically defined via rule of thumb. See `embedding_rule`
        below.
    embedding_rule: str, default = 'fastai_new'
        If `auto_embed_dim=True`, this is the choice of embedding rule of
        thumb. Choices are:

        - _fastai_new_: $min(600, round(1.6 \times n_{cat}^{0.56}))$

        - _fastai_old_: $min(50, (n_{cat}//{2})+1)$

        - _google_: $min(600, round(n_{cat}^{0.24}))$
    default_embed_dim: int, default=16
        Dimension for the embeddings if the embedding dimension is not
        provided in the `cat_embed_cols` parameter and `auto_embed_dim` is
        set to `False`.
    with_attention: bool, default = False
        Boolean indicating whether the preprocessed data will be passed to an
        attention-based model (more precisely a model where all embeddings
        must have the same dimensions). If `True`, the param `cat_embed_cols`
        must just be a list containing just the categorical column names:
        e.g.
        _['education', 'relationship', ...]_. This is because they will all be
         encoded using embeddings of the same dim, which will be specified
         later when the model is defined. <br/> Param alias:
         `for_transformer`, `for_matrix_factorization`, `for_mf`
    with_cls_token: bool, default = False
        Boolean indicating if a `'[CLS]'` token will be added to the dataset
        when using attention-based models. The final hidden state
        corresponding to this token is used as the aggregated representation
        for classification and regression tasks. If not, the categorical
        and/or continuous embeddings will be concatenated before being passed
        to the final MLP (if present).
    shared_embed: bool, default = False
        Boolean indicating if the embeddings will be "shared" when using
        attention-based models. The idea behind `shared_embed` is
        described in the Appendix A in the [TabTransformer paper](https://arxiv.org/abs/2012.06678):
        _'The goal of having column embedding is to enable the model to
        distinguish the classes in one column from those in the other
        columns'_. In other words, the idea is to let the model learn which
        column is embedded at the time. See: `pytorch_widedeep.models.transformers._layers.SharedEmbeddings`.
    verbose: int, default = 1
    scale: bool, default = False
        :information_source: **note**: this arg will be removed in upcoming
         releases. Please use `cols_to_scale` instead. <br/> Bool indicating
         whether or not to scale/standarise continuous cols. It is important
         to emphasize that all the DL models for tabular data in the library
         also include the possibility of normalising the input continuous
         features via a `BatchNorm` or a `LayerNorm`. <br/> Param alias:
         `scale_cont_cols`.
    already_standard: List, default = None
        :information_source: **note**: this arg will be removed in upcoming
         releases. Please use `cols_to_scale` instead. <br/> List with the
         name of the continuous cols that do not need to be
         scaled/standarised.

    Other Parameters
    ----------------
    **kwargs: dict
        `pd.cut` and `StandardScaler` related args

    Attributes
    ----------
    cat_cols: List[str]
        List containing the names of the categorical columns after processing.
    cols_and_bins: Optional[Dict[str, Union[int, List[float]]]]
        Dictionary containing the quantization setup after processing if
        quantization is requested. This is derived from the
        quantization_setup parameter.
    quant_args: Dict
        Dictionary containing arguments passed to pandas.cut() function for quantization.
    scale_args: Dict
        Dictionary containing arguments passed to StandardScaler.
    label_encoder: LabelEncoder
        Instance of LabelEncoder used to encode categorical variables.
        See `pytorch_widedeep.utils.dense_utils.LabelEncoder`.
    cat_embed_input: List
        List of tuples with the column name, number of individual values for
        that column and, if `with_attention` is set to `False`, the
        corresponding embeddings dim, e.g. _[('education', 16, 10),
        ('relationship', 6, 8), ...]_.
    standardize_cols: List
        List of the columns that will be standardized.
    scaler: StandardScaler
        An instance of `sklearn.preprocessing.StandardScaler` if standardization
        is requested.
    column_idx: Dict
        Dictionary where keys are column names and values are column indexes.
        This is necessary to slice tensors.
    quantizer: Quantizer
        An instance of `Quantizer` if quantization is requested.
    is_fitted: bool
        Boolean indicating if the preprocessor has been fitted.

    Examples
    --------
    >>> import pandas as pd
    >>> import numpy as np
    >>> from pytorch_widedeep.preprocessing import TabPreprocessor
    >>> df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l'], 'age': [25, 40, 55]})
    >>> cat_embed_cols = [('color',5), ('size',5)]
    >>> cont_cols = ['age']
    >>> deep_preprocessor = TabPreprocessor(cat_embed_cols=cat_embed_cols, continuous_cols=cont_cols)
    >>> X_tab = deep_preprocessor.fit_transform(df)
    >>> deep_preprocessor.cat_embed_cols
    [('color', 5), ('size', 5)]
    >>> deep_preprocessor.column_idx
    {'color': 0, 'size': 1, 'age': 2}
    >>> cont_df = pd.DataFrame({"col1": np.random.rand(10), "col2": np.random.rand(10) + 1})
    >>> cont_cols = ["col1", "col2"]
    >>> tab_preprocessor = TabPreprocessor(continuous_cols=cont_cols, quantization_setup=3)
    >>> ft_cont_df = tab_preprocessor.fit_transform(cont_df)
    >>> # or...
    >>> quantization_setup = {'col1': [0., 0.4, 1.], 'col2': [1., 1.4, 2.]}
    >>> tab_preprocessor2 = TabPreprocessor(continuous_cols=cont_cols, quantization_setup=quantization_setup)
    >>> ft_cont_df2 = tab_preprocessor2.fit_transform(cont_df)
    """

    @alias("with_attention", ["for_transformer", "for_matrix_factorization", "for_mf"])
    @alias("cat_embed_cols", ["embed_cols"])
    @alias("scale", ["scale_cont_cols"])
    @alias("quantization_setup", ["cols_and_bins"])
    def __init__(
        self,
        cat_embed_cols: Optional[Union[List[str], List[Tuple[str, int]]]] = None,
        continuous_cols: Optional[List[str]] = None,
        quantization_setup: Optional[
            Union[int, Dict[str, Union[int, List[float]]]]
        ] = None,
        cols_to_scale: Optional[Union[List[str], str]] = None,
        auto_embed_dim: bool = True,
        embedding_rule: Literal["google", "fastai_old", "fastai_new"] = "fastai_new",
        default_embed_dim: int = 16,
        with_attention: bool = False,
        with_cls_token: bool = False,
        shared_embed: bool = False,
        verbose: int = 1,
        *,
        scale: bool = False,
        already_standard: Optional[List[str]] = None,
        **kwargs,
    ):
        super(TabPreprocessor, self).__init__()

        self.continuous_cols = continuous_cols
        self.quantization_setup = quantization_setup
        self.cols_to_scale = cols_to_scale
        self.scale = scale
        self.already_standard = already_standard
        self.auto_embed_dim = auto_embed_dim
        self.embedding_rule = embedding_rule
        self.default_embed_dim = default_embed_dim
        self.with_attention = with_attention
        self.with_cls_token = with_cls_token
        self.shared_embed = shared_embed
        self.verbose = verbose

        self.quant_args = {
            k: v for k, v in kwargs.items() if k in pd.cut.__code__.co_varnames
        }
        self.scale_args = {
            k: v for k, v in kwargs.items() if k in StandardScaler().get_params()
        }

        self._check_inputs(cat_embed_cols)

        if with_cls_token:
            self.cat_embed_cols = (
                ["cls_token"] + cat_embed_cols  # type: ignore[operator]
                if cat_embed_cols is not None
                else ["cls_token"]
            )
        else:
            self.cat_embed_cols = cat_embed_cols  # type: ignore[assignment]

        self.is_fitted = False

    def fit(self, df: pd.DataFrame) -> BasePreprocessor:  # noqa: C901
        """Fits the Preprocessor and creates required attributes

        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        TabPreprocessor
            `TabPreprocessor` fitted object
        """

        df_adj = self._insert_cls_token(df) if self.with_cls_token else df.copy()

        self.column_idx: Dict[str, int] = {}

        # Categorical embeddings logic
        if self.cat_embed_cols is not None or self.quantization_setup is not None:
            self.cat_embed_input: List[Union[Tuple[str, int], Tuple[str, int, int]]] = (
                []
            )

        if self.cat_embed_cols is not None:
            df_cat, cat_embed_dim = self._prepare_categorical(df_adj)

            self.label_encoder = LabelEncoder(
                columns_to_encode=df_cat.columns.tolist(),
                shared_embed=self.shared_embed,
                with_attention=self.with_attention,
            )
            self.label_encoder.fit(df_cat)

            for k, v in self.label_encoder.encoding_dict.items():
                if self.with_attention:
                    self.cat_embed_input.append((k, len(v)))
                else:
                    self.cat_embed_input.append((k, len(v), cat_embed_dim[k]))

            self.column_idx.update({k: v for v, k in enumerate(df_cat.columns)})

        # Continuous columns logic
        if self.continuous_cols is not None:
            df_cont, cont_embed_dim = self._prepare_continuous(df_adj)

            # Standardization logic
            if self.standardize_cols is not None:
                self.scaler = StandardScaler(**self.scale_args).fit(
                    df_cont[self.standardize_cols].values
                )
            elif self.verbose:
                warnings.warn("Continuous columns will not be normalised")

            # Quantization logic
            if self.cols_and_bins is not None:
            # 'Quantizer.fit' is deferred: in the case where someone requests
            # both standardization and quantization for the same columns, the
            # Quantizer must run on the already-scaled data
                self.quantizer = Quantizer(self.cols_and_bins, **self.quant_args)

                if self.with_attention:
                    for col, n_cat, _ in cont_embed_dim:
                        self.cat_embed_input.append((col, n_cat))
                else:
                    self.cat_embed_input.extend(cont_embed_dim)

            self.column_idx.update(
                {k: v + len(self.column_idx) for v, k in enumerate(df_cont)}
            )

        self.is_fitted = True

        return self

    def transform(self, df: pd.DataFrame) -> np.ndarray:  # noqa: C901
        """Returns the processed `dataframe` as a np.ndarray

        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        np.ndarray
            transformed input dataframe
        """
        check_is_fitted(self, condition=self.is_fitted)

        df_adj = self._insert_cls_token(df) if self.with_cls_token else df.copy()

        if self.cat_embed_cols is not None:
            df_cat = df_adj[self.cat_cols]
            df_cat = self.label_encoder.transform(df_cat)
        if self.continuous_cols is not None:
            # copy so the scaling below does not write into a slice/view
            df_cont = df_adj[self.continuous_cols].copy()
            # Standardization logic
            if self.standardize_cols:
                df_cont[self.standardize_cols] = self.scaler.transform(
                    df_cont[self.standardize_cols].values
                )
            # Quantization logic
            if self.cols_and_bins is not None:
                # Adjustment so I don't have to override the method
                # in 'ChunkTabPreprocessor'
                if self.quantizer.is_fitted:
                    df_cont = self.quantizer.transform(df_cont)
                else:
                    df_cont = self.quantizer.fit_transform(df_cont)
        try:
            df_deep = pd.concat([df_cat, df_cont], axis=1)
        except NameError:
            try:
                df_deep = df_cat.copy()
            except NameError:
                df_deep = df_cont.copy()

        return df_deep.values

    def transform_sample(self, df: pd.DataFrame) -> np.ndarray:
        return self.transform(df).astype("float")[0]

    def inverse_transform(self, encoded: np.ndarray) -> pd.DataFrame:  # noqa: C901
        r"""Takes as input the output from the `transform` method and it will
        return the original values.

        Parameters
        ----------
        encoded: np.ndarray
            array with the output of the `transform` method

        Returns
        -------
        pd.DataFrame
            Pandas dataframe with the original values
        """
        decoded = pd.DataFrame(encoded, columns=list(self.column_idx.keys()))
        # embeddings back to original category
        if self.cat_embed_cols is not None:
            decoded = self.label_encoder.inverse_transform(decoded)
        if self.continuous_cols is not None:
            # quantized cols to the mid point
            if self.cols_and_bins is not None:
                if self.verbose:
                    print(
                        "Note that quantized cols will be turned into the mid point of "
                        "the corresponding bin"
                    )
                for k, v in self.quantizer.inversed_bins.items():
                    decoded[k] = decoded[k].map(v)
            # continuous_cols back to non-standardised
            try:
                decoded[self.standardize_cols] = self.scaler.inverse_transform(
                    decoded[self.standardize_cols]
                )
            except Exception:  # KeyError:
                pass

        if "cls_token" in decoded.columns:
            decoded.drop("cls_token", axis=1, inplace=True)

        return decoded

    def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
        """Combines `fit` and `transform`

        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        np.ndarray
            transformed input dataframe
        """
        return self.fit(df).transform(df)

    def _insert_cls_token(self, df: pd.DataFrame) -> pd.DataFrame:
        df_cls = df.copy()
        df_cls.insert(loc=0, column="cls_token", value="[CLS]")
        return df_cls

    def _prepare_categorical(
        self, df: pd.DataFrame
    ) -> Tuple[pd.DataFrame, Dict[str, int]]:
        if isinstance(self.cat_embed_cols[0], tuple):
            self.cat_cols: List[str] = [emb[0] for emb in self.cat_embed_cols]
            cat_embed_dim: Dict[str, int] = dict(self.cat_embed_cols)  # type: ignore
        else:
            self.cat_cols = self.cat_embed_cols  # type: ignore[assignment]
            if self.auto_embed_dim:
                assert isinstance(self.cat_embed_cols[0], str), (
                    "If 'auto_embed_dim' is 'True' and 'with_attention' is 'False', "
                    "'cat_embed_cols' must be a list of strings with the columns to "
                    "be encoded as embeddings."
                )
                n_cats: Dict[str, int] = {col: df[col].nunique() for col in self.cat_embed_cols}  # type: ignore[misc]
                cat_embed_dim = {
                    col: embed_sz_rule(n_cat, self.embedding_rule)  # type: ignore[misc]
                    for col, n_cat in n_cats.items()
                }
            else:
                cat_embed_dim = {
                    e: self.default_embed_dim for e in self.cat_embed_cols  # type: ignore[misc]
                }  # type: ignore
        return df[self.cat_cols], cat_embed_dim

    def _prepare_continuous(
        self, df: pd.DataFrame
    ) -> Tuple[pd.DataFrame, Optional[List[Tuple[str, int, int]]]]:
        # Standardization logic
        if self.cols_to_scale is not None:
            self.standardize_cols = (
                self.cols_to_scale
                if self.cols_to_scale != "all"
                else self.continuous_cols
            )
        elif self.scale:
            if self.already_standard is not None:
                self.standardize_cols = [
                    c for c in self.continuous_cols if c not in self.already_standard
                ]
            else:
                self.standardize_cols = self.continuous_cols
        else:
            self.standardize_cols = None

        # Quantization logic
        if self.quantization_setup is not None:
            # the quantized columns are then treated as categorical
            quant_cont_embed_input: Optional[List[Tuple[str, int, int]]] = []
            if isinstance(self.quantization_setup, int):
                self.cols_and_bins: Optional[Dict[str, Union[int, List[float]]]] = {}
                for col in self.continuous_cols:
                    self.cols_and_bins[col] = self.quantization_setup
                    quant_cont_embed_input.append(
                        (
                            col,
                            self.quantization_setup,
                            (
                                embed_sz_rule(
                                    self.quantization_setup + 1, self.embedding_rule  # type: ignore[arg-type]
                                )
                                if self.auto_embed_dim
                                else self.default_embed_dim
                            ),
                        )
                    )
            else:
                for col, val in self.quantization_setup.items():
                    if isinstance(val, int):
                        quant_cont_embed_input.append(
                            (
                                col,
                                val,
                                (
                                    embed_sz_rule(
                                        val + 1, self.embedding_rule  # type: ignore[arg-type]
                                    )
                                    if self.auto_embed_dim
                                    else self.default_embed_dim
                                ),
                            )
                        )
                    else:
                        quant_cont_embed_input.append(
                            (
                                col,
                                len(val) - 1,
                                (
                                    embed_sz_rule(len(val), self.embedding_rule)  # type: ignore[arg-type]
                                    if self.auto_embed_dim
                                    else self.default_embed_dim
                                ),
                            )
                        )

                self.cols_and_bins = self.quantization_setup.copy()
        else:
            self.cols_and_bins = None
            quant_cont_embed_input = None

        return df[self.continuous_cols], quant_cont_embed_input

    def _check_inputs(self, cat_embed_cols):  # noqa: C901
        if self.scale or self.already_standard is not None:
            warnings.warn(
                "'scale' and 'already_standard' will be deprecated in the next release. "
                "Please use 'cols_to_scale' instead",
                DeprecationWarning,
                stacklevel=2,
            )

        if self.scale:
            if self.already_standard is not None:
                standardize_cols = [
                    c for c in self.continuous_cols if c not in self.already_standard
                ]
            else:
                standardize_cols = self.continuous_cols
        elif self.cols_to_scale is not None:
            standardize_cols = self.cols_to_scale
        else:
            standardize_cols = None

        if standardize_cols is not None:
            if isinstance(self.quantization_setup, int):
                cols_to_quantize_and_standardize = [
                    c for c in standardize_cols if c in self.continuous_cols
                ]
            elif isinstance(self.quantization_setup, dict):
                cols_to_quantize_and_standardize = [
                    c for c in standardize_cols if c in self.quantization_setup
                ]
            else:
                cols_to_quantize_and_standardize = None
            if cols_to_quantize_and_standardize:
                warnings.warn(
                    f"the following columns: {cols_to_quantize_and_standardize} will first be scaled"
                    " using a StandardScaler and then quantized. Make sure this is what you really want"
                )

        if self.with_cls_token and not self.with_attention:
            warnings.warn(
                "If 'with_cls_token' is set to 'True', 'with_attention' will "
                "automatically be set to 'True'",
                UserWarning,
            )
            self.with_attention = True

        if (cat_embed_cols is None) and (self.continuous_cols is None):
            raise ValueError(
                "'cat_embed_cols' and 'continuous_cols' are 'None'. Please, define at least one of the two."
            )

        if (
            cat_embed_cols is not None
            and self.continuous_cols is not None
            and len(np.intersect1d(cat_embed_cols, self.continuous_cols)) > 0
        ):
            overlapping_cols = list(
                np.intersect1d(cat_embed_cols, self.continuous_cols)
            )
            raise ValueError(
                "Currently passing columns as both categorical and continuous is not supported."
                " Please, choose one or the other for the following columns: {}".format(
                    ", ".join(overlapping_cols)
                )
            )
        transformer_error_message = (
            "If with_attention is 'True' cat_embed_cols must be a list "
            "of strings with the columns to be encoded as embeddings."
        )
        if (
            self.with_attention
            and cat_embed_cols is not None
            and isinstance(cat_embed_cols[0], tuple)
        ):
            raise ValueError(transformer_error_message)

    def __repr__(self) -> str:  # noqa: C901
        list_of_params: List[str] = []
        if self.cat_embed_cols is not None:
            list_of_params.append("cat_embed_cols={cat_embed_cols}")
        if self.continuous_cols is not None:
            list_of_params.append("continuous_cols={continuous_cols}")
        if self.quantization_setup is not None:
            list_of_params.append("quantization_setup={quantization_setup}")
        if self.cols_to_scale is not None:
            list_of_params.append("cols_to_scale={cols_to_scale}")
        if not self.auto_embed_dim:
            list_of_params.append("auto_embed_dim={auto_embed_dim}")
        if self.embedding_rule != "fastai_new":
            list_of_params.append("embedding_rule='{embedding_rule}'")
        if self.default_embed_dim != 16:
            list_of_params.append("default_embed_dim={default_embed_dim}")
        if self.with_attention:
            list_of_params.append("with_attention={with_attention}")
        if self.with_cls_token:
            list_of_params.append("with_cls_token={with_cls_token}")
        if self.shared_embed:
            list_of_params.append("shared_embed={shared_embed}")
        if self.verbose != 1:
            list_of_params.append("verbose={verbose}")
        if self.scale:
            list_of_params.append("scale={scale}")
        if self.already_standard is not None:
            list_of_params.append("already_standard={already_standard}")
        if len(self.quant_args) > 0:
            list_of_params.append(
                ", ".join([f"{k}" + "=" + f"{v}" for k, v in self.quant_args.items()])
            )
        if len(self.scale_args) > 0:
            list_of_params.append(
                ", ".join([f"{k}" + "=" + f"{v}" for k, v in self.scale_args.items()])
            )
        all_params = ", ".join(list_of_params)
        return f"TabPreprocessor({all_params.format(**self.__dict__)})"

fit

fit(df)

Fits the Preprocessor and creates required attributes

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe

required

Returns:

Type Description
TabPreprocessor

TabPreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py

transform

transform(df)

Returns the processed dataframe as a np.ndarray

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe

required

Returns:

Type Description
ndarray

transformed input dataframe


inverse_transform

inverse_transform(encoded)

Takes the output of the transform method and returns the original values.

Parameters:

Name Type Description Default
encoded ndarray

array with the output of the transform method

required

Returns:

Type Description
DataFrame

Pandas dataframe with the original values

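The mid-point inversion applied to quantized columns can be illustrated in isolation. A minimal sketch using pandas only, with hypothetical bin edges (not produced by the library); each code maps back to the centre of its bin, and code 0 (out of the fitted range) maps to NaN:

```python
import numpy as np
import pandas as pd

# Illustration of the bin mid-point inversion: each code maps to the centre
# of its bin; code 0 (values outside the fitted range) maps to NaN.
# The edges below are hypothetical, not produced by the library.
edges = [0.0, 1.0, 2.0, 3.0]
midpoints = {
    k: (lo + hi) / 2.0
    for k, (lo, hi) in enumerate(zip(edges, edges[1:]), start=1)
}
midpoints[0] = np.nan
codes = pd.Series([1, 3, 0])
print(codes.map(midpoints).tolist())  # [0.5, 2.5, nan]
```

This is why `inverse_transform` warns that quantized columns cannot be recovered exactly: only the bin mid-point survives the round trip.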

fit_transform

fit_transform(df)

Combines fit and transform

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe

required

Returns:

Type Description
ndarray

transformed input dataframe


Quantizer

Helper class to perform the quantization of continuous columns. It is included in these docs for completeness since, depending on the value of the parameter 'quantization_setup' of the TabPreprocessor class, that class might have an attribute of type Quantizer. However, this class is designed to always run internally within the TabPreprocessor class.

Parameters:

Name Type Description Default
quantization_setup Dict[str, Union[int, List[float]]]

Dictionary where the keys are the column names to quantize and the values are either integers indicating the number of bins or lists of scalars indicating the bin edges.

required
Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py
class Quantizer:
    """Helper class to perform the quantization of continuous columns. It is
    included in these docs for completeness since, depending on the value of
    the parameter `'quantization_setup'` of the `TabPreprocessor` class, that
    class might have an attribute of type `Quantizer`. However, this class is
    designed to always run internally within the `TabPreprocessor` class.

    Parameters
    ----------
    quantization_setup: Dict, default = None
        Dictionary where the keys are the column names to quantize and the
        values are either integers indicating the number of bins or lists
        of scalars indicating the bin edges.
    """

    def __init__(
        self,
        quantization_setup: Dict[str, Union[int, List[float]]],
        **kwargs,
    ):
        self.quantization_setup = quantization_setup
        self.quant_args = kwargs

        self.is_fitted = False

    def fit(self, df: pd.DataFrame) -> "Quantizer":
        self.bins: Dict[str, List[float]] = {}
        for col, bins in self.quantization_setup.items():
            _, self.bins[col] = pd.cut(
                df[col], bins, retbins=True, labels=False, **self.quant_args
            )

        self.inversed_bins: Dict[str, Dict[int, float]] = {}
        for col, bins in self.bins.items():
            self.inversed_bins[col] = {
                k: v
                for k, v in list(
                    zip(
                        range(1, len(bins) + 1),
                        [(a + b) / 2.0 for a, b in zip(bins, bins[1:])],
                    )
                )
            }
            # 0
            self.inversed_bins[col][0] = np.nan

        self.is_fitted = True

        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        check_is_fitted(self, condition=self.is_fitted)

        dfc = df.copy()
        for col, bins in self.bins.items():
            dfc[col] = pd.cut(dfc[col], bins, labels=False, **self.quant_args)
            # 0 will be left for numbers outside the bins, i.e. smaller than
            # the smaller boundary or larger than the largest boundary
            dfc[col] = dfc[col] + 1
            dfc[col] = dfc[col].fillna(0)
            dfc[col] = dfc[col].astype(int)

        return dfc

    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return self.fit(df).transform(df)

    def __repr__(self) -> str:
        return f"Quantizer(quantization_setup={self.quantization_setup})"

TextPreprocessor

Bases: BasePreprocessor

Preprocessor to prepare the deeptext input dataset

Parameters:

Name Type Description Default
text_col str

column in the input dataframe containing the texts

required
max_vocab int

Maximum number of tokens in the vocabulary

30000
min_freq int

Minimum frequency for a token to be part of the vocabulary

5
maxlen int

Maximum length of the tokenized sequences

80
pad_first bool

Indicates whether the padding index will be added at the beginning or the end of the sequences

True
pad_idx int

padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token.

1
already_processed Optional[bool]

Boolean indicating if the sequence of elements is already processed or prepared. If this is the case, this Preprocessor will simply tokenize and pad the sequence.

Param aliases: `not_text`.

This parameter is intended for cases where the input sequences are already fully processed or are not text at all (e.g. IDs)

False
word_vectors_path Optional[str]

Path to the pretrained word vectors

None
n_cpus Optional[int]

number of CPUs to use during the tokenization process

None
verbose int

Enable verbose output.

1

Attributes:

Name Type Description
vocab Vocab

an instance of pytorch_widedeep.utils.fastai_transforms.Vocab

embedding_matrix ndarray

Array with the pretrained embeddings

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import TextPreprocessor
>>> df_train = pd.DataFrame({'text_column': ["life is like a box of chocolates",
... "You never know what you're gonna get"]})
>>> text_preprocessor = TextPreprocessor(text_col='text_column', max_vocab=25, min_freq=1, maxlen=10)
>>> text_preprocessor.fit_transform(df_train)
The vocabulary contains 24 tokens
array([[ 1,  1,  1,  1, 10, 11, 12, 13, 14, 15],
       [ 5,  9, 16, 17, 18,  9, 19, 20, 21, 22]], dtype=int32)
>>> df_te = pd.DataFrame({'text_column': ['you never know what is in the box']})
>>> text_preprocessor.transform(df_te)
array([[ 1,  1,  9, 16, 17, 18, 11,  0,  0, 13]], dtype=int32)
Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
class TextPreprocessor(BasePreprocessor):
    r"""Preprocessor to prepare the ``deeptext`` input dataset

    Parameters
    ----------
    text_col: str
        column in the input dataframe containing the texts
    max_vocab: int, default=30000
        Maximum number of tokens in the vocabulary
    min_freq: int, default=5
        Minimum frequency for a token to be part of the vocabulary
    maxlen: int, default=80
        Maximum length of the tokenized sequences
    pad_first: bool,  default = True
        Indicates whether the padding index will be added at the beginning or the
        end of the sequences
    pad_idx: int, default = 1
        padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token.
    already_processed: bool, Optional, default = False
        Boolean indicating if the sequence of elements is already processed or
        prepared. If this is the case, this Preprocessor will simply tokenize
        and pad the sequence. <br/>

            Param aliases: `not_text`. <br/>

        This parameter is intended for cases where the input sequences
        are already fully processed or are not text at all (e.g. IDs)
    word_vectors_path: str, Optional
        Path to the pretrained word vectors
    n_cpus: int, Optional, default = None
        number of CPUs to use during the tokenization process
    verbose: int, default 1
        Enable verbose output.

    Attributes
    ----------
    vocab: Vocab
        an instance of `pytorch_widedeep.utils.fastai_transforms.Vocab`
    embedding_matrix: np.ndarray
        Array with the pretrained embeddings

    Examples
    ---------
    >>> import pandas as pd
    >>> from pytorch_widedeep.preprocessing import TextPreprocessor
    >>> df_train = pd.DataFrame({'text_column': ["life is like a box of chocolates",
    ... "You never know what you're gonna get"]})
    >>> text_preprocessor = TextPreprocessor(text_col='text_column', max_vocab=25, min_freq=1, maxlen=10)
    >>> text_preprocessor.fit_transform(df_train)
    The vocabulary contains 24 tokens
    array([[ 1,  1,  1,  1, 10, 11, 12, 13, 14, 15],
           [ 5,  9, 16, 17, 18,  9, 19, 20, 21, 22]], dtype=int32)
    >>> df_te = pd.DataFrame({'text_column': ['you never know what is in the box']})
    >>> text_preprocessor.transform(df_te)
    array([[ 1,  1,  9, 16, 17, 18, 11,  0,  0, 13]], dtype=int32)
    """

    @alias("already_processed", ["not_text"])
    def __init__(
        self,
        text_col: str,
        max_vocab: int = 30000,
        min_freq: int = 5,
        maxlen: int = 80,
        pad_first: bool = True,
        pad_idx: int = 1,
        already_processed: Optional[bool] = False,
        word_vectors_path: Optional[str] = None,
        n_cpus: Optional[int] = None,
        verbose: int = 1,
    ):
        super(TextPreprocessor, self).__init__()

        self.text_col = text_col
        self.max_vocab = max_vocab
        self.min_freq = min_freq
        self.maxlen = maxlen
        self.pad_first = pad_first
        self.pad_idx = pad_idx
        self.already_processed = already_processed
        self.word_vectors_path = word_vectors_path
        self.verbose = verbose
        self.n_cpus = n_cpus if n_cpus is not None else os.cpu_count()

        self.is_fitted = False

    def fit(self, df: pd.DataFrame) -> BasePreprocessor:
        """Builds the vocabulary

        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        TextPreprocessor
            `TextPreprocessor` fitted object
        """
        texts = self._read_texts(df)

        tokens = get_texts(texts, self.already_processed, self.n_cpus)

        self.vocab: TVocab = Vocab(
            max_vocab=self.max_vocab,
            min_freq=self.min_freq,
            pad_idx=self.pad_idx,
        ).fit(
            tokens,
        )

        if self.verbose:
            print("The vocabulary contains {} tokens".format(len(self.vocab.stoi)))
        if self.word_vectors_path is not None:
            self.embedding_matrix = build_embeddings_matrix(
                self.vocab, self.word_vectors_path, self.min_freq
            )

        self.is_fitted = True

        return self

    def transform(self, df: pd.DataFrame) -> np.ndarray:
        """Returns the padded, _'numericalised'_ sequences

        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        np.ndarray
            Padded, _'numericalised'_ sequences
        """
        check_is_fitted(self, attributes=["vocab"])
        texts = self._read_texts(df)
        tokens = get_texts(texts, self.already_processed, self.n_cpus)
        return self._pad_sequences(tokens)

    def transform_sample(self, text: str) -> np.ndarray:
        """Returns the padded, _'numericalised'_ sequence

        Parameters
        ----------
        text: str
            text to be tokenized and padded

        Returns
        -------
        np.ndarray
            Padded, _'numericalised'_ sequence
        """
        check_is_fitted(self, attributes=["vocab"])
        tokens = get_texts([text], self.already_processed, self.n_cpus)
        return self._pad_sequences(tokens)[0]

    def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
        """Combines `fit` and `transform`

        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        np.ndarray
            Padded, _'numericalised'_ sequences
        """
        return self.fit(df).transform(df)

    def inverse_transform(self, padded_seq: np.ndarray) -> pd.DataFrame:
        """Returns the original text plus the added 'special' tokens

        Parameters
        ----------
        padded_seq: np.ndarray
            array with the output of the `transform` method

        Returns
        -------
        pd.DataFrame
            Pandas dataframe with the original text plus the added 'special' tokens
        """
        texts = [self.vocab.inverse_transform(num) for num in padded_seq]
        return pd.DataFrame({self.text_col: texts})

    def _pad_sequences(self, tokens: List[List[str]]) -> np.ndarray:
        sequences = [self.vocab.transform(t) for t in tokens]
        padded_seq = np.array(
            [
                pad_sequences(
                    s,
                    maxlen=self.maxlen,
                    pad_first=self.pad_first,
                    pad_idx=self.pad_idx,
                )
                for s in sequences
            ]
        )
        return padded_seq

    def _read_texts(
        self, df: pd.DataFrame, root_dir: Optional[str] = None
    ) -> List[str]:
        if root_dir is not None:
            if not os.path.exists(root_dir):
                raise ValueError(
                    "root_dir does not exist. Please create it before fitting the preprocessor"
                )
            texts_fnames = df[self.text_col].tolist()
            texts: List[str] = []
            for texts_fname in texts_fnames:
                with open(os.path.join(root_dir, texts_fname), "r") as f:
                    texts.append(f.read().replace("\n", ""))
        else:
            texts = df[self.text_col].tolist()

        return texts

    def _load_vocab(self, vocab: TVocab) -> None:
        self.vocab = vocab

    def __repr__(self) -> str:
        list_of_params: List[str] = ["text_col={text_col}"]
        list_of_params.append("max_vocab={max_vocab}")
        list_of_params.append("min_freq={min_freq}")
        list_of_params.append("maxlen={maxlen}")
        list_of_params.append("pad_first={pad_first}")
        list_of_params.append("pad_idx={pad_idx}")
        list_of_params.append("already_processed={already_processed}")
        if self.word_vectors_path is not None:
            list_of_params.append("word_vectors_path={word_vectors_path}")
        if self.n_cpus is not None:
            list_of_params.append("n_cpus={n_cpus}")
        if self.verbose is not None:
            list_of_params.append("verbose={verbose}")
        all_params = ", ".join(list_of_params)
        return f"TextPreprocessor({all_params.format(**self.__dict__)})"
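The `_pad_sequences` helper above pads or truncates every numericalised sequence to `maxlen`. A minimal plain-Python sketch of the pad-first behaviour (illustrative only: `pad` below is not the library's `pad_sequences` utility, and the exact truncation rule is an assumption):

```python
def pad(seq, maxlen, pad_first=True, pad_idx=1):
    """Pad (or truncate) a list of token ids to a fixed length."""
    if len(seq) >= maxlen:
        # assumed truncation rule: keep the last tokens when padding first
        return seq[-maxlen:] if pad_first else seq[:maxlen]
    padding = [pad_idx] * (maxlen - len(seq))
    return padding + seq if pad_first else seq + padding

print(pad([10, 11, 12], 5))                   # [1, 1, 10, 11, 12]
print(pad([10, 11, 12], 5, pad_first=False))  # [10, 11, 12, 1, 1]
```

This matches the shape of the arrays in the doctest above, where short sequences start with runs of the padding index 1.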

fit

fit(df)

Builds the vocabulary

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe

required

Returns:

Type Description
TextPreprocessor

TextPreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def fit(self, df: pd.DataFrame) -> BasePreprocessor:
    """Builds the vocabulary

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    TextPreprocessor
        `TextPreprocessor` fitted object
    """
    texts = self._read_texts(df)

    tokens = get_texts(texts, self.already_processed, self.n_cpus)

    self.vocab: TVocab = Vocab(
        max_vocab=self.max_vocab,
        min_freq=self.min_freq,
        pad_idx=self.pad_idx,
    ).fit(
        tokens,
    )

    if self.verbose:
        print("The vocabulary contains {} tokens".format(len(self.vocab.stoi)))
    if self.word_vectors_path is not None:
        self.embedding_matrix = build_embeddings_matrix(
            self.vocab, self.word_vectors_path, self.min_freq
        )

    self.is_fitted = True

    return self
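`fit` builds a vocabulary that keeps at most `max_vocab` tokens occurring at least `min_freq` times. A simplified sketch of that filtering with `collections.Counter` (not the fastai `Vocab` implementation, which reserves several special tokens; the two special-token names below are assumptions):

```python
from collections import Counter

def build_vocab(tokenized_texts, max_vocab=30000, min_freq=5):
    """Toy vocabulary: assumed special tokens first, then frequent tokens."""
    freq = Counter(tok for text in tokenized_texts for tok in text)
    kept = [t for t, c in freq.most_common(max_vocab) if c >= min_freq]
    itos = ["xxunk", "xxpad"] + kept          # index 0: unknown, index 1: padding
    stoi = {t: i for i, t in enumerate(itos)}
    return stoi, itos

stoi, itos = build_vocab([["a", "b", "a"], ["a", "c"]], max_vocab=10, min_freq=2)
print(itos)  # ['xxunk', 'xxpad', 'a']
```

Here "b" and "c" are dropped because they appear only once, below `min_freq=2`.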

transform

transform(df)

Returns the padded, 'numericalised' sequences

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe

required

Returns:

Type Description
ndarray

Padded, 'numericalised' sequences

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def transform(self, df: pd.DataFrame) -> np.ndarray:
    """Returns the padded, _'numericalised'_ sequences

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        Padded, _'numericalised'_ sequences
    """
    check_is_fitted(self, attributes=["vocab"])
    texts = self._read_texts(df)
    tokens = get_texts(texts, self.already_processed, self.n_cpus)
    return self._pad_sequences(tokens)

transform_sample

transform_sample(text)

Returns the padded, 'numericalised' sequence

Parameters:

Name Type Description Default
text str

text to be tokenized and padded

required

Returns:

Type Description
ndarray

Padded, 'numericalised' sequence

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def transform_sample(self, text: str) -> np.ndarray:
    """Returns the padded, _'numericalised'_ sequence

    Parameters
    ----------
    text: str
        text to be tokenized and padded

    Returns
    -------
    np.ndarray
        Padded, _'numericalised'_ sequence
    """
    check_is_fitted(self, attributes=["vocab"])
    tokens = get_texts([text], self.already_processed, self.n_cpus)
    return self._pad_sequences(tokens)[0]

fit_transform

fit_transform(df)

Combines fit and transform

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe

required

Returns:

Type Description
ndarray

Padded, 'numericalised' sequences

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        Padded, _'numericalised'_ sequences
    """
    return self.fit(df).transform(df)

inverse_transform

inverse_transform(padded_seq)

Returns the original text plus the added 'special' tokens

Parameters:

Name Type Description Default
padded_seq ndarray

array with the output of the transform method

required

Returns:

Type Description
DataFrame

Pandas dataframe with the original text plus the added 'special' tokens

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def inverse_transform(self, padded_seq: np.ndarray) -> pd.DataFrame:
    """Returns the original text plus the added 'special' tokens

    Parameters
    ----------
    padded_seq: np.ndarray
        array with the output of the `transform` method

    Returns
    -------
    pd.DataFrame
        Pandas dataframe with the original text plus the added 'special' tokens
    """
    texts = [self.vocab.inverse_transform(num) for num in padded_seq]
    return pd.DataFrame({self.text_col: texts})
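`inverse_transform` maps each integer back through the vocabulary, so a `transform`/`inverse_transform` round trip yields the tokenized text with out-of-vocabulary words replaced by the 'unknown' token. With a toy vocabulary (names illustrative, not the library's `Vocab`):

```python
itos = ["xxunk", "xxpad", "box", "of", "chocolates"]  # index -> token
stoi = {t: i for i, t in enumerate(itos)}             # token -> index

def numericalise(tokens):
    # unknown tokens map to index 0 ("xxunk")
    return [stoi.get(t, 0) for t in tokens]

def inverse(ids):
    return [itos[i] for i in ids]

print(inverse(numericalise(["box", "of", "candy"])))  # ['box', 'of', 'xxunk']
```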

HFPreprocessor

Bases: BasePreprocessor

Text processor to prepare the deeptext input dataset that is a wrapper around HuggingFace's tokenizers.

Following the main philosophy of the pytorch-widedeep library, this class is designed to be as flexible as possible. Therefore, it is coded so that the user can use it as one would any HuggingFace tokenizer, or following the API call 'protocol' of the rest of the library.

Parameters:

Name Type Description Default
model_name str

The model name from the transformers library e.g. 'bert-base-uncased'. Currently supported models are those from the families: BERT, RoBERTa, DistilBERT, ALBERT and ELECTRA.

required
use_fast_tokenizer bool

Whether to use the fast tokenizer from HuggingFace or not

False
text_col Optional[str]

The column in the input dataframe containing the text data. If this tokenizer is used via the fit and transform methods, this argument is mandatory. If the tokenizer is used via the encode method, this argument is not needed since the input text is passed directly to the encode method.

None
num_workers Optional[int]

Number of workers to use when preprocessing the text data. If not None, and use_fast_tokenizer is False, the text data will be preprocessed in parallel using the number of workers specified. If use_fast_tokenizer is True, this argument is ignored.

None
preprocessing_rules Optional[List[Callable[[str], str]]]

A list of functions to be applied to the text data before encoding. This can be useful to clean the text data before encoding. For example, removing html tags, special characters, etc.

None
tokenizer_params Optional[Dict[str, Any]]

Additional parameters to be passed to the HuggingFace's PreTrainedTokenizer. Parameters to the PreTrainedTokenizer can also be passed via the **kwargs argument

None
encode_params Optional[Dict[str, Any]]

Additional parameters to be passed to the batch_encode_plus method of the HuggingFace's PreTrainedTokenizer. If the fit and transform methods are used, the encode_params dict parameter is mandatory. If the encode method is used, this parameter is not needed since the input text is passed directly to the encode method.

None
**kwargs

Additional kwargs to be passed to the model, in particular to the PreTrainedTokenizer class.

{}

Attributes:

Name Type Description
is_fitted bool

Boolean indicating if the preprocessor has been fitted. This is a HuggingFace tokenizer, so it is always considered fitted and this attribute is set to True internally. The attribute exists for consistency with the rest of the library and because it is needed for some of the library's functionality.

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import HFPreprocessor
>>> df = pd.DataFrame({"text": ["this is the first text", "this is the second text"]})
>>> hf_processor_1 = HFPreprocessor(model_name="bert-base-uncased", text_col="text")
>>> X_text_1 = hf_processor_1.fit_transform(df)
>>> texts = ["this is a new text", "this is another text"]
>>> hf_processor_2 = HFPreprocessor(model_name="bert-base-uncased")
>>> X_text_2 = hf_processor_2.encode(texts, max_length=10, padding="max_length", truncation=True)
Source code in pytorch_widedeep/preprocessing/hf_preprocessor.py
class HFPreprocessor(BasePreprocessor):
    """Text processor to prepare the ``deeptext`` input dataset that is a
    wrapper around HuggingFace's tokenizers.

    Following the main philosophy of the `pytorch-widedeep` library, this
    class is designed to be as flexible as possible. Therefore, it is coded
    so that the user can use it as one would any HuggingFace tokenizer,
    or following the API call 'protocol' of the rest of the library.

    Parameters
    ----------
    model_name: str
        The model name from the transformers library e.g. _'bert-base-uncased'_.
        Currently supported models are those from the families: BERT, RoBERTa,
        DistilBERT, ALBERT and ELECTRA.
    use_fast_tokenizer: bool, default = False
        Whether to use the fast tokenizer from HuggingFace or not
    text_col: Optional[str], default = None
        The column in the input dataframe containing the text data. If this
        tokenizer is used via the `fit` and `transform` methods, this
        argument is mandatory. If the tokenizer is used via the `encode`
        method, this argument is not needed since the input text is passed
        directly to the `encode` method.
    num_workers: Optional[int], default = None
        Number of workers to use when preprocessing the text data. If not
        None, and `use_fast_tokenizer` is False, the text data will be
        preprocessed in parallel using the number of workers specified. If
        `use_fast_tokenizer` is True, this argument is ignored.
    preprocessing_rules: Optional[List[Callable[[str], str]]], default = None
        A list of functions to be applied to the text data before encoding.
        This can be useful to clean the text data before encoding. For
        example, removing html tags, special characters, etc.
    tokenizer_params: Optional[Dict[str, Any]], default = None
        Additional parameters to be passed to the HuggingFace's
        `PreTrainedTokenizer`. Parameters to the `PreTrainedTokenizer`
        can also be passed via the `**kwargs` argument
    encode_params: Optional[Dict[str, Any]], default = None
        Additional parameters to be passed to the `batch_encode_plus` method
        of the HuggingFace's `PreTrainedTokenizer`. If the `fit` and `transform`
        methods are used, the `encode_params` dict parameter is mandatory. If
        the `encode` method is used, this parameter is not needed since the
        input text is passed directly to the `encode` method.
    **kwargs
        Additional kwargs to be passed to the model, in particular to the
        `PreTrainedTokenizer` class.

    Attributes
    ----------
    is_fitted: bool
        Boolean indicating if the preprocessor has been fitted. This is a
        HuggingFace tokenizer, so it is always considered fitted and this
        attribute is set to True internally. The attribute exists for
        consistency with the rest of the library and because it is needed
        for some of the library's functionality.

    Examples
    --------
    >>> import pandas as pd
    >>> from pytorch_widedeep.preprocessing import HFPreprocessor
    >>> df = pd.DataFrame({"text": ["this is the first text", "this is the second text"]})
    >>> hf_processor_1 = HFPreprocessor(model_name="bert-base-uncased", text_col="text")
    >>> X_text_1 = hf_processor_1.fit_transform(df)
    >>> texts = ["this is a new text", "this is another text"]
    >>> hf_processor_2 = HFPreprocessor(model_name="bert-base-uncased")
    >>> X_text_2 = hf_processor_2.encode(texts, max_length=10, padding="max_length", truncation=True)
    """

    def __init__(
        self,
        model_name: str,
        *,
        use_fast_tokenizer: bool = False,
        text_col: Optional[str] = None,
        root_dir: Optional[str] = None,
        num_workers: Optional[int] = None,
        preprocessing_rules: Optional[List[Callable[[str], str]]] = None,
        tokenizer_params: Optional[Dict[str, Any]] = None,
        encode_params: Optional[Dict[str, Any]] = None,
        **kwargs,
    ):
        self.model_name = model_name
        self.use_fast_tokenizer = use_fast_tokenizer
        self.text_col = text_col
        self.root_dir = root_dir
        self.num_workers = num_workers
        self.preprocessing_rules = preprocessing_rules
        self.tokenizer_params = tokenizer_params if tokenizer_params is not None else {}
        self.encode_params = encode_params if encode_params is not None else {}

        self._multiprocessing = (
            num_workers is not None and num_workers > 1 and not use_fast_tokenizer
        )

        if kwargs:
            self.tokenizer_params.update(kwargs)

        self.tokenizer = get_tokenizer(
            model_name=self.model_name,
            use_fast_tokenizer=self.use_fast_tokenizer,
            **self.tokenizer_params,
        )

        # A HuggingFace tokenizer is already trained, since we need this
        # attribute elsewhere in the library, we simply set it to True
        self.is_fitted = True

    def encode(self, texts: List[str], **kwargs) -> npt.NDArray[np.int64]:
        """
        Encodes a list of texts. The method is a wrapper around the
        `batch_encode_plus` method of the HuggingFace's tokenizer.

        If multiprocessing is not used (or `use_fast_tokenizer` is True), the
        texts are encoded in one call to `batch_encode_plus`; otherwise each
        text is encoded in parallel via the tokenizer's `encode` method.

        Parameters
        ----------
        texts: List[str]
            List of texts to be encoded
        **kwargs
            Additional parameters to be passed to the `batch_encode_plus` method
            of the HuggingFace's tokenizer. If the 'encode_params' dict was passed
            when instantiating the class, that dictionary will be updated with
            the kwargs passed here.

        Returns
        -------
        np.array
            The encoded texts
        """
        if kwargs:
            self.encode_params.update(kwargs)

        if self.preprocessing_rules:
            if self._multiprocessing:
                texts = self._process_text_parallel(texts)
            else:
                texts = [self._preprocess_text(text) for text in texts]

        if self._multiprocessing:
            input_ids = self._encode_paralell(texts, **self.encode_params)
        else:
            encoded_texts = self.tokenizer.batch_encode_plus(
                texts,
                **self.encode_params,
            )
            input_ids = encoded_texts.get("input_ids")

        self.is_fitted = True

        try:
            output = np.array(input_ids)
        except ValueError:
            warnings.warn(
                "Padding and Truncating parameters were not passed and all input arrays "
                "do not have the same shape. Padding to the longest sequence. "
                "Padding will be done with the index of the pad token for the model",
                UserWarning,
            )
            max_len = max([len(ids) for ids in input_ids])
            output = np.array(
                [
                    # right-pad with the model's pad token id
                    np.pad(
                        ids,
                        (0, max_len - len(ids)),
                        constant_values=self.tokenizer.pad_token_id,
                    )
                    for ids in input_ids
                ]
            )

        return output

    def decode(
        self, input_ids: npt.NDArray[np.int64], skip_special_tokens: bool
    ) -> List[str]:
        """
        Decodes a list of input_ids. The method is a wrapper around the
        `convert_ids_to_tokens` and `convert_tokens_to_string` methods of the
        HuggingFace's tokenizer.

        Parameters
        ----------
        input_ids: npt.NDArray[np.int64]
            The input_ids to be decoded
        skip_special_tokens: bool
            Whether to skip the special tokens or not

        Returns
        -------
        List[str]
            The decoded texts
        """
        texts = [
            self.tokenizer.convert_tokens_to_string(
                self.tokenizer.convert_ids_to_tokens(input_ids[i], skip_special_tokens)
            )
            for i in range(input_ids.shape[0])
        ]
        return texts

    def fit(self, df: pd.DataFrame) -> "HFPreprocessor":
        """
        This method is included for consistency with the rest of the library
        in general and with the `BasePreprocessor` in particular. HuggingFace's
        tokenizers and models are already trained. Therefore, the 'fit' method
        here does nothing other than checking that the 'text_col' parameter is
        not `None`.

        Parameters
        ----------
        df: pd.DataFrame
            The dataframe containing the text data in the column specified by
            the 'text_col' parameter
        """
        if self.text_col is None:
            raise ValueError(
                "'text_col' is None. Please specify the column name containing the text data"
                " if you want to use the 'fit' method"
            )
        return self

    def transform(self, df: pd.DataFrame) -> npt.NDArray[np.int64]:
        """
        Encodes the text data in the input dataframe. This method simply
        calls the `encode` method under the hood. Similar to the `fit` method,
        this method is included for consistency with the rest of the library
        in general and with the `BasePreprocessor` in particular.

        Parameters
        ----------
        df: pd.DataFrame
            The dataframe containing the text data in the column specified by
            the 'text_col' parameter

        Returns
        -------
        np.array
            The encoded texts
        """
        if self.text_col is None:
            raise ValueError(
                "'text_col' is None. Please specify the column name containing the text data"
                " if you want to use the 'transform' method"
            )

        texts = self._read_texts(df, self.root_dir)

        return self.encode(texts)

    def transform_sample(self, text: str) -> npt.NDArray[np.int64]:
        """
        Encodes a single text sample.

        Parameters
        ----------
        text: str
            The text sample to be encoded

        Returns
        -------
        np.array
            The encoded text
        """

        if not self.is_fitted:
            raise ValueError(
                "The `encode` (or `fit`) method must be called before calling `transform_sample`"
            )
        return self.encode([text])[0]

    def fit_transform(self, df: pd.DataFrame) -> npt.NDArray[np.int64]:
        """
        Encodes the text data in the input dataframe.

        Parameters
        ----------
        df: pd.DataFrame
            The dataframe containing the text data in the column specified by
            the 'text_col' parameter

        Returns
        -------
        np.array
            The encoded texts
        """
        return self.fit(df).transform(df)

    def inverse_transform(
        self, input_ids: npt.NDArray[np.int64], skip_special_tokens: bool
    ) -> List[str]:
        """
        Decodes a list of input_ids. The method simply calls the `decode` method
        under the hood.

        Parameters
        ----------
        input_ids: npt.NDArray[np.int64]
            The input_ids to be decoded
        skip_special_tokens: bool
            Whether to skip the special tokens or not

        Returns
        -------
        List[str]
            The decoded texts
        """
        return self.decode(input_ids, skip_special_tokens)

    def _process_text_parallel(self, texts: List[str]) -> List[str]:
        num_processes = (
            self.num_workers if self.num_workers is not None else os.cpu_count()
        )
        with ProcessPoolExecutor(max_workers=num_processes) as executor:
            processed_texts = list(executor.map(self._preprocess_text, texts))
        return processed_texts

    def _preprocess_text(self, text: str) -> str:
        for rule in self.preprocessing_rules:
            text = rule(text)
        return text

    def _encode_paralell(self, texts: List[str], **kwargs) -> List[List[int]]:
        num_processes = (
            self.num_workers if self.num_workers is not None else os.cpu_count()
        )
        ptokenizer = partial(self.tokenizer.encode, **kwargs)
        with ProcessPoolExecutor(max_workers=num_processes) as executor:
            encoded_texts = list(executor.map(ptokenizer, texts))
        return encoded_texts

    def _read_texts(
        self, df: pd.DataFrame, root_dir: Optional[str] = None
    ) -> List[str]:
        if root_dir is not None:
            if not os.path.exists(root_dir):
                raise ValueError(
                    "root_dir does not exist. Please create it before fitting the preprocessor"
                )
            texts_fnames = df[self.text_col].tolist()
            texts: List[str] = []
            for texts_fname in texts_fnames:
                with open(os.path.join(root_dir, texts_fname), "r") as f:
                    texts.append(f.read().replace("\n", ""))
        else:
            texts = df[self.text_col].tolist()

        return texts

    def __repr__(self):
        return (
            f"HFPreprocessor(text_col={self.text_col}, model_name={self.model_name}, "
            f"use_fast_tokenizer={self.use_fast_tokenizer}, num_workers={self.num_workers}, "
            f"preprocessing_rules={self.preprocessing_rules}, tokenizer_params={self.tokenizer_params}, "
            f"encode_params={self.encode_params})"
        )
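`preprocessing_rules` is simply a list of `str -> str` callables that `_preprocess_text` applies in order before tokenization. A hedged example with two hypothetical cleaning rules (neither ships with the library):

```python
import re

def strip_html(text: str) -> str:
    # replace html tags with a space (illustrative rule, not part of the library)
    return re.sub(r"<[^>]+>", " ", text)

def collapse_spaces(text: str) -> str:
    return " ".join(text.split())

preprocessing_rules = [str.lower, strip_html, collapse_spaces]

def apply_rules(text, rules):
    # same loop as HFPreprocessor._preprocess_text
    for rule in rules:
        text = rule(text)
    return text

print(apply_rules("Hello <b>World</b>", preprocessing_rules))  # hello world
```

These rules could then be passed as `HFPreprocessor(model_name=..., text_col=..., preprocessing_rules=preprocessing_rules)`.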

encode

encode(texts, **kwargs)

Encodes a list of texts. The method is a wrapper around the batch_encode_plus method of the HuggingFace's tokenizer.

If multiprocessing is not used (or use_fast_tokenizer is True), the texts are encoded in one call to batch_encode_plus; otherwise each text is encoded in parallel via the tokenizer's encode method.

Parameters:

Name Type Description Default
texts List[str]

List of texts to be encoded

required
**kwargs

Additional parameters to be passed to the batch_encode_plus method of the HuggingFace's tokenizer. If the 'encode_params' dict was passed when instantiating the class, that dictionary will be updated with the kwargs passed here.

{}

Returns:

Type Description
array

The encoded texts

Source code in pytorch_widedeep/preprocessing/hf_preprocessor.py
def encode(self, texts: List[str], **kwargs) -> npt.NDArray[np.int64]:
    """
    Encodes a list of texts. The method is a wrapper around the
    `batch_encode_plus` method of the HuggingFace's tokenizer.

    If multiprocessing is not used (or `use_fast_tokenizer` is True), the
    texts are encoded in one call to `batch_encode_plus`; otherwise each
    text is encoded in parallel via the tokenizer's `encode` method.

    Parameters
    ----------
    texts: List[str]
        List of texts to be encoded
    **kwargs
        Additional parameters to be passed to the `batch_encode_plus` method
        of the HuggingFace tokenizer. If the 'encode_params' dict was passed
        when instantiating the class, that dictionary will be updated with
        the kwargs passed here.

    Returns
    -------
    np.array
        The encoded texts
    """
    if kwargs:
        self.encode_params.update(kwargs)

    if self.preprocessing_rules:
        if self._multiprocessing:
            texts = self._process_text_parallel(texts)
        else:
            texts = [self._preprocess_text(text) for text in texts]

    if self._multiprocessing:
        input_ids = self._encode_paralell(texts, **self.encode_params)
    else:
        encoded_texts = self.tokenizer.batch_encode_plus(
            texts,
            **self.encode_params,
        )
        input_ids = encoded_texts.get("input_ids")

    self.is_fitted = True

    try:
        output = np.array(input_ids)
    except ValueError:
        warnings.warn(
            "Padding and Truncating parameters were not passed and all input arrays "
            "do not have the same shape. Padding to the longest sequence. "
            "Padding will be done with the index of the pad token for the model",
            UserWarning,
        )
        max_len = max([len(ids) for ids in input_ids])
        output = np.array(
            [
                # right-pad with the pad token id up to max_len
                np.pad(
                    ids,
                    (0, max_len - len(ids)),
                    constant_values=self.tokenizer.pad_token_id,
                )
                for ids in input_ids
            ]
        )

    return output
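The fallback branch at the end of encode right-pads ragged token id lists to the longest sequence. A minimal standalone sketch of that padding step, assuming a pad token id of 0:

```python
import numpy as np

# Ragged token id lists, as returned by batch_encode_plus without padding
input_ids = [[5, 6, 7], [5, 6], [5]]
pad_token_id = 0  # assumption: the model's pad token id

max_len = max(len(ids) for ids in input_ids)
# Right-pad every sequence to the longest one with the pad token id
output = np.array(
    [np.pad(ids, (0, max_len - len(ids)), constant_values=pad_token_id)
     for ids in input_ids]
)
print(output.shape)  # (3, 3)
```

The real method takes the pad token id from the tokenizer itself (tokenizer.pad_token_id).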

decode

decode(input_ids, skip_special_tokens)

Decodes a list of input_ids. The method is a wrapper around the convert_ids_to_tokens and convert_tokens_to_string methods of the HuggingFace tokenizer.

Parameters:

Name Type Description Default
input_ids NDArray[int64]

The input_ids to be decoded

required
skip_special_tokens bool

Whether to skip the special tokens or not

required

Returns:

Type Description
List[str]

The decoded texts

Source code in pytorch_widedeep/preprocessing/hf_preprocessor.py
def decode(
    self, input_ids: npt.NDArray[np.int64], skip_special_tokens: bool
) -> List[str]:
    """
    Decodes a list of input_ids. The method is a wrapper around the
    `convert_ids_to_tokens` and `convert_tokens_to_string` methods of the
    HuggingFace tokenizer.

    Parameters
    ----------
    input_ids: npt.NDArray[np.int64]
        The input_ids to be decoded
    skip_special_tokens: bool
        Whether to skip the special tokens or not

    Returns
    -------
    List[str]
        The decoded texts
    """
    texts = [
        self.tokenizer.convert_tokens_to_string(
            self.tokenizer.convert_ids_to_tokens(input_ids[i], skip_special_tokens)
        )
        for i in range(input_ids.shape[0])
    ]
    return texts
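The two tokenizer calls compose as ids → tokens → string. A toy sketch of that composition with a made-up vocabulary (the real method delegates both steps to the HuggingFace tokenizer):

```python
# Hypothetical toy vocabulary; [PAD] and [CLS] stand in for special tokens
vocab = {0: "[PAD]", 1: "[CLS]", 2: "hello", 3: "world"}

def toy_decode(ids, skip_special_tokens):
    tokens = [vocab[i] for i in ids]  # convert_ids_to_tokens
    if skip_special_tokens:
        tokens = [t for t in tokens if not t.startswith("[")]
    return " ".join(tokens)  # convert_tokens_to_string

print(toy_decode([1, 2, 3, 0], skip_special_tokens=True))  # hello world
```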

fit

fit(df)

This method is included for consistency with the rest of the library in general and with the BasePreprocessor in particular. HuggingFace's tokenizers and models are already trained. Therefore, the 'fit' method here does nothing other than checking that the 'text_col' parameter is not None.

Parameters:

Name Type Description Default
df DataFrame

The dataframe containing the text data in the column specified by the 'text_col' parameter

required
Source code in pytorch_widedeep/preprocessing/hf_preprocessor.py
def fit(self, df: pd.DataFrame) -> "HFPreprocessor":
    """
    This method is included for consistency with the rest of the library
    in general and with the `BasePreprocessor` in particular. HuggingFace's
    tokenizers and models are already trained. Therefore, the 'fit' method
    here does nothing other than checking that the 'text_col' parameter is
    not `None`.

    Parameters
    ----------
    df: pd.DataFrame
        The dataframe containing the text data in the column specified by
        the 'text_col' parameter
    """
    if self.text_col is None:
        raise ValueError(
            "'text_col' is None. Please specify the column name containing the text data"
            " if you want to use the 'fit' method"
        )
    return self

transform

transform(df)

Encodes the text data in the input dataframe. This method simply calls the encode method under the hood. Similar to the fit method, this method is included for consistency with the rest of the library in general and with the BasePreprocessor in particular.

Parameters:

Name Type Description Default
df DataFrame

The dataframe containing the text data in the column specified by the 'text_col' parameter

required

Returns:

Type Description
array

The encoded texts

Source code in pytorch_widedeep/preprocessing/hf_preprocessor.py
def transform(self, df: pd.DataFrame) -> npt.NDArray[np.int64]:
    """
    Encodes the text data in the input dataframe. This method simply
    calls the `encode` method under the hood. Similar to the `fit` method,
    this method is included for consistency with the rest of the library
    in general and with the `BasePreprocessor` in particular.

    Parameters
    ----------
    df: pd.DataFrame
        The dataframe containing the text data in the column specified by
        the 'text_col' parameter

    Returns
    -------
    np.array
        The encoded texts
    """
    if self.text_col is None:
        raise ValueError(
            "'text_col' is None. Please specify the column name containing the text data"
            " if you want to use the 'fit' method"
        )

    texts = self._read_texts(df, self.root_dir)

    return self.encode(texts)

transform_sample

transform_sample(text)

Encodes a single text sample.

Parameters:

Name Type Description Default
text str

The text sample to be encoded

required

Returns:

Type Description
array

The encoded text

Source code in pytorch_widedeep/preprocessing/hf_preprocessor.py
def transform_sample(self, text: str) -> npt.NDArray[np.int64]:
    """
    Encodes a single text sample.

    Parameters
    ----------
    text: str
        The text sample to be encoded

    Returns
    -------
    np.array
        The encoded text
    """

    if not self.is_fitted:
        raise ValueError(
            "The `encode` (or `fit`) method must be called before calling `transform_sample`"
        )
    return self.encode([text])[0]

fit_transform

fit_transform(df)

Encodes the text data in the input dataframe.

Parameters:

Name Type Description Default
df DataFrame

The dataframe containing the text data in the column specified by the 'text_col' parameter

required

Returns:

Type Description
array

The encoded texts

Source code in pytorch_widedeep/preprocessing/hf_preprocessor.py
def fit_transform(self, df: pd.DataFrame) -> npt.NDArray[np.int64]:
    """
    Encodes the text data in the input dataframe.

    Parameters
    ----------
    df: pd.DataFrame
        The dataframe containing the text data in the column specified by
        the 'text_col' parameter

    Returns
    -------
    np.array
        The encoded texts
    """
    return self.fit(df).transform(df)

inverse_transform

inverse_transform(input_ids, skip_special_tokens)

Decodes a list of input_ids. The method simply calls the decode method under the hood.

Parameters:

Name Type Description Default
input_ids NDArray[int64]

The input_ids to be decoded

required
skip_special_tokens bool

Whether to skip the special tokens or not

required

Returns:

Type Description
List[str]

The decoded texts

Source code in pytorch_widedeep/preprocessing/hf_preprocessor.py
def inverse_transform(
    self, input_ids: npt.NDArray[np.int64], skip_special_tokens: bool
) -> List[str]:
    """
    Decodes a list of input_ids. The method simply calls the `decode` method
    under the hood.

    Parameters
    ----------
    input_ids: npt.NDArray[np.int64]
        The input_ids to be decoded
    skip_special_tokens: bool
        Whether to skip the special tokens or not

    Returns
    -------
    List[str]
        The decoded texts
    """
    return self.decode(input_ids, skip_special_tokens)

ImagePreprocessor

Bases: BasePreprocessor

Preprocessor to prepare the deepimage input dataset.

The preprocessing simply consists of resizing the images according to their aspect ratio.

Parameters:

Name Type Description Default
img_col str

name of the column with the images filenames

required
img_path str

path to the directory where the images are stored

required
width int

width of the resulting processed image.

224
height int

height of the resulting processed image.

224
verbose int

Enable verbose output.

1

Attributes:

Name Type Description
aap AspectAwarePreprocessor

an instance of pytorch_widedeep.utils.image_utils.AspectAwarePreprocessor

spp SimplePreprocessor

an instance of pytorch_widedeep.utils.image_utils.SimplePreprocessor

normalise_metrics Dict

Dict containing the normalisation metrics of the image dataset, i.e. mean and std for the R, G and B channels

Examples:

>>> import pandas as pd
>>>
>>> from pytorch_widedeep.preprocessing import ImagePreprocessor
>>>
>>> path_to_image1 = 'tests/test_data_utils/images/galaxy1.png'
>>> path_to_image2 = 'tests/test_data_utils/images/galaxy2.png'
>>>
>>> df_train = pd.DataFrame({'images_column': [path_to_image1]})
>>> df_test = pd.DataFrame({'images_column': [path_to_image2]})
>>> img_preprocessor = ImagePreprocessor(img_col='images_column', img_path='.', verbose=0)
>>> resized_images = img_preprocessor.fit_transform(df_train)
>>> new_resized_images = img_preprocessor.transform(df_train)

ℹ️ NOTE: Normalising metrics will only be computed when the fit_transform method is run. Running transform only will not change the computed metrics and running fit only simply instantiates the resizing functions.

Source code in pytorch_widedeep/preprocessing/image_preprocessor.py
class ImagePreprocessor(BasePreprocessor):
    r"""Preprocessor to prepare the ``deepimage`` input dataset.

    The preprocessing simply consists of resizing the images according
    to their aspect ratio

    Parameters
    ----------
    img_col: str
        name of the column with the images filenames
    img_path: str
        path to the directory where the images are stored
    width: int, default=224
        width of the resulting processed image.
    height: int, default=224
        height of the resulting processed image.
    verbose: int, default 1
        Enable verbose output.

    Attributes
    ----------
    aap: AspectAwarePreprocessor
        an instance of `pytorch_widedeep.utils.image_utils.AspectAwarePreprocessor`
    spp: SimplePreprocessor
        an instance of `pytorch_widedeep.utils.image_utils.SimplePreprocessor`
    normalise_metrics: Dict
        Dict containing the normalisation metrics of the image dataset, i.e.
        mean and std for the R, G and B channels

    Examples
    --------
    >>> import pandas as pd
    >>>
    >>> from pytorch_widedeep.preprocessing import ImagePreprocessor
    >>>
    >>> path_to_image1 = 'tests/test_data_utils/images/galaxy1.png'
    >>> path_to_image2 = 'tests/test_data_utils/images/galaxy2.png'
    >>>
    >>> df_train = pd.DataFrame({'images_column': [path_to_image1]})
    >>> df_test = pd.DataFrame({'images_column': [path_to_image2]})
    >>> img_preprocessor = ImagePreprocessor(img_col='images_column', img_path='.', verbose=0)
    >>> resized_images = img_preprocessor.fit_transform(df_train)
    >>> new_resized_images = img_preprocessor.transform(df_train)

    :information_source: **NOTE**:
    Normalising metrics will only be computed when the ``fit_transform``
    method is run. Running ``transform`` only will not change the computed
    metrics and running ``fit`` only simply instantiates the resizing
    functions.
    """

    def __init__(
        self,
        img_col: str,
        img_path: str,
        width: int = 224,
        height: int = 224,
        verbose: int = 1,
    ):
        super(ImagePreprocessor, self).__init__()

        self.img_col = img_col
        self.img_path = img_path
        self.width = width
        self.height = height
        self.verbose = verbose

        self.aap = AspectAwarePreprocessor(self.width, self.height)
        self.spp = SimplePreprocessor(self.width, self.height)

        self.compute_normalising_computed = False

    def fit(self, df: pd.DataFrame) -> BasePreprocessor:
        # simply include this method to comply with the BasePreprocessor and
        # for consistency with the rest of the preprocessors
        return self

    def transform_sample(self, img: np.ndarray) -> np.ndarray:
        aspect_r = img.shape[0] / img.shape[1]
        if aspect_r != 1.0:
            resized_img = self.aap.preprocess(img)
        else:
            resized_img = self.spp.preprocess(img)
        return resized_img

    def transform(self, df: pd.DataFrame) -> np.ndarray:
        """Resizes the images to the input height and width.


        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe with the `img_col`

        Returns
        -------
        np.ndarray
            Resized images to the input height and width
        """
        image_list = df[self.img_col].tolist()
        if self.verbose:
            print("Reading Images from {}".format(self.img_path))
        imgs = [cv2.imread("/".join([self.img_path, img])) for img in image_list]

        # finding images with different height and width
        aspect = [(im.shape[0], im.shape[1]) for im in imgs]
        aspect_r = [a[0] / a[1] for a in aspect]
        diff_idx = [i for i, r in enumerate(aspect_r) if r != 1.0]

        if self.verbose:
            print("Resizing")
        resized_imgs = []
        for i, img in tqdm(enumerate(imgs), total=len(imgs), disable=self.verbose != 1):
            if i in diff_idx:
                resized_imgs.append(self.aap.preprocess(img))
            else:
                # if aspect ratio is 1:1, no need for AspectAwarePreprocessor
                resized_imgs.append(self.spp.preprocess(img))

        if not self.compute_normalising_computed:
            if self.verbose:
                print("Computing normalisation metrics")
            # mean and std deviation will only be computed when the fit method
            # is called
            mean_R, mean_G, mean_B = [], [], []
            std_R, std_G, std_B = [], [], []
            for rsz_img in resized_imgs:
                (mean_b, mean_g, mean_r), (std_b, std_g, std_r) = cv2.meanStdDev(
                    rsz_img
                )
                mean_R.append(mean_r)
                mean_G.append(mean_g)
                mean_B.append(mean_b)
                std_R.append(std_r)
                std_G.append(std_g)
                std_B.append(std_b)
            self.normalise_metrics = dict(
                mean={
                    "R": np.mean(mean_R) / 255.0,
                    "G": np.mean(mean_G) / 255.0,
                    "B": np.mean(mean_B) / 255.0,
                },
                std={
                    "R": np.mean(std_R) / 255.0,
                    "G": np.mean(std_G) / 255.0,
                    "B": np.mean(std_B) / 255.0,
                },
            )
            self.compute_normalising_computed = True
        return np.asarray(resized_imgs)

    def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
        """Combines `fit` and `transform`

        Parameters
        ----------
        df: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        np.ndarray
            Resized images to the input height and width
        """
        return self.fit(df).transform(df)

    def inverse_transform(self, transformed_image):
        raise NotImplementedError(
            "'inverse_transform' method is not implemented for 'ImagePreprocessor'"
        )

    def __repr__(self) -> str:
        list_of_params: List[str] = []
        list_of_params.append("img_col={img_col}")
        list_of_params.append("img_path={img_path}")
        list_of_params.append("width={width}")
        list_of_params.append("height={height}")
        list_of_params.append("verbose={verbose}")
        all_params = ", ".join(list_of_params)
        return f"ImagePreprocessor({all_params.format(**self.__dict__)})"
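The decision in transform_sample depends only on the image's aspect ratio. A minimal sketch of that branch, with hypothetical array shapes standing in for real images:

```python
import numpy as np

def pick_resizer(img: np.ndarray) -> str:
    # Square images can use the simple resizer; any other aspect ratio
    # goes through the aspect-aware path to avoid distortion
    aspect_r = img.shape[0] / img.shape[1]
    return "simple" if aspect_r == 1.0 else "aspect_aware"

print(pick_resizer(np.zeros((224, 224, 3))))  # simple
print(pick_resizer(np.zeros((100, 224, 3))))  # aspect_aware
```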

transform

transform(df)

Resizes the images to the input height and width.

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe with the img_col

required

Returns:

Type Description
ndarray

Resized images to the input height and width

Source code in pytorch_widedeep/preprocessing/image_preprocessor.py
def transform(self, df: pd.DataFrame) -> np.ndarray:
    """Resizes the images to the input height and width.


    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe with the `img_col`

    Returns
    -------
    np.ndarray
        Resized images to the input height and width
    """
    image_list = df[self.img_col].tolist()
    if self.verbose:
        print("Reading Images from {}".format(self.img_path))
    imgs = [cv2.imread("/".join([self.img_path, img])) for img in image_list]

    # finding images with different height and width
    aspect = [(im.shape[0], im.shape[1]) for im in imgs]
    aspect_r = [a[0] / a[1] for a in aspect]
    diff_idx = [i for i, r in enumerate(aspect_r) if r != 1.0]

    if self.verbose:
        print("Resizing")
    resized_imgs = []
    for i, img in tqdm(enumerate(imgs), total=len(imgs), disable=self.verbose != 1):
        if i in diff_idx:
            resized_imgs.append(self.aap.preprocess(img))
        else:
            # if aspect ratio is 1:1, no need for AspectAwarePreprocessor
            resized_imgs.append(self.spp.preprocess(img))

    if not self.compute_normalising_computed:
        if self.verbose:
            print("Computing normalisation metrics")
        # mean and std deviation will only be computed when the fit method
        # is called
        mean_R, mean_G, mean_B = [], [], []
        std_R, std_G, std_B = [], [], []
        for rsz_img in resized_imgs:
            (mean_b, mean_g, mean_r), (std_b, std_g, std_r) = cv2.meanStdDev(
                rsz_img
            )
            mean_R.append(mean_r)
            mean_G.append(mean_g)
            mean_B.append(mean_b)
            std_R.append(std_r)
            std_G.append(std_g)
            std_B.append(std_b)
        self.normalise_metrics = dict(
            mean={
                "R": np.mean(mean_R) / 255.0,
                "G": np.mean(mean_G) / 255.0,
                "B": np.mean(mean_B) / 255.0,
            },
            std={
                "R": np.mean(std_R) / 255.0,
                "G": np.mean(std_G) / 255.0,
                "B": np.mean(std_B) / 255.0,
            },
        )
        self.compute_normalising_computed = True
    return np.asarray(resized_imgs)
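The normalisation metrics are per-channel statistics averaged over images and rescaled to [0, 1]. A numpy-only sketch of the same idea for a hypothetical batch of same-sized 8-bit BGR images (the real method computes them per image with cv2.meanStdDev):

```python
import numpy as np

# Hypothetical batch of 8-bit images with shape (n, H, W, 3), BGR order
imgs = np.full((4, 32, 32, 3), 128, dtype=np.uint8)

# Per-image channel means/stds, averaged over the batch, scaled to [0, 1]
means = imgs.reshape(len(imgs), -1, 3).mean(axis=1)  # (n, 3)
stds = imgs.reshape(len(imgs), -1, 3).std(axis=1)    # (n, 3)
normalise_metrics = {
    "mean": dict(zip("BGR", means.mean(axis=0) / 255.0)),
    "std": dict(zip("BGR", stds.mean(axis=0) / 255.0)),
}
print(round(normalise_metrics["mean"]["R"], 3))  # 0.502
```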

fit_transform

fit_transform(df)

Combines fit and transform

Parameters:

Name Type Description Default
df DataFrame

Input pandas dataframe

required

Returns:

Type Description
ndarray

Resized images to the input height and width

Source code in pytorch_widedeep/preprocessing/image_preprocessor.py
def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        Resized images to the input height and width
    """
    return self.fit(df).transform(df)

DINPreprocessor

Bases: BasePreprocessor

Preprocessor for Deep Interest Network (DIN) models.

This preprocessor handles the preparation of data for DIN models, including sequence building, label encoding, and handling of various types of input columns (categorical, continuous, and sequential).

Parameters:

Name Type Description Default
user_id_col str

Name of the column containing user IDs.

required
item_embed_col Union[str, Tuple[str, int]]

Name of the column containing item IDs to be embedded, or a tuple of (column_name, embedding_dim). If embedding_dim is not provided, it will be automatically calculated using a rule of thumb specified by embedding_rule. Aliased as item_id_col.

required
target_col str

Name of the column containing the target variable.

required
max_seq_length int

Maximum length of sequences to be created.

required
action_col Optional[str]

Name of the column containing user actions (if applicable), e.g. a rating or purchased/not-purchased. This 'action_col' can also be the same as the 'target_col'.

None
other_seq_embed_cols Optional[List[str] | List[Tuple[str, int]]]

List of other columns to be treated as sequences. Each element in the list can be a string with the name of the column or a tuple with the name of the column and the embedding dimension.

None
cat_embed_cols Optional[Union[List[str], List[Tuple[str, int]]]]

This and the following parameters are identical to those used in the TabPreprocessor.
List containing the name of the categorical columns that will be represented by embeddings (e.g. ['education', 'relationship', ...]) or a Tuple with the name and the embedding dimension (e.g.: [ ('education',32), ('relationship',16), ...])

None
continuous_cols Optional[List[str]]

List with the name of the continuous cols

None
quantization_setup Optional[Union[int, Dict[str, Union[int, List[float]]]]]

Continuous columns can be turned into categorical via pd.cut. If quantization_setup is an int, all continuous columns will be quantized using this value as the number of bins. Alternatively, a dictionary where the keys are the column names to quantize and the values are either integers indicating the number of bins or lists of scalars indicating the bin edges can also be used.

None
cols_to_scale Optional[Union[List[str], str]]

List with the names of the columns that will be standardised via sklearn's StandardScaler. It can also be the string 'all' in which case all the continuous cols will be scaled.

None
auto_embed_dim bool

Boolean indicating whether the embedding dimensions will be automatically defined via rule of thumb. See embedding_rule below.

True
embedding_rule Literal['google', 'fastai_old', 'fastai_new']

If auto_embed_dim=True, this is the choice of embedding rule of thumb. Choices are:

  • fastai_new: \(min(600, round(1.6 \times n_{cat}^{0.56}))\)

  • fastai_old: \(min(50, (n_{cat}//{2})+1)\)

  • google: \(min(600, round(n_{cat}^{0.24}))\)

'fastai_new'
default_embed_dim int

Dimension for the embeddings if the embedding dimension is not provided in the cat_embed_cols parameter and auto_embed_dim is set to False.

16
verbose int

Verbosity level.

1
scale bool

ℹ️ note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
Bool indicating whether or not to scale/standardise continuous cols. It is important to emphasize that all the DL models for tabular data in the library also include the possibility of normalising the input continuous features via a BatchNorm or a LayerNorm.
Param alias: scale_cont_cols.

False
already_standard Optional[List[str]]

ℹ️ note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
List with the name of the continuous cols that do not need to be scaled/standardised.

None

Other Parameters:

Name Type Description
**kwargs

pd.cut and StandardScaler related args

Attributes:

Name Type Description
is_fitted bool

Whether the preprocessor has been fitted.

has_standard_tab_data bool

Whether the data includes standard tabular data.

tab_preprocessor TabPreprocessor

Preprocessor for standard tabular data (if applicable).

din_columns_idx Dict[str, int]

Dictionary mapping column names to their indices in the processed data.

item_le LabelEncoder

Label encoder for item IDs.

n_items int

Number of unique items.

user_behaviour_config Tuple[List[str], int, int]

Configuration for user behavior sequences.

action_le LabelEncoder

Label encoder for action column (if applicable).

n_actions int

Number of unique actions (if applicable).

action_seq_config Tuple[List[str], int]

Configuration for action sequences (if applicable).

other_seq_le LabelEncoder

Label encoder for other sequence columns (if applicable).

n_other_seq_cols Dict[str, int]

Number of unique values in each other sequence column.

other_seq_config List[Tuple[List[str], int, int]]

Configuration for other sequence columns.

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import DINPreprocessor
>>> data = {
...    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
...    'item_id': [101, 102, 103, 101, 103, 104, 102, 103, 104],
...    'timestamp': [1, 2, 3, 1, 2, 3, 1, 2, 3],
...    'category': ['A', 'B', 'A', 'B', 'A', 'C', 'B', 'A', 'C'],
...    'price': [10.5, 15.0, 12.0, 10.5, 12.0, 20.0, 15.0, 12.0, 20.0],
...    'rating': [0, 1, 0, 1, 0, 1, 0, 1, 0]
... }
>>> df = pd.DataFrame(data)
>>> din_preprocessor = DINPreprocessor(
...     user_id_col='user_id',
...     item_embed_col='item_id',
...     target_col='rating',
...     max_seq_length=2,
...     action_col='rating',
...     other_seq_embed_cols=['category'],
...     cat_embed_cols=['user_id'],
...     continuous_cols=['price'],
...     cols_to_scale=['price']
... )
>>> X, y = din_preprocessor.fit_transform(df)
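The three embedding_rule choices described in the parameters above are simple closed-form rules of thumb. A sketch of them as a standalone helper (the library's own implementation may differ in details):

```python
def embed_dim(n_cat: int, rule: str = "fastai_new") -> int:
    # Rules of thumb for the embedding dimension given n_cat categories
    if rule == "fastai_new":
        return min(600, round(1.6 * n_cat ** 0.56))
    if rule == "fastai_old":
        return min(50, n_cat // 2 + 1)
    if rule == "google":
        return min(600, round(n_cat ** 0.24))
    raise ValueError(f"unknown rule: {rule}")

print(embed_dim(100, "fastai_old"))  # 50
```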
Source code in pytorch_widedeep/preprocessing/din_preprocessor.py
class DINPreprocessor(BasePreprocessor):
    """
    Preprocessor for Deep Interest Network (DIN) models.

    This preprocessor handles the preparation of data for DIN models,
    including sequence building, label encoding, and handling of various
    types of input columns (categorical, continuous, and sequential).

    Parameters
    ----------
    user_id_col : str
        Name of the column containing user IDs.
    item_embed_col : Union[str, Tuple[str, int]]
        Name of the column containing item IDs to be embedded, or a tuple of
        (column_name, embedding_dim). If embedding_dim is not provided, it
        will be automatically calculated using a rule of thumb specified by
        `embedding_rule`. Aliased as `item_id_col`.
    target_col : str
        Name of the column containing the target variable.
    max_seq_length : int
        Maximum length of sequences to be created.
    action_col : Optional[str], default=None
        Name of the column containing user actions (if applicable), e.g. a
        rating or purchased/not-purchased. This 'action_col' can also be the
        same as the 'target_col'.
    other_seq_embed_cols : Optional[List[str] | List[Tuple[str, int]]], default=None
        List of other columns to be treated as sequences. Each element in the
        list can be a string with the name of the column or a tuple with the
        name of the column and the embedding dimension.

    cat_embed_cols : Optional[Union[List[str], List[Tuple[str, int]]]], default=None
        This and the following parameters are identical to those used in the
        TabPreprocessor.<br>
        List containing the name of the categorical columns that will be
        represented by embeddings (e.g. _['education', 'relationship', ...]_) or
        a Tuple with the name and the embedding dimension (e.g.: _[
        ('education',32), ('relationship',16), ...]_)
    continuous_cols: List, default = None
        List with the name of the continuous cols
    quantization_setup: int or Dict, default = None
        Continuous columns can be turned into categorical via `pd.cut`. If
        `quantization_setup` is an `int`, all continuous columns will be
        quantized using this value as the number of bins. Alternatively, a
        dictionary where the keys are the column names to quantize and the
        values are either integers indicating the number of bins or a
        list of scalars indicating the bin edges can also be used.
    cols_to_scale: List or str, default = None,
        List with the names of the columns that will be standardised via
        sklearn's `StandardScaler`. It can also be the string `'all'` in
        which case all the continuous cols will be scaled.
    auto_embed_dim: bool, default = True
        Boolean indicating whether the embedding dimensions will be
        automatically defined via rule of thumb. See `embedding_rule`
        below.
    embedding_rule: str, default = 'fastai_new'
        If `auto_embed_dim=True`, this is the choice of embedding rule of
        thumb. Choices are:

        - _fastai_new_: $\min(600, \mathrm{round}(1.6 \times n_{cat}^{0.56}))$

        - _fastai_old_: $\min(50, (n_{cat}//2)+1)$

        - _google_: $\min(600, \mathrm{round}(n_{cat}^{0.24}))$
    default_embed_dim: int, default=16
        Dimension for the embeddings if the embedding dimension is not
        provided in the `cat_embed_cols` parameter and `auto_embed_dim` is
        set to `False`.
    verbose : int, default=1
        Verbosity level.
    scale: bool, default = False
        :information_source: **note**: this arg will be removed in upcoming
         releases. Please use `cols_to_scale` instead. <br/> Bool indicating
         whether or not to scale/standardise continuous cols. It is important
         to emphasize that all the DL models for tabular data in the library
         also include the possibility of normalising the input continuous
         features via a `BatchNorm` or a `LayerNorm`. <br/> Param alias:
         `scale_cont_cols`.
    already_standard: List, default = None
        :information_source: **note**: this arg will be removed in upcoming
         releases. Please use `cols_to_scale` instead. <br/> List with the
         name of the continuous cols that do not need to be
         scaled/standardised.

    Other Parameters
    ----------------
    **kwargs: dict
        `pd.cut` and `StandardScaler` related args

    Attributes
    ----------
    is_fitted : bool
        Whether the preprocessor has been fitted.
    has_standard_tab_data : bool
        Whether the data includes standard tabular data.
    tab_preprocessor : TabPreprocessor
        Preprocessor for standard tabular data (if applicable).
    din_columns_idx : Dict[str, int]
        Dictionary mapping column names to their indices in the processed data.
    item_le : LabelEncoder
        Label encoder for item IDs.
    n_items : int
        Number of unique items.
    user_behaviour_config : Tuple[List[str], int, int]
        Configuration for user behavior sequences.
    action_le : LabelEncoder
        Label encoder for action column (if applicable).
    n_actions : int
        Number of unique actions (if applicable).
    action_seq_config : Tuple[List[str], int]
        Configuration for action sequences (if applicable).
    other_seq_le : LabelEncoder
        Label encoder for other sequence columns (if applicable).
    n_other_seq_cols : Dict[str, int]
        Number of unique values in each other sequence column.
    other_seq_config : List[Tuple[List[str], int, int]]
        Configuration for other sequence columns.

    Examples
    --------
    >>> import pandas as pd
    >>> from pytorch_widedeep.preprocessing import DINPreprocessor
    >>> data = {
    ...    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    ...    'item_id': [101, 102, 103, 101, 103, 104, 102, 103, 104],
    ...    'timestamp': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    ...    'category': ['A', 'B', 'A', 'B', 'A', 'C', 'B', 'A', 'C'],
    ...    'price': [10.5, 15.0, 12.0, 10.5, 12.0, 20.0, 15.0, 12.0, 20.0],
    ...    'rating': [0, 1, 0, 1, 0, 1, 0, 1, 0]
    ... }
    >>> df = pd.DataFrame(data)
    >>> din_preprocessor = DINPreprocessor(
    ...     user_id_col='user_id',
    ...     item_embed_col='item_id',
    ...     target_col='rating',
    ...     max_seq_length=2,
    ...     action_col='rating',
    ...     other_seq_embed_cols=['category'],
    ...     cat_embed_cols=['user_id'],
    ...     continuous_cols=['price'],
    ...     cols_to_scale=['price']
    ... )
    >>> X, y = din_preprocessor.fit_transform(df)
    """

    @alias("item_embed_col", ["item_id_col"])
    def __init__(
        self,
        *,
        user_id_col: str,
        item_embed_col: Union[str, Tuple[str, int]],
        target_col: str,
        max_seq_length: int,
        action_col: Optional[str] = None,
        other_seq_embed_cols: Optional[Union[List[str], List[Tuple[str, int]]]] = None,
        cat_embed_cols: Optional[Union[List[str], List[Tuple[str, int]]]] = None,
        continuous_cols: Optional[List[str]] = None,
        quantization_setup: Optional[
            Union[int, Dict[str, Union[int, List[float]]]]
        ] = None,
        cols_to_scale: Optional[Union[List[str], str]] = None,
        auto_embed_dim: bool = True,
        embedding_rule: Literal["google", "fastai_old", "fastai_new"] = "fastai_new",
        default_embed_dim: int = 16,
        verbose: int = 1,
        scale: bool = False,
        already_standard: Optional[List[str]] = None,
        **kwargs,
    ):
        self.user_id_col = user_id_col
        self.item_embed_col = item_embed_col
        self.max_seq_length = max_seq_length
        self.target_col = target_col if target_col is not None else "target"
        self.action_col = action_col
        self.other_seq_embed_cols = other_seq_embed_cols
        self.cat_embed_cols = cat_embed_cols
        self.continuous_cols = continuous_cols
        self.quantization_setup = quantization_setup
        self.cols_to_scale = cols_to_scale
        self.auto_embed_dim = auto_embed_dim
        self.embedding_rule = embedding_rule
        self.default_embed_dim = default_embed_dim
        self.verbose = verbose
        self.scale = scale
        self.already_standard = already_standard
        self.kwargs = kwargs

        self.has_standard_tab_data = bool(self.cat_embed_cols or self.continuous_cols)
        if self.has_standard_tab_data:
            self.tab_preprocessor = TabPreprocessor(
                cat_embed_cols=self.cat_embed_cols,
                continuous_cols=self.continuous_cols,
                quantization_setup=self.quantization_setup,
                cols_to_scale=self.cols_to_scale,
                auto_embed_dim=self.auto_embed_dim,
                embedding_rule=self.embedding_rule,
                default_embed_dim=self.default_embed_dim,
                verbose=self.verbose,
                scale=self.scale,
                already_standard=self.already_standard,
                **self.kwargs,
            )

        self.is_fitted = False

    def fit(self, df: pd.DataFrame) -> "DINPreprocessor":
        if self.has_standard_tab_data:
            self.tab_preprocessor.fit(df)
            self.din_columns_idx = {
                col: i for i, col in enumerate(self.tab_preprocessor.column_idx.keys())
            }
        else:
            self.din_columns_idx = {}

        self._fit_label_encoders(df)
        self.is_fitted = True
        return self

    def transform(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
        _df = self._pre_transform(df)
        df_w_sequences = self._build_sequences(_df)
        X_all = self._concatenate_features(df_w_sequences)
        return X_all, np.array(df_w_sequences[self.target_col].tolist())

    def fit_transform(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
        self.fit(df)
        return self.transform(df)

    def inverse_transform(self, X: np.ndarray) -> pd.DataFrame:
        raise NotImplementedError(
            "inverse_transform is not implemented for this preprocessor"
        )

    def _fit_label_encoders(self, df: pd.DataFrame) -> None:
        self.item_le = LabelEncoder(columns_to_encode=[self._get_item_col()])
        self.item_le.fit(df)
        self.n_items = max(self.item_le.encoding_dict[self._get_item_col()].values())
        user_behaviour_embed_size = self._get_embed_size(
            self.item_embed_col, self.n_items
        )
        self.user_behaviour_config = (
            [f"item_{i+1}" for i in range(self.max_seq_length)],
            self.n_items,
            user_behaviour_embed_size,
        )

        self._update_din_columns_idx("item", self.max_seq_length)
        self.din_columns_idx["target_item"] = len(self.din_columns_idx)

        if self.action_col:
            self._fit_action_label_encoder(df)
        if self.other_seq_embed_cols:
            self._fit_other_seq_label_encoders(df)

    def _fit_action_label_encoder(self, df: pd.DataFrame) -> None:
        self.action_le = LabelEncoder(columns_to_encode=[self.action_col])
        self.action_le.fit(df)
        self.n_actions = len(self.action_le.encoding_dict[self.action_col])
        self.action_seq_config = (
            [f"action_{i+1}" for i in range(self.max_seq_length)],
            self.n_actions,
        )
        self._update_din_columns_idx("action", self.max_seq_length)

    def _other_seq_cols_float_warning(self, df: pd.DataFrame) -> None:
        other_seq_cols = [
            col[0] if isinstance(col, tuple) else col
            for col in self.other_seq_embed_cols
        ]
        for col in other_seq_cols:
            if df[col].dtype == float:
                warnings.warn(
                    f"{col} is a float column. It will be converted to integers. "
                    "If this is not what you want, please convert it beforehand.",
                    UserWarning,
                )

    def _convert_other_seq_cols_to_int(self, df: pd.DataFrame) -> None:
        other_seq_cols = [
            col[0] if isinstance(col, tuple) else col
            for col in self.other_seq_embed_cols
        ]
        for col in other_seq_cols:
            if df[col].dtype == float:
                df[col] = df[col].astype(int)

    def _fit_other_seq_label_encoders(self, df: pd.DataFrame) -> None:
        self._other_seq_cols_float_warning(df)
        self._convert_other_seq_cols_to_int(df)
        self.other_seq_le = LabelEncoder(
            columns_to_encode=[
                col[0] if isinstance(col, tuple) else col
                for col in self.other_seq_embed_cols
            ]
        )
        self.other_seq_le.fit(df)
        other_seq_cols = [
            col[0] if isinstance(col, tuple) else col
            for col in self.other_seq_embed_cols
        ]
        self.n_other_seq_cols = {
            col: max(self.other_seq_le.encoding_dict[col].values())
            for col in other_seq_cols
        }
        other_seq_embed_sizes = {
            col: self._get_embed_size(col, self.n_other_seq_cols[col])
            for col in other_seq_cols
        }
        self.other_seq_config = [
            (
                self._get_seq_col_names(col),
                self.n_other_seq_cols[col],
                other_seq_embed_sizes[col],
            )
            for col in other_seq_cols
        ]
        self._update_din_columns_idx_for_other_seq(other_seq_cols)

    def _get_item_col(self) -> str:
        return (
            self.item_embed_col[0]
            if isinstance(self.item_embed_col, tuple)
            else self.item_embed_col
        )

    def _get_embed_size(self, col: Union[str, Tuple[str, int]], n_unique: int) -> int:
        return col[1] if isinstance(col, tuple) else embed_sz_rule(n_unique)

    def _get_seq_col_names(self, col: str) -> List[str]:
        return [f"{col}_{i+1}" for i in range(self.max_seq_length)]

    def _update_din_columns_idx(self, prefix: str, length: int) -> None:
        current_len = len(self.din_columns_idx)
        self.din_columns_idx.update(
            {f"{prefix}_{i+1}": i + current_len for i in range(length)}
        )

    def _update_din_columns_idx_for_other_seq(self, other_seq_cols: List[str]) -> None:
        current_len = len(self.din_columns_idx)
        for col in other_seq_cols:
            self.din_columns_idx.update(
                {f"{col}_{i+1}": i + current_len for i in range(self.max_seq_length)}
            )
            self.din_columns_idx[f"target_{col}"] = len(self.din_columns_idx)
            current_len = len(self.din_columns_idx)

    def _pre_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        check_is_fitted(self, attributes=["din_columns_idx"])
        df = self.item_le.transform(df)
        if self.action_col:
            df = self.action_le.transform(df)
        if self.other_seq_embed_cols:
            df = self.other_seq_le.transform(df)
        return df

    def _build_sequences(self, df: pd.DataFrame) -> pd.DataFrame:
        # df.columns.tolist() -> to avoid annoying pandas warning
        return (
            df.groupby(self.user_id_col)[df.columns.tolist()]  # type: ignore[index]
            .apply(self._group_sequences)
            .reset_index(drop=True)
        )

    def _group_sequences(self, group: pd.DataFrame) -> pd.DataFrame:
        item_col = self._get_item_col()
        items = group[item_col].tolist()
        targets = group[self.target_col].tolist()
        drop_cols = [item_col, self.target_col]

        # sequences cannot be built with user, item pairs with only one
        # interaction
        if len(items) <= 1:
            return pd.DataFrame()

        sequences = [
            self._create_sequence(group, items, targets, i, drop_cols)
            for i in range(max(1, len(items) - self.max_seq_length))
        ]

        seq_df = pd.DataFrame(sequences)
        non_seq_cols = group.drop_duplicates([self.user_id_col]).drop(drop_cols, axis=1)
        return pd.merge(seq_df, non_seq_cols, on=self.user_id_col)

    def _create_sequence(
        self,
        group: pd.DataFrame,
        items: List[int],
        targets: List[int],
        i: int,
        drop_cols: List[str],
    ) -> Dict[str, Union[str, List[int]]]:
        end_idx = min(i + self.max_seq_length, len(items) - 1)
        item_sequences = items[i:end_idx]
        target_item = items[end_idx]
        target = targets[end_idx]

        sequence = {
            self.user_id_col: group.name,
            "items_sequence": item_sequences,
            "target_item": target_item,
        }

        if self.action_col:
            sequence.update(
                self._create_action_sequence(group, targets, i, target, drop_cols)
            )
        else:
            sequence[self.target_col] = target
            drop_cols.append(self.target_col)

        if self.other_seq_embed_cols:
            sequence.update(self._create_other_seq_sequence(group, i, drop_cols))

        return sequence

    def _create_action_sequence(
        self,
        group: pd.DataFrame,
        targets: List[int],
        i: int,
        target: int,
        drop_cols: List[str],
    ) -> Dict[str, Union[int, List[int]]]:
        if self.action_col != self.target_col:
            actions = group[self.action_col].tolist()
            action_sequences = (
                actions[i : i + self.max_seq_length]
                if i + self.max_seq_length < len(actions)
                else actions[i:-1]
            )
            drop_cols.append(self.action_col)
        else:
            target -= 1  # the 'transform' method adds 1 as it saves 0 for padding
            action_sequences = (
                targets[i : i + self.max_seq_length]
                if i + self.max_seq_length < len(targets)
                else targets[i:-1]
            )

        return {self.target_col: target, "actions_sequence": action_sequences}

    def _create_other_seq_sequence(
        self, group: pd.DataFrame, i: int, drop_cols: List[str]
    ) -> Dict[str, Union[int, List[int]]]:
        other_seq_cols = [
            col[0] if isinstance(col, tuple) else col
            for col in self.other_seq_embed_cols
        ]
        drop_cols += other_seq_cols
        other_seqs = {col: group[col].tolist() for col in other_seq_cols}

        other_seqs_sequences: Dict[str, Union[int, List[int]]] = {}
        for col in other_seq_cols:
            other_seqs_sequences[col] = (
                other_seqs[col][i : i + self.max_seq_length]
                if i + self.max_seq_length < len(other_seqs[col])
                else other_seqs[col][i:-1]
            )

        sequence: Dict[str, Union[int, List[int]]] = {}
        for col in other_seq_cols:
            sequence[f"{col}_sequence"] = other_seqs_sequences[col]
            sequence[f"target_{col}"] = (
                other_seqs[col][i + self.max_seq_length]
                if i + self.max_seq_length < len(other_seqs[col])
                else other_seqs[col][-1]
            )

        return sequence

    def _concatenate_features(self, df_w_sequences: pd.DataFrame) -> np.ndarray:
        user_behaviour_seq = df_w_sequences["items_sequence"].tolist()
        X_user_behaviour = np.vstack(
            [
                pad_sequences(seq, self.max_seq_length, pad_idx=0)
                for seq in user_behaviour_seq
            ]
        )
        X_target_item = np.array(df_w_sequences["target_item"].tolist()).reshape(-1, 1)
        X_all = np.concatenate([X_user_behaviour, X_target_item], axis=1)

        if self.has_standard_tab_data:
            X_tab = self.tab_preprocessor.transform(df_w_sequences)
            X_all = np.concatenate([X_tab, X_all], axis=1)

        if self.action_col:
            action_seq = df_w_sequences["actions_sequence"].tolist()
            X_actions = np.vstack(
                [
                    pad_sequences(seq, self.max_seq_length, pad_idx=0)
                    for seq in action_seq
                ]
            )
            X_all = np.concatenate([X_all, X_actions], axis=1)

        if self.other_seq_embed_cols:
            X_all = self._concatenate_other_seq_features(df_w_sequences, X_all)

        assert len(self.din_columns_idx) == X_all.shape[1], (
            f"Something went wrong. The number of columns in the final array "
            f"({X_all.shape[1]}) is different from the number of columns in "
            f"self.din_columns_idx ({len(self.din_columns_idx)})"
        )

        return X_all

    def _concatenate_other_seq_features(
        self, df_w_sequences: pd.DataFrame, X_all: np.ndarray
    ) -> np.ndarray:
        other_seq_cols = [
            col[0] if isinstance(col, tuple) else col
            for col in self.other_seq_embed_cols
        ]
        other_seq = {
            col: df_w_sequences[f"{col}_sequence"].tolist() for col in other_seq_cols
        }
        other_seq_target = {
            col: df_w_sequences[f"target_{col}"].tolist() for col in other_seq_cols
        }
        X_other_seq_arrays = []

        for col in other_seq_cols:
            X_other_seq_arrays.append(
                np.vstack(
                    [
                        pad_sequences(s, self.max_seq_length, pad_idx=0)
                        for s in other_seq[col]
                    ]
                )
            )
            X_other_seq_arrays.append(np.array(other_seq_target[col]).reshape(-1, 1))

        return np.concatenate([X_all] + X_other_seq_arrays, axis=1)
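The sequence construction above slides a window over each user's chronologically ordered interactions: every window of up to `max_seq_length` items predicts the next ("target") item, and users with a single interaction are dropped. A minimal, library-independent sketch of that windowing logic (`build_windows` is a hypothetical helper, not part of the library's API):

```python
# Hypothetical helper mirroring the per-user windowing in _create_sequence:
# each window of up to `max_seq_length` past items predicts the next item.
from typing import List, Tuple


def build_windows(items: List[int], max_seq_length: int) -> List[Tuple[List[int], int]]:
    """Return (sequence, target_item) pairs for one user's interaction history."""
    if len(items) <= 1:  # a single interaction cannot form a sequence
        return []
    windows = []
    for i in range(max(1, len(items) - max_seq_length)):
        # the item right after the window is the prediction target
        end_idx = min(i + max_seq_length, len(items) - 1)
        windows.append((items[i:end_idx], items[end_idx]))
    return windows


print(build_windows([101, 102, 103], max_seq_length=2))  # [([101, 102], 103)]
```

With `max_seq_length=2` and four interactions `[1, 2, 3, 4]`, this yields two training rows: `([1, 2], 3)` and `([2, 3], 4)`.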

Chunked versions

Chunked versions of the preprocessors are also available. These are useful when the data is too big to fit in memory. See also the load_from_folder module in the library and the corresponding section here in the documentation.

Note that there is no ChunkImagePreprocessor. This is because the processing of the images occurs inside the ImageFromFolder class in the load_from_folder module.
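The chunked preprocessors all follow the same pattern: state (for example, the set of unique category values) is accumulated chunk by chunk via `partial_fit`, and the encoding is only finalised once the declared number of chunks has been seen. A pandas-only sketch of that accumulation, independent of the library's classes:

```python
# Sketch of the chunked-fitting pattern (not the library's internal code):
# accumulate the vocabulary across chunks, finalise after the last chunk.
import pandas as pd

df = pd.DataFrame({"color": ["r", "b", "g", "r", "b", "g"]})
chunks = [df.iloc[:3], df.iloc[3:]]  # stand-in for pd.read_csv(..., chunksize=...)

vocab: set = set()
for chunk in chunks:  # one partial_fit call per chunk
    vocab.update(chunk["color"].unique())

# finalise after the last chunk; index 0 is reserved for unseen values
encoding_dict = {v: i + 1 for i, v in enumerate(sorted(vocab))}
print(encoding_dict)  # {'b': 1, 'g': 2, 'r': 3}
```

The real classes wrap exactly this loop: `n_chunks` tells `partial_fit` when the last chunk has arrived so the encoding dictionary can be built.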

ChunkWidePreprocessor

Bases: WidePreprocessor

Preprocessor to prepare the wide input dataset

This Preprocessor prepares the data for the wide, linear component. This linear model is implemented via an Embedding layer that is connected to the output neuron. ChunkWidePreprocessor numerically encodes all the unique values of all categorical columns wide_cols + crossed_cols. See the Example below.

Parameters:

Name Type Description Default
wide_cols List[str]

List of strings with the name of the columns that will be label encoded and passed through the wide component

required
crossed_cols Optional[List[Tuple[str, str]]]

List of Tuples with the name of the columns that will be 'crossed' and then label encoded. e.g. [('education', 'occupation'), ...]. For binary features, a cross-product transformation is 1 if and only if the constituent features are all 1, and 0 otherwise.

None

Attributes:

Name Type Description
wide_crossed_cols List

List with the names of all columns that will be label encoded

encoding_dict Dict

Dictionary where the keys are the result of pasting colname + '_' + column value and the values are the corresponding mapped integer.

inverse_encoding_dict Dict

the inverse encoding dictionary

wide_dim int

Dimension of the wide model (i.e. dim of the linear layer)

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import ChunkWidePreprocessor
>>> chunk = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
>>> wide_cols = ['color']
>>> crossed_cols = [('color', 'size')]
>>> chunk_wide_preprocessor = ChunkWidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols,
... n_chunks=1)
>>> X_wide = chunk_wide_preprocessor.fit_transform(chunk)
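The crossed columns mentioned above are built by pasting the constituent column values together into a single categorical feature before label encoding. A small pandas sketch of the idea (the `-` separator is an assumption for illustration, not necessarily the library's internal choice):

```python
# Illustrative cross of two categorical columns into one feature
# prior to label encoding (assumed separator: "-").
import pandas as pd

df = pd.DataFrame({"color": ["r", "b", "g"], "size": ["s", "n", "l"]})
crossed = ("color", "size")
col_name = "_".join(crossed)  # -> "color_size"
df[col_name] = df[crossed[0]].astype(str) + "-" + df[crossed[1]].astype(str)
print(df[col_name].tolist())  # ['r-s', 'b-n', 'g-l']
```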
Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
class ChunkWidePreprocessor(WidePreprocessor):
    r"""Preprocessor to prepare the wide input dataset

    This Preprocessor prepares the data for the wide, linear component.
    This linear model is implemented via an Embedding layer that is
    connected to the output neuron. `ChunkWidePreprocessor` numerically
    encodes all the unique values of all categorical columns `wide_cols +
    crossed_cols`. See the Example below.

    Parameters
    ----------
    wide_cols: List
        List of strings with the name of the columns that will be label
        encoded and passed through the `wide` component
    crossed_cols: List, default = None
        List of Tuples with the name of the columns that will be `'crossed'`
        and then label encoded. e.g. _[('education', 'occupation'), ...]_. For
        binary features, a cross-product transformation is 1 if and only if
        the constituent features are all 1, and 0 otherwise.

    Attributes
    ----------
    wide_crossed_cols: List
        List with the names of all columns that will be label encoded
    encoding_dict: Dict
        Dictionary where the keys are the result of pasting `colname + '_' +
        column value` and the values are the corresponding mapped integer.
    inverse_encoding_dict: Dict
        the inverse encoding dictionary
    wide_dim: int
        Dimension of the wide model (i.e. dim of the linear layer)

    Examples
    --------
    >>> import pandas as pd
    >>> from pytorch_widedeep.preprocessing import ChunkWidePreprocessor
    >>> chunk = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
    >>> wide_cols = ['color']
    >>> crossed_cols = [('color', 'size')]
    >>> chunk_wide_preprocessor = ChunkWidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols,
    ... n_chunks=1)
    >>> X_wide = chunk_wide_preprocessor.fit_transform(chunk)
    """

    def __init__(
        self,
        wide_cols: List[str],
        n_chunks: int,
        crossed_cols: Optional[List[Tuple[str, str]]] = None,
    ):
        super(ChunkWidePreprocessor, self).__init__(wide_cols, crossed_cols)

        self.n_chunks = n_chunks

        self.chunk_counter = 0

        self.is_fitted = False

    def partial_fit(self, chunk: pd.DataFrame) -> "ChunkWidePreprocessor":
        r"""Fits the Preprocessor and creates required attributes

        Parameters
        ----------
        chunk: pd.DataFrame
            Input pandas dataframe

        Returns
        -------
        ChunkWidePreprocessor
            `ChunkWidePreprocessor` fitted object
        """
        df_wide = self._prepare_wide(chunk)
        self.wide_crossed_cols = df_wide.columns.tolist()

        if self.chunk_counter == 0:
            self.glob_feature_set = set(
                self._make_global_feature_list(df_wide[self.wide_crossed_cols])
            )
        else:
            self.glob_feature_set.update(
                self._make_global_feature_list(df_wide[self.wide_crossed_cols])
            )

        self.chunk_counter += 1

        if self.chunk_counter == self.n_chunks:
            self.encoding_dict = {v: i + 1 for i, v in enumerate(self.glob_feature_set)}
            self.wide_dim = len(self.encoding_dict)
            self.inverse_encoding_dict = {k: v for v, k in self.encoding_dict.items()}
            self.inverse_encoding_dict[0] = "unseen"

            self.is_fitted = True

        return self

    def fit(self, df: pd.DataFrame) -> "ChunkWidePreprocessor":
        """
        Runs `partial_fit`. This simply overrides the `fit` method in the
        base class, since this class is not designed to be fitted in one go.
        """
        return self.partial_fit(df)

    def __repr__(self) -> str:
        list_of_params: List[str] = ["wide_cols={wide_cols}"]
        list_of_params.append("n_chunks={n_chunks}")
        if self.crossed_cols is not None:
            list_of_params.append("crossed_cols={crossed_cols}")
        all_params = ", ".join(list_of_params)
        return f"ChunkWidePreprocessor({all_params.format(**self.__dict__)})"

partial_fit

partial_fit(chunk)

Fits the Preprocessor and creates required attributes

Parameters:

Name Type Description Default
chunk DataFrame

Input pandas dataframe

required

Returns:

Type Description
ChunkWidePreprocessor

ChunkWidePreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def partial_fit(self, chunk: pd.DataFrame) -> "ChunkWidePreprocessor":
    r"""Fits the Preprocessor and creates required attributes

    Parameters
    ----------
    chunk: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    ChunkWidePreprocessor
        `ChunkWidePreprocessor` fitted object
    """
    df_wide = self._prepare_wide(chunk)
    self.wide_crossed_cols = df_wide.columns.tolist()

    if self.chunk_counter == 0:
        self.glob_feature_set = set(
            self._make_global_feature_list(df_wide[self.wide_crossed_cols])
        )
    else:
        self.glob_feature_set.update(
            self._make_global_feature_list(df_wide[self.wide_crossed_cols])
        )

    self.chunk_counter += 1

    if self.chunk_counter == self.n_chunks:
        self.encoding_dict = {v: i + 1 for i, v in enumerate(self.glob_feature_set)}
        self.wide_dim = len(self.encoding_dict)
        self.inverse_encoding_dict = {k: v for v, k in self.encoding_dict.items()}
        self.inverse_encoding_dict[0] = "unseen"

        self.is_fitted = True

    return self

fit

fit(df)

Runs partial_fit. This simply overrides the fit method in the base class, since this class is not designed to be fitted in one go.

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def fit(self, df: pd.DataFrame) -> "ChunkWidePreprocessor":
    """
    Runs `partial_fit`. This simply overrides the `fit` method in the
    base class, since this class is not designed to be fitted in one go.
    """
    return self.partial_fit(df)
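Note that `partial_fit` assigns integer codes starting at 1 and reserves 0 for values never seen during fitting (`inverse_encoding_dict[0] = "unseen"`). A minimal sketch of the resulting lookup behaviour at transform time (`encode` is a hypothetical helper, not library API):

```python
# Unseen category values fall back to code 0, which is never assigned
# to a value seen during fitting (hypothetical lookup helper).
encoding_dict = {"color_r": 1, "color_b": 2, "color_g": 3}


def encode(value: str) -> int:
    return encoding_dict.get(value, 0)  # 0 == "unseen"


print([encode(v) for v in ["color_r", "color_y"]])  # [1, 0]
```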

ChunkTabPreprocessor

Bases: TabPreprocessor

Preprocessor to prepare the deeptabular component input dataset

Parameters:

Name Type Description Default
n_chunks int

Number of chunks that the tabular dataset is divided into.

required
cat_embed_cols Optional[Union[List[str], List[Tuple[str, int]]]]

List containing the name of the categorical columns that will be represented by embeddings (e.g. ['education', 'relationship', ...]) or a Tuple with the name and the embedding dimension (e.g.: [ ('education',32), ('relationship',16), ...]).
Param alias: embed_cols

None
continuous_cols Optional[List[str]]

List with the name of the continuous cols

None
cols_and_bins Optional[Dict[str, List[float]]]

Continuous columns can be turned into categorical via pd.cut. 'cols_and_bins' is a dictionary where the keys are the column names to quantize and the values are a list of scalars indicating the bin edges.
Param alias: quantization_setup

None
cols_to_scale Optional[Union[List[str], str]]

List with the names of the columns that will be standardised via sklearn's StandardScaler

None
default_embed_dim int

Dimension for the embeddings if the embed_dim is not provided in the cat_embed_cols parameter and auto_embed_dim is set to False.

16
with_attention bool

Boolean indicating whether the preprocessed data will be passed to an attention-based model (more precisely, a model where all embeddings must have the same dimensions). If True, the param cat_embed_cols must be a list containing just the categorical column names: e.g. ['education', 'relationship', ...]. This is because they will all be encoded using embeddings of the same dim, which will be specified later when the model is defined.
Param alias: for_transformer

False
with_cls_token bool

Boolean indicating if a '[CLS]' token will be added to the dataset when using attention-based models. The final hidden state corresponding to this token is used as the aggregated representation for classification and regression tasks. If not, the categorical (and continuous embeddings if present) will be concatenated before being passed to the final MLP (if present).

False
shared_embed bool

Boolean indicating if the embeddings will be "shared" when using attention-based models. The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

False
verbose int

Verbosity level.

1
scale bool

ℹ️ note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
Bool indicating whether or not to scale/standardise continuous cols. It is important to emphasize that all the DL models for tabular data in the library also include the possibility of normalising the input continuous features via a BatchNorm or a LayerNorm.
Param alias: scale_cont_cols.

False
already_standard Optional[List[str]]

ℹ️ note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
List with the name of the continuous cols that do not need to be scaled/standardised.

None

Other Parameters:

Name Type Description
**kwargs

pd.cut and StandardScaler related args

Attributes:

Name Type Description
cat_cols List[str]

List containing the names of the categorical columns after processing.

cols_and_bins Optional[Dict[str, Union[int, List[float]]]]

Dictionary containing the quantization setup after processing if quantization is requested. This is derived from the quantization_setup parameter.

quant_args Dict

Dictionary containing arguments passed to pandas.cut() function for quantization.

scale_args Dict

Dictionary containing arguments passed to StandardScaler.

label_encoder LabelEncoder

Instance of LabelEncoder used to encode categorical variables. See pytorch_widedeep.utils.dense_utils.LabelEncoder.

cat_embed_input List

List of tuples with the column name, number of individual values for that column and, if with_attention is set to False, the corresponding embeddings dim, e.g. [('education', 16, 10), ('relationship', 6, 8), ...].

standardize_cols List

List of the columns that will be standardized.

scaler StandardScaler

An instance of sklearn.preprocessing.StandardScaler if standardization is requested.

column_idx Dict

Dictionary where keys are column names and values are column indexes. This is necessary to slice tensors.

quantizer Quantizer

An instance of Quantizer if quantization is requested.

is_fitted bool

Boolean indicating if the preprocessor has been fitted.

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from pytorch_widedeep.preprocessing import ChunkTabPreprocessor
>>> np.random.seed(42)
>>> chunk_df = pd.DataFrame({'cat_col': np.random.choice(['A', 'B', 'C'], size=8),
... 'cont_col': np.random.uniform(1, 100, size=8)})
>>> cat_embed_cols = [('cat_col',4)]
>>> cont_cols = ['cont_col']
>>> tab_preprocessor = ChunkTabPreprocessor(
... n_chunks=1, cat_embed_cols=cat_embed_cols, continuous_cols=cont_cols
... )
>>> X_tab = tab_preprocessor.fit_transform(chunk_df)
>>> tab_preprocessor.cat_embed_cols
[('cat_col', 4)]
>>> tab_preprocessor.column_idx
{'cat_col': 0, 'cont_col': 1}
Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py
class ChunkTabPreprocessor(TabPreprocessor):
    r"""Preprocessor to prepare the `deeptabular` component input dataset

    Parameters
    ----------
    n_chunks: int
        Number of chunks that the tabular dataset is divided by.
    cat_embed_cols: List, default = None
        List containing the name of the categorical columns that will be
        represented by embeddings (e.g. _['education', 'relationship', ...]_) or
        a Tuple with the name and the embedding dimension (e.g.: _[
        ('education',32), ('relationship',16), ...]_). <br/> Param alias:
        `embed_cols`
    continuous_cols: List, default = None
        List with the name of the continuous cols
    cols_and_bins: Dict, default = None
        Continuous columns can be turned into categorical via
        `pd.cut`. 'cols_and_bins' is dictionary where the keys are the column
        names to quantize and the values are a list of scalars indicating the
        bin edges. <br/> Param alias: `quantization_setup`
    cols_to_scale: List, default = None
        List with the names of the columns that will be standardised via
        sklearn's `StandardScaler`
    default_embed_dim: int, default=16
        Dimension for the embeddings if the embed_dim is not provided in the
        `cat_embed_cols` parameter and `auto_embed_dim` is set to
        `False`.
    with_attention: bool, default = False
        Boolean indicating whether the preprocessed data will be passed to an
        attention-based model (more precisely a model where all embeddings
        must have the same dimensions). If `True`, the param `cat_embed_cols`
        must be a list containing just the categorical column names: e.g.
        _['education', 'relationship', ...]_. This is because they will all
        be encoded using embeddings of the same dim, which will be
        specified later when the model is defined. <br/> Param alias:
        `for_transformer`
    with_cls_token: bool, default = False
        Boolean indicating if a `'[CLS]'` token will be added to the dataset
        when using attention-based models. The final hidden state
        corresponding to this token is used as the aggregated representation
        for classification and regression tasks. If not, the categorical
        (and continuous embeddings if present) will be concatenated before
        being passed to the final MLP (if present).
    shared_embed: bool, default = False
        Boolean indicating if the embeddings will be "shared" when using
        attention-based models. The idea behind `shared_embed` is
        described in the Appendix A in the [TabTransformer paper](https://arxiv.org/abs/2012.06678):
        _'The goal of having column embedding is to enable the model to
        distinguish the classes in one column from those in the other
        columns'_. In other words, the idea is to let the model learn which
        column is embedded at the time. See: `pytorch_widedeep.models.transformers._layers.SharedEmbeddings`.
    verbose: int, default = 1
    scale: bool, default = False
        :information_source: **note**: this arg will be removed in upcoming
        releases. Please use `cols_to_scale` instead. <br/> Bool indicating
        whether or not to scale/standardise continuous cols. It is important
        to emphasize that all the DL models for tabular data in the library
        also include the possibility of normalising the input continuous
        features via a `BatchNorm` or a `LayerNorm`. <br/> Param alias:
        `scale_cont_cols`.
    already_standard: List, default = None
        :information_source: **note**: this arg will be removed in upcoming
        releases. Please use `cols_to_scale` instead. <br/> List with the
        name of the continuous cols that do not need to be
        scaled/standardised.

    Other Parameters
    ----------------
    **kwargs: dict
        `pd.cut` and `StandardScaler` related args

    Attributes
    ----------
    cat_cols: List[str]
        List containing the names of the categorical columns after processing.
    cols_and_bins: Optional[Dict[str, Union[int, List[float]]]]
        Dictionary containing the quantization setup after processing if
        quantization is requested. This is derived from the
        quantization_setup parameter.
    quant_args: Dict
        Dictionary containing arguments passed to pandas.cut() function for quantization.
    scale_args: Dict
        Dictionary containing arguments passed to StandardScaler.
    label_encoder: LabelEncoder
        Instance of LabelEncoder used to encode categorical variables.
        See `pytorch_widedeep.utils.dense_utils.LabelEncoder`.
    cat_embed_input: List
        List of tuples with the column name, number of individual values for
        that column and, if `with_attention` is set to `False`, the
        corresponding embeddings dim, e.g. _[('education', 16, 10),
        ('relationship', 6, 8), ...]_.
    standardize_cols: List
        List of the columns that will be standardized.
    scaler: StandardScaler
        An instance of `sklearn.preprocessing.StandardScaler` if standardization
        is requested.
    column_idx: Dict
        Dictionary where keys are column names and values are column indexes.
        This is necessary to slice tensors.
    quantizer: Quantizer
        An instance of `Quantizer` if quantization is requested.
    is_fitted: bool
        Boolean indicating if the preprocessor has been fitted.

    Examples
    --------
    >>> import pandas as pd
    >>> import numpy as np
    >>> from pytorch_widedeep.preprocessing import ChunkTabPreprocessor
    >>> np.random.seed(42)
    >>> chunk_df = pd.DataFrame({'cat_col': np.random.choice(['A', 'B', 'C'], size=8),
    ... 'cont_col': np.random.uniform(1, 100, size=8)})
    >>> cat_embed_cols = [('cat_col',4)]
    >>> cont_cols = ['cont_col']
    >>> tab_preprocessor = ChunkTabPreprocessor(
    ... n_chunks=1, cat_embed_cols=cat_embed_cols, continuous_cols=cont_cols
    ... )
    >>> X_tab = tab_preprocessor.fit_transform(chunk_df)
    >>> tab_preprocessor.cat_embed_cols
    [('cat_col', 4)]
    >>> tab_preprocessor.column_idx
    {'cat_col': 0, 'cont_col': 1}
    """

    @alias("with_attention", ["for_transformer"])
    @alias("cat_embed_cols", ["embed_cols"])
    @alias("scale", ["scale_cont_cols"])
    @alias("cols_and_bins", ["quantization_setup"])
    def __init__(
        self,
        n_chunks: int,
        cat_embed_cols: Optional[Union[List[str], List[Tuple[str, int]]]] = None,
        continuous_cols: Optional[List[str]] = None,
        cols_and_bins: Optional[Dict[str, List[float]]] = None,
        cols_to_scale: Optional[Union[List[str], str]] = None,
        default_embed_dim: int = 16,
        with_attention: bool = False,
        with_cls_token: bool = False,
        shared_embed: bool = False,
        verbose: int = 1,
        *,
        scale: bool = False,
        already_standard: Optional[List[str]] = None,
        **kwargs,
    ):
        super(ChunkTabPreprocessor, self).__init__(
            cat_embed_cols=cat_embed_cols,
            continuous_cols=continuous_cols,
            quantization_setup=None,
            cols_to_scale=cols_to_scale,
            auto_embed_dim=False,
            embedding_rule="google",  # does not matter, irrelevant
            default_embed_dim=default_embed_dim,
            with_attention=with_attention,
            with_cls_token=with_cls_token,
            shared_embed=shared_embed,
            verbose=verbose,
            scale=scale,
            already_standard=already_standard,
            **kwargs,
        )

        self.n_chunks = n_chunks
        self.chunk_counter = 0

        self.cols_and_bins = cols_and_bins  # type: ignore[assignment]
        if self.cols_and_bins is not None:
            self.quantizer = Quantizer(self.cols_and_bins, **self.quant_args)

        self.embed_prepared = False
        self.continuous_prepared = False

    def partial_fit(self, df: pd.DataFrame) -> "ChunkTabPreprocessor":  # noqa: C901
        # df here, and throughout the class, is a chunk of the original df
        self.chunk_counter += 1

        df_adj = self._insert_cls_token(df) if self.with_cls_token else df.copy()

        self.column_idx: Dict[str, int] = {}

        # Categorical embeddings logic
        if self.cat_embed_cols is not None:
            if not self.embed_prepared:
                df_cat, self.cat_embed_dim = self._prepare_categorical(df_adj)
                self.label_encoder = LabelEncoder(
                    columns_to_encode=df_cat.columns.tolist(),
                    shared_embed=self.shared_embed,
                    with_attention=self.with_attention,
                )
                self.label_encoder.partial_fit(df_cat)
            else:
                df_cat = df_adj[self.cat_cols]
                self.label_encoder.partial_fit(df_cat)

            self.column_idx.update({k: v for v, k in enumerate(df_cat.columns)})

        # Continuous columns logic
        if self.continuous_cols is not None:
            if not self.continuous_prepared:
                df_cont, self.cont_embed_dim = self._prepare_continuous(df_adj)
            else:
                df_cont = df[self.continuous_cols]

            self.column_idx.update(
                {k: v + len(self.column_idx) for v, k in enumerate(df_cont.columns)}
            )

            if self.standardize_cols is not None:
                self.scaler.partial_fit(df_cont[self.standardize_cols].values)

        if self.chunk_counter == self.n_chunks:
            if self.cat_embed_cols is not None or self.cols_and_bins is not None:
                self.cat_embed_input: List[
                    Union[Tuple[str, int], Tuple[str, int, int]]
                ] = []

            if self.cat_embed_cols is not None:
                for k, v in self.label_encoder.encoding_dict.items():
                    if self.with_attention:
                        self.cat_embed_input.append((k, len(v)))
                    else:
                        self.cat_embed_input.append((k, len(v), self.cat_embed_dim[k]))

            if self.cols_and_bins is not None:
                assert self.cont_embed_dim is not None  # just to make mypy happy
                if self.with_attention:
                    for col, n_cat, _ in self.cont_embed_dim:
                        self.cat_embed_input.append((col, n_cat))
                else:
                    self.cat_embed_input.extend(self.cont_embed_dim)

            self.is_fitted = True

        return self

    def fit(self, df: pd.DataFrame) -> "ChunkTabPreprocessor":
        # just to override the fit method in the base class. This class is not
        # designed or thought to run fit
        return self.partial_fit(df)

    def _prepare_categorical(
        self, df: pd.DataFrame
    ) -> Tuple[pd.DataFrame, Dict[str, int]]:
        # When dealing with chunks we will NOT support the option of
        # automatically defining embeddings, as this implies going through
        # the entire dataset
        if isinstance(self.cat_embed_cols[0], tuple):
            self.cat_cols: List[str] = [emb[0] for emb in self.cat_embed_cols]
            cat_embed_dim: Dict[str, int] = dict(self.cat_embed_cols)  # type: ignore
        else:
            self.cat_cols = self.cat_embed_cols  # type: ignore[assignment]
            cat_embed_dim = {
                e: self.default_embed_dim for e in self.cat_embed_cols  # type: ignore[misc]
            }  # type: ignore

        self.embed_prepared = True

        return df[self.cat_cols], cat_embed_dim

    def _prepare_continuous(
        self, df: pd.DataFrame
    ) -> Tuple[pd.DataFrame, Optional[List[Tuple[str, int, int]]]]:
        # Standardization logic
        if self.cols_to_scale is not None:
            self.standardize_cols = (
                self.cols_to_scale
                if self.cols_to_scale != "all"
                else self.continuous_cols
            )
        elif self.scale:
            if self.already_standard is not None:
                self.standardize_cols = [
                    c
                    for c in self.continuous_cols  # type: ignore[misc]
                    if c not in self.already_standard
                ]
            else:
                self.standardize_cols = self.continuous_cols
        else:
            self.standardize_cols = None

        if self.standardize_cols is not None:
            self.scaler = StandardScaler(**self.scale_args)
        elif self.verbose:
            warnings.warn("Continuous columns will not be normalised")

        # Quantization logic
        if self.cols_and_bins is not None:
            # the quantized columns are then treated as categorical
            quant_cont_embed_input: Optional[List[Tuple[str, int, int]]] = []
            for col, val in self.cols_and_bins.items():
                if isinstance(val, int):
                    quant_cont_embed_input.append(
                        (
                            col,
                            val,
                            self.default_embed_dim,
                        )
                    )
                else:
                    quant_cont_embed_input.append(
                        (
                            col,
                            len(val) - 1,
                            self.default_embed_dim,
                        )
                    )
        else:
            quant_cont_embed_input = None

        self.continuous_prepared = True

        return df[self.continuous_cols], quant_cont_embed_input

    def __repr__(self) -> str:  # noqa: C901
        list_of_params: List[str] = []
        if self.n_chunks is not None:
            list_of_params.append("n_chunks={n_chunks}")
        if self.cat_embed_cols is not None:
            list_of_params.append("cat_embed_cols={cat_embed_cols}")
        if self.continuous_cols is not None:
            list_of_params.append("continuous_cols={continuous_cols}")
        if self.cols_and_bins is not None:
            list_of_params.append("cols_and_bins={cols_and_bins}")
        if self.cols_to_scale is not None:
            list_of_params.append("cols_to_scale={cols_to_scale}")
        if self.default_embed_dim != 16:
            list_of_params.append("default_embed_dim={default_embed_dim}")
        if self.with_attention:
            list_of_params.append("with_attention={with_attention}")
        if self.with_cls_token:
            list_of_params.append("with_cls_token={with_cls_token}")
        if self.shared_embed:
            list_of_params.append("shared_embed={shared_embed}")
        if self.verbose != 1:
            list_of_params.append("verbose={verbose}")
        if self.scale:
            list_of_params.append("scale={scale}")
        if self.already_standard is not None:
            list_of_params.append("already_standard={already_standard}")
        if len(self.quant_args) > 0:
            list_of_params.append(
                ", ".join([f"{k}" + "=" + f"{v}" for k, v in self.quant_args.items()])
            )
        if len(self.scale_args) > 0:
            list_of_params.append(
                ", ".join([f"{k}" + "=" + f"{v}" for k, v in self.scale_args.items()])
            )
        all_params = ", ".join(list_of_params)
        return f"ChunkTabPreprocessor({all_params.format(**self.__dict__)})"

ChunkTextPreprocessor

Bases: TextPreprocessor

Preprocessor to prepare the deeptext input dataset

Parameters:

Name Type Description Default
text_col str

column in the input dataframe containing either the texts or the filenames where the text documents are stored

required
n_chunks int

Number of chunks that the text dataset is divided by.

required
root_dir Optional[str]

If 'text_col' contains the filenames with the text documents, this is the path to the directory where those documents are stored.

None
max_vocab int

Maximum number of tokens in the vocabulary

30000
min_freq int

Minimum frequency for a token to be part of the vocabulary

5
maxlen int

Maximum length of the tokenized sequences

80
pad_first bool

Indicates whether the padding index will be added at the beginning or the end of the sequences

True
pad_idx int

padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token.

1
word_vectors_path Optional[str]

Path to the pretrained word vectors

None
n_cpus Optional[int]

Number of CPUs to use during the tokenization process

None
verbose int

Enable verbose output.

1

Attributes:

Name Type Description
vocab Vocab

an instance of pytorch_widedeep.utils.fastai_transforms.ChunkVocab

embedding_matrix ndarray

Array with the pretrained embeddings if word_vectors_path is not None

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import ChunkTextPreprocessor
>>> chunk_df = pd.DataFrame({'text_column': ["life is like a box of chocolates",
... "You never know what you're gonna get"]})
>>> chunk_text_preprocessor = ChunkTextPreprocessor(text_col='text_column', n_chunks=1,
... max_vocab=25, min_freq=1, maxlen=10, verbose=0, n_cpus=1)
>>> processed_chunk = chunk_text_preprocessor.fit_transform(chunk_df)
Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
class ChunkTextPreprocessor(TextPreprocessor):
    r"""Preprocessor to prepare the ``deeptext`` input dataset

    Parameters
    ----------
    text_col: str
        column in the input dataframe containing either the texts or the
        filenames where the text documents are stored
    n_chunks: int
        Number of chunks that the text dataset is divided by.
    root_dir: str, Optional, default = None
        If 'text_col' contains the filenames with the text documents, this is
        the path to the directory where those documents are stored.
    max_vocab: int, default=30000
        Maximum number of tokens in the vocabulary
    min_freq: int, default=5
        Minimum frequency for a token to be part of the vocabulary
    maxlen: int, default=80
        Maximum length of the tokenized sequences
    pad_first: bool,  default = True
        Indicates whether the padding index will be added at the beginning or the
        end of the sequences
    pad_idx: int, default = 1
        padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token.
    word_vectors_path: str, Optional
        Path to the pretrained word vectors
    n_cpus: int, Optional, default = None
        Number of CPUs to use during the tokenization process
    verbose: int, default 1
        Enable verbose output.

    Attributes
    ----------
    vocab: Vocab
        an instance of `pytorch_widedeep.utils.fastai_transforms.ChunkVocab`
    embedding_matrix: np.ndarray
        Array with the pretrained embeddings if `word_vectors_path` is not None

    Examples
    ---------
    >>> import pandas as pd
    >>> from pytorch_widedeep.preprocessing import ChunkTextPreprocessor
    >>> chunk_df = pd.DataFrame({'text_column': ["life is like a box of chocolates",
    ... "You never know what you're gonna get"]})
    >>> chunk_text_preprocessor = ChunkTextPreprocessor(text_col='text_column', n_chunks=1,
    ... max_vocab=25, min_freq=1, maxlen=10, verbose=0, n_cpus=1)
    >>> processed_chunk = chunk_text_preprocessor.fit_transform(chunk_df)
    """

    def __init__(
        self,
        text_col: str,
        n_chunks: int,
        root_dir: Optional[str] = None,
        max_vocab: int = 30000,
        min_freq: int = 5,
        maxlen: int = 80,
        pad_first: bool = True,
        pad_idx: int = 1,
        already_processed: Optional[bool] = False,
        word_vectors_path: Optional[str] = None,
        n_cpus: Optional[int] = None,
        verbose: int = 1,
    ):
        super(ChunkTextPreprocessor, self).__init__(
            text_col=text_col,
            max_vocab=max_vocab,
            min_freq=min_freq,
            maxlen=maxlen,
            pad_first=pad_first,
            pad_idx=pad_idx,
            already_processed=already_processed,
            word_vectors_path=word_vectors_path,
            n_cpus=n_cpus,
            verbose=verbose,
        )

        self.n_chunks = n_chunks
        self.root_dir = root_dir

        self.chunk_counter = 0

        self.is_fitted = False

    def partial_fit(self, df: pd.DataFrame) -> "ChunkTextPreprocessor":
        # df is a chunk of the original dataframe
        self.chunk_counter += 1

        texts = self._read_texts(df, self.root_dir)

        tokens = get_texts(texts, self.already_processed, self.n_cpus)

        if not hasattr(self, "vocab"):
            self.vocab = ChunkVocab(
                max_vocab=self.max_vocab,
                min_freq=self.min_freq,
                pad_idx=self.pad_idx,
                n_chunks=self.n_chunks,
            )

        self.vocab.fit(tokens)

        if self.chunk_counter == self.n_chunks:
            if self.verbose:
                print("The vocabulary contains {} tokens".format(len(self.vocab.stoi)))
            if self.word_vectors_path is not None:
                self.embedding_matrix = build_embeddings_matrix(
                    self.vocab, self.word_vectors_path, self.min_freq
                )

            self.is_fitted = True

        return self

    def fit(self, df: pd.DataFrame) -> "ChunkTextPreprocessor":
        # df is a chunk of the original dataframe
        return self.partial_fit(df)

    def __repr__(self) -> str:
        list_of_params: List[str] = ["text_col='{text_col}'"]
        if self.n_chunks is not None:
            list_of_params.append("n_chunks={n_chunks}")
        if self.root_dir is not None:
            list_of_params.append("root_dir={root_dir}")
        list_of_params.append("max_vocab={max_vocab}")
        list_of_params.append("min_freq={min_freq}")
        list_of_params.append("maxlen={maxlen}")
        list_of_params.append("pad_first={pad_first}")
        list_of_params.append("pad_idx={pad_idx}")
        if self.word_vectors_path is not None:
            list_of_params.append("word_vectors_path={word_vectors_path}")
        if self.n_cpus is not None:
            list_of_params.append("n_cpus={n_cpus}")
        list_of_params.append("verbose={verbose}")
        all_params = ", ".join(list_of_params)
        return f"ChunkTextPreprocessor({all_params.format(**self.__dict__)})"

ChunkHFPreprocessor

Bases: HFPreprocessor

Text processor to prepare the deeptext input dataset that is a wrapper around HuggingFace's tokenizers.

Hugging Face tokenizers are already 'trained'. Therefore, unlike the ChunkTextPreprocessor, this class is mostly identical to the HFPreprocessor, with the difference that it needs a 'text_col' parameter to be passed. Also, the parameter encode_params is not really optional when using this class: it must be passed and contain at least the 'max_length' encoding parameter. This is because we need to ensure that all sequences have the same length when encoding in chunks.

Parameters:

Name Type Description Default
model_name str

The model name from the transformers library e.g. 'bert-base-uncased'. Currently supported models are those from the families: BERT, RoBERTa, DistilBERT, ALBERT and ELECTRA.

required
text_col str

The column in the input dataframe containing the text data. When using the ChunkHFPreprocessor the text_col parameter is mandatory.

required
root_dir Optional[str]

The root directory where the text files are located. This is only needed if the text data is stored in text files. If the text data is stored in a column in the input dataframe, this parameter is not needed.

None
use_fast_tokenizer bool

Whether to use the fast tokenizer from HuggingFace or not

True
num_workers Optional[int]

Number of workers to use when preprocessing the text data. If not None, and use_fast_tokenizer is False, the text data will be preprocessed in parallel using the number of workers specified. If use_fast_tokenizer is True, this argument is ignored.

None
preprocessing_rules Optional[List[Callable[[str], str]]]

A list of functions to be applied to the text data before encoding. This can be useful to clean the text data before encoding. For example, removing html tags, special characters, etc.

None
tokenizer_params Optional[Dict[str, Any]]

Additional parameters to be passed to the HuggingFace's PreTrainedTokenizer.

None
encode_params Optional[Dict[str, Any]]

Additional parameters to be passed to the batch_encode_plus method of HuggingFace's PreTrainedTokenizer. In the case of the ChunkHFPreprocessor, this parameter is not really Optional: it must be passed and contain at least the 'max_length' encoding parameter.

None

Attributes:

Name Type Description
is_fitted bool

Boolean indicating if the preprocessor has been fitted. This is a Hugging Face tokenizer, so it is always considered fitted and this attribute is manually set to True internally. This attribute exists for consistency with the rest of the library and because it is needed for some of its functionality.

Source code in pytorch_widedeep/preprocessing/hf_preprocessor.py
class ChunkHFPreprocessor(HFPreprocessor):
    """Text processor to prepare the ``deeptext`` input dataset that is a
    wrapper around HuggingFace's tokenizers.

    Hugging Face tokenizers are already 'trained'. Therefore, unlike the
    `ChunkTextPreprocessor`, this class is mostly identical to the
    `HFPreprocessor`, with the difference that it needs a 'text_col'
    parameter to be passed. Also, the parameter `encode_params` is not
    really optional when using this class: it must be passed and contain
    at least the 'max_length' encoding parameter. This is because we need
    to ensure that all sequences have the same length when encoding in
    chunks.

    Parameters
    ----------
    model_name: str
        The model name from the transformers library e.g. _'bert-base-uncased'_.
        Currently supported models are those from the families: BERT, RoBERTa,
        DistilBERT, ALBERT and ELECTRA.
    text_col: str
        The column in the input dataframe containing the text data. When
        using the `ChunkHFPreprocessor` the `text_col` parameter is
        mandatory.
    root_dir: Optional[str], default = None
        The root directory where the text files are located. This is only
        needed if the text data is stored in text files. If the text data is
        stored in a column in the input dataframe, this parameter is not
        needed.
    use_fast_tokenizer: bool, default = True
        Whether to use the fast tokenizer from HuggingFace or not
    num_workers: Optional[int], default = None
        Number of workers to use when preprocessing the text data. If not
        None, and `use_fast_tokenizer` is False, the text data will be
        preprocessed in parallel using the number of workers specified. If
        `use_fast_tokenizer` is True, this argument is ignored.
    preprocessing_rules: Optional[List[Callable[[str], str]]], default = None
        A list of functions to be applied to the text data before encoding.
        This can be useful to clean the text data before encoding. For
        example, removing html tags, special characters, etc.
    tokenizer_params: Optional[Dict[str, Any]], default = None
        Additional parameters to be passed to the HuggingFace's
        `PreTrainedTokenizer`.
    encode_params: Optional[Dict[str, Any]], default = None
        Additional parameters to be passed to the `batch_encode_plus` method
        of the HuggingFace's `PreTrainedTokenizer`. In the case of the
        `ChunkHFPreprocessor`, this parameter is not really `Optional`. It
        must be passed containing at least the 'max_length' encoding
        parameter

    Attributes
    ----------
    is_fitted: bool
        Boolean indicating if the preprocessor has been fitted. This is a
        Hugging Face tokenizer, so it is always considered fitted and this
        attribute is manually set to True internally. This attribute exists
        for consistency with the rest of the library and because it is
        needed for some of its functionality.

    """

    def __init__(
        self,
        model_name: str,
        *,
        text_col: str,
        root_dir: Optional[str] = None,
        use_fast_tokenizer: bool = True,
        num_workers: Optional[int] = None,
        preprocessing_rules: Optional[List[Callable[[str], str]]] = None,
        tokenizer_params: Optional[Dict[str, Any]] = None,
        encode_params: Optional[Dict[str, Any]] = None,
    ):
        super().__init__(
            model_name=model_name,
            use_fast_tokenizer=use_fast_tokenizer,
            text_col=text_col,
            num_workers=num_workers,
            preprocessing_rules=preprocessing_rules,
            tokenizer_params=tokenizer_params,
            encode_params=encode_params,
        )

        self.root_dir = root_dir

        # when using in chunks encode_params is not really optional. I will
        # review types in due time
        if self.encode_params is None:
            raise ValueError(
                "The 'encode_params' dict must be passed to the ChunkHFTokenizer "
                "containing at least the 'max_length' encoding parameter"
            )

        if "padding" not in self.encode_params or not self.encode_params["padding"]:
            self.encode_params["padding"] = True

        if (
            "truncation" not in self.encode_params
            or not self.encode_params["truncation"]
        ):
            self.encode_params["truncation"] = True

    def __repr__(self):
        return (
            f"ChunkHFPreprocessor(text_col={self.text_col}, model_name={self.model_name}, "
            f"use_fast_tokenizer={self.use_fast_tokenizer}, num_workers={self.num_workers}, "
            f"preprocessing_rules={self.preprocessing_rules}, tokenizer_params={self.tokenizer_params}, "
            f"encode_params={self.encode_params}, root_dir={self.root_dir})"
        )