Text utils¶
Collection of helper functions that facilitate text processing.
simple_preprocess ¶
simple_preprocess(
doc, lower=False, deacc=False, min_len=2, max_len=15
)
This is Gensim's simple_preprocess with a lower param to
indicate whether or not to lower-case all the tokens in the doc.
For more information see: Gensim utils module
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `doc` | `str` | Input document. | required |
| `lower` | `bool` | Lower-case tokens in the input doc. | `False` |
| `deacc` | `bool` | Remove accent marks from tokens. | `False` |
| `min_len` | `int` | Minimum length of token (inclusive). Shorter tokens are discarded. | `2` |
| `max_len` | `int` | Maximum length of token in result (inclusive). Longer tokens are discarded. | `15` |
Examples:
>>> from pytorch_widedeep.utils import simple_preprocess
>>> simple_preprocess('Machine learning is great')
['Machine', 'learning', 'is', 'great']
Returns:

| Type | Description |
|---|---|
| `List[str]` | List with the processed tokens. |
Source code in pytorch_widedeep/utils/text_utils.py
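To see how the `lower`, `min_len` and `max_len` parameters interact, here is a minimal re-implementation sketch. It is an approximation (the real function delegates to Gensim's tokenizer, whose token pattern and deaccenting differ from the simple `\w+` split used here), and `simple_preprocess_sketch` is a hypothetical name:

```python
import re


def simple_preprocess_sketch(doc, lower=False, min_len=2, max_len=15):
    """Approximate sketch: split the document into word tokens and keep
    only tokens whose length is within [min_len, max_len] (inclusive)."""
    tokens = re.findall(r"\w+", doc)
    if lower:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(simple_preprocess_sketch("Machine learning is great"))
# ['Machine', 'learning', 'is', 'great']
print(simple_preprocess_sketch("Machine learning is great", lower=True))
# ['machine', 'learning', 'is', 'great']
```

With the default `lower=False` the original casing is preserved, matching the example above; `min_len=2` keeps `'is'` because the bound is inclusive.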
get_texts ¶
get_texts(texts, already_processed=False, n_cpus=None)
Tokenization using Fastai's Tokenizer, which performs a
series of very convenient steps during the tokenization process.
See pytorch_widedeep.utils.fastai_transforms.Tokenizer
NOTE:
get_texts uses pytorch_widedeep.utils.fastai_transforms.Tokenizer.
Such a tokenizer applies a series of convenient processing steps, including
the addition of some special tokens, such as TK_MAJ (xxmaj), used to
indicate that the next word begins with a capital in the original text. For more
details on special tokens please see the fastai docs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `List[str]` | List of str with the texts (or documents), one str per document. | required |
| `already_processed` | `Optional[bool]` | Boolean indicating if the text is already processed and we simply want to tokenize it. This parameter is intended for those cases where the input sequences might not be text (but IDs, or anything else) and we just want to tokenize them. | `False` |
| `n_cpus` | `Optional[int]` | Number of CPUs to use during the tokenization process. | `None` |
Examples:
>>> from pytorch_widedeep.utils import get_texts
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> get_texts(texts)
[['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]
Returns:

| Type | Description |
|---|---|
| `List[List[str]]` | List of lists, one list per 'document' containing its corresponding tokens. |
Source code in pytorch_widedeep/utils/text_utils.py
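The TK_MAJ (`xxmaj`) mechanism from the note above can be illustrated with a simplified sketch. This is an approximation of one fastai rule, not the full tokenizer (fastai also handles all-caps tokens, repeated characters, and more), and `mark_capitals` is a hypothetical helper name:

```python
def mark_capitals(tokens):
    """Insert the special token 'xxmaj' before each capitalized word and
    lower-case it, roughly mimicking fastai's TK_MAJ rule."""
    out = []
    for t in tokens:
        if t[:1].isupper():
            out.append("xxmaj")
            out.append(t.lower())
        else:
            out.append(t)
    return out


print(mark_capitals("Machine learning is great".split()))
# ['xxmaj', 'machine', 'learning', 'is', 'great']
```

The output matches the first document in the `get_texts` example: the capitalization of 'Machine' is recorded as a separate token rather than lost, so the downstream model can still exploit it.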
pad_sequences ¶
pad_sequences(seq, maxlen, pad_first=True, pad_idx=1)
Given a tokenized and numericalised sequence, it will return a
padded sequence according to the input parameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `seq` | `List[int]` | List of int with the numericalised sequence of tokens. | required |
| `maxlen` | `int` | Maximum length of the padded sequences. | required |
| `pad_first` | `bool` | Indicates whether the padding index will be added at the beginning or the end of the sequences. | `True` |
| `pad_idx` | `int` | Padding index. Fastai's Tokenizer reserves 0 for the 'unknown' token. | `1` |
Examples:
>>> from pytorch_widedeep.utils import pad_sequences
>>> seq = [1,2,3]
>>> pad_sequences(seq, maxlen=5, pad_idx=0)
array([0, 0, 1, 2, 3], dtype=int32)
Returns:

| Type | Description |
|---|---|
| `ndarray` | Numpy array with the padded sequences. |
Source code in pytorch_widedeep/utils/text_utils.py
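The padding logic itself is simple enough to sketch with plain lists. This is an approximation (the library returns a numpy array, and the truncation rule for sequences longer than `maxlen` is an assumption here, not taken from the source):

```python
def pad_sequences_sketch(seq, maxlen, pad_first=True, pad_idx=1):
    """Pad (or truncate) a numericalised sequence to exactly maxlen."""
    if len(seq) >= maxlen:
        # assumption: keep the last maxlen tokens, a common choice for text
        return seq[-maxlen:]
    padding = [pad_idx] * (maxlen - len(seq))
    return padding + seq if pad_first else seq + padding


print(pad_sequences_sketch([1, 2, 3], maxlen=5, pad_idx=0))
# [0, 0, 1, 2, 3]
print(pad_sequences_sketch([1, 2, 3], maxlen=5, pad_first=False))
# [1, 2, 3, 1, 1]
```

The first call reproduces the documented example (`pad_idx=0`, padding prepended); the second shows `pad_first=False` appending the default `pad_idx=1` instead.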
build_embeddings_matrix ¶
build_embeddings_matrix(
vocab, word_vectors_path, min_freq, verbose=1
)
Build the embedding matrix using pretrained word vectors.
Returns the pretrained word embeddings. If a word in our vocabulary is not among the pretrained embeddings, it will be assigned the mean pretrained word-embedding vector.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab` | `Union[Vocab, ChunkVocab]` | Vocabulary object (an instance of Vocab or ChunkVocab). | required |
| `word_vectors_path` | `str` | Path to the pretrained word embeddings. | required |
| `min_freq` | `int` | Minimum frequency required for a word to be in the vocabulary. | required |
| `verbose` | `int` | Level of verbosity. Set to 0 for no verbosity. | `1` |
Returns:

| Type | Description |
|---|---|
| `ndarray` | Pretrained word embeddings. |
Source code in pytorch_widedeep/utils/text_utils.py
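The mean-vector fallback for out-of-vocabulary words can be sketched as follows. This is a minimal illustration assuming the pretrained vectors have already been loaded into a dict; the real function reads them from `word_vectors_path` and returns a numpy array, and `build_embeddings_sketch` is a hypothetical name:

```python
def build_embeddings_sketch(vocab_words, pretrained):
    """pretrained: dict mapping word -> embedding (list of floats).
    Words missing from `pretrained` get the mean pretrained vector."""
    vectors = list(pretrained.values())
    dim = len(vectors[0])
    # component-wise mean over all pretrained vectors
    mean_vec = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [pretrained.get(w, mean_vec) for w in vocab_words]


pretrained = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
matrix = build_embeddings_sketch(["cat", "dog", "platypus"], pretrained)
print(matrix[2])
# [0.5, 0.5] -- the mean vector, assigned to the out-of-vocabulary word
```

Using the mean vector (rather than zeros or random noise) keeps unknown words at the centroid of the pretrained embedding space, which tends to be a safer starting point for fine-tuning.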