preprocess module
-
cranetoolbox.preprocess.preprocess.count_per_day(data: [str, str, str, str]) → pandas.core.frame.DataFrame
Count occurrences per day in a list of data records.
-
cranetoolbox.preprocess.preprocess.merge_counts_dataframe(counts_list: List[pandas.core.frame.DataFrame]) → Optional[pandas.core.frame.DataFrame]
Get merged counts per day from a list of DataFrames.
- Parameters
counts_list (list(pandas.DataFrame)) – List of DataFrames with the count of occurrences per day.
- Returns
A DataFrame that sums the per-day counts contained in the input list.
- Return type
pandas.DataFrame
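The relationship between per-day counting and merging can be illustrated with a small sketch. This is not the library's implementation: `count_per_day_sketch` and `merge_counts_sketch` are hypothetical stand-ins, and the record layout (timestamp in the last field, starting with `YYYY-MM-DD`), the `count` column name, and the `None` result for an empty list are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional

import pandas as pd


def count_per_day_sketch(records: List[List[str]]) -> pd.DataFrame:
    """Count records per day (hypothetical stand-in, not the library code)."""
    # Assumption: the timestamp is the last field and begins with YYYY-MM-DD.
    counts = Counter(record[-1][:10] for record in records)
    # One row per day, indexed by the day string, with a single "count" column.
    return pd.Series(dict(counts), name="count").to_frame().sort_index()


def merge_counts_sketch(counts_list: List[pd.DataFrame]) -> Optional[pd.DataFrame]:
    """Sum per-day counts across several DataFrames (hypothetical stand-in)."""
    # Assumption: an empty input yields None, mirroring the Optional return type.
    if not counts_list:
        return None
    # Stack all frames, then add up the rows that share the same day index.
    return pd.concat(counts_list).groupby(level=0).sum()


# Two batches of records in the assumed [id, original_text, clean_text, timestamp] layout.
batch_a = [
    ["1", "hi", "hi", "2020-01-01 10:00:00"],
    ["2", "yo", "yo", "2020-01-01 12:30:00"],
]
batch_b = [
    ["3", "hey", "hey", "2020-01-01 08:00:00"],
    ["4", "bye", "bye", "2020-01-02 09:00:00"],
]

merged = merge_counts_sketch(
    [count_per_day_sketch(batch_a), count_per_day_sketch(batch_b)]
)
print(merged)
```

Grouping on the index after concatenation keeps days that appear in only one batch, so no day is dropped from the merged result.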
-
cranetoolbox.preprocess.preprocess.preprocess_csv_file(csv_reader: _csv.reader, file_path: str, output_path: str, replace_or_remove_url: bool, replace_or_remove_mentions: bool, remove_hashtag_or_segment: bool, replace_or_remove_punctuation: bool, replace_or_remove_numbers: bool) → Optional[pandas.core.frame.DataFrame]
Preprocess a single CSV file.
- Parameters
csv_reader (csv.reader) – The reader for the input CSV file, without header.
file_path (str) – The path to the input file.
output_path (str) – The path to the output folder.
replace_or_remove_url (bool) – True to replace URLs, False to remove them.
replace_or_remove_mentions (bool) – True to replace mentions, False to remove them.
remove_hashtag_or_segment (bool) – True to remove ‘#’ in front of hashtags, False to segment hashtags.
replace_or_remove_punctuation (bool) – True to replace repeated punctuation marks, False to remove all punctuation.
replace_or_remove_numbers (bool) – True to replace numbers with their text version, False to remove them.
- Returns
DataFrame of the processed CSV file.
- Return type
pandas.DataFrame
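Since `csv_reader` is documented as a reader "without header", the caller presumably consumes the header row before handing the reader over. A minimal sketch of that preparation step, using only the standard library, might look as follows. The in-memory file contents and the column names are hypothetical.

```python
import csv
import io

# Hypothetical CSV contents; the column names are illustrative only.
raw = io.StringIO(
    "id,text,timestamp\n"
    "1,Hello @bob,2020-01-01 10:00:00\n"
    "2,Bye,2020-01-02 09:00:00\n"
)

csv_reader = csv.reader(raw)
header = next(csv_reader)  # consume the header so only data rows remain

# Listing the remaining rows shows what a consumer of the reader would see.
# Note that this exhausts the reader; in real use the reader would be passed on instead.
rows = list(csv_reader)
print(header)
print(rows)
```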
-
cranetoolbox.preprocess.preprocess.preprocessing_text(text: str, replace_or_remove_url: bool, replace_or_remove_mentions: bool, remove_hashtag_or_segment: bool, replace_or_remove_punctuation: bool, replace_or_remove_numbers: bool) → str
Preprocess the text content of a tweet for analysis.
- Parameters
text (str) – Text content of a tweet.
replace_or_remove_url (bool) – True to replace URLs, False to remove them.
replace_or_remove_mentions (bool) – True to replace mentions, False to remove them.
remove_hashtag_or_segment (bool) – True to remove ‘#’ in front of hashtags, False to segment hashtags.
replace_or_remove_punctuation (bool) – True to replace repeated punctuation marks, False to remove all punctuation.
replace_or_remove_numbers (bool) – True to replace numbers with their text version, False to remove them.
- Returns
The clean version of the text.
- Return type
str
-
cranetoolbox.preprocess.preprocess.preprocessing_tweet(tweet: [str, str, str], replace_or_remove_url: bool, replace_or_remove_mentions: bool, remove_hashtag_or_segment: bool, replace_or_remove_punctuation: bool, replace_or_remove_numbers: bool) → [str, str, str, str]
Preprocess the text content of a tweet for analysis.
- Parameters
tweet (list()) – An array with the tweet info, in format [id, original_text, timestamp].
replace_or_remove_url (bool) – True to replace URLs, False to remove them.
replace_or_remove_mentions (bool) – True to replace mentions, False to remove them.
remove_hashtag_or_segment (bool) – True to remove ‘#’ in front of hashtags, False to segment hashtags.
replace_or_remove_punctuation (bool) – True to replace repeated punctuation marks, False to remove all punctuation.
replace_or_remove_numbers (bool) – True to replace numbers with their text version, False to remove them.
- Returns
An array with the tweet info, including clean text, in format [id, original_text, clean_text, timestamp].
- Return type
list()
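The replace-or-remove flags and the input and output layouts described above can be illustrated with a small self-contained sketch. It is not the library's code: `clean_text_sketch` and `clean_tweet_sketch` are hypothetical, only two of the five flags are modelled, the regular expressions are simplified, and the `url` placeholder token is an assumption (`atUser` follows the description of `replace_at_user`).

```python
import re
from typing import List

# Simplified patterns, assumed for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def clean_text_sketch(
    text: str, replace_or_remove_url: bool, replace_or_remove_mentions: bool
) -> str:
    """Apply the two flags: True replaces with a token, False removes the match."""
    # "url" is an assumed placeholder token.
    text = URL_PATTERN.sub("url" if replace_or_remove_url else "", text)
    text = MENTION_PATTERN.sub("atUser" if replace_or_remove_mentions else "", text)
    # Collapse any whitespace left behind by removals.
    return " ".join(text.split())


def clean_tweet_sketch(
    tweet: List[str], replace_or_remove_url: bool, replace_or_remove_mentions: bool
) -> List[str]:
    """Turn [id, original_text, timestamp] into [id, original_text, clean_text, timestamp]."""
    tweet_id, original_text, timestamp = tweet
    clean_text = clean_text_sketch(
        original_text, replace_or_remove_url, replace_or_remove_mentions
    )
    return [tweet_id, original_text, clean_text, timestamp]


tweet = ["42", "@bob look at https://example.com now", "2020-01-01 10:00:00"]
replaced = clean_tweet_sketch(tweet, True, True)
removed = clean_tweet_sketch(tweet, False, False)
print(replaced[2])  # atUser look at url now
print(removed[2])   # look at now
```

The original text is kept alongside the clean text, so downstream steps can still refer back to the unmodified tweet.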
-
cranetoolbox.preprocess.preprocessTools.remove_escaped_unicode(text: str) → str
Removes escaped Unicode characters from the text.
-
cranetoolbox.preprocess.preprocessTools.remove_hashtag_in_front_of_word(text: str) → str
Removes the hashtag in front of a word.
-
cranetoolbox.preprocess.preprocessTools.remove_non_ascii(text: str) → str
Removes non-ASCII characters from the text.
-
cranetoolbox.preprocess.preprocessTools.remove_punctuation(text: str) → str
Removes punctuation symbols, except hyphens.
-
cranetoolbox.preprocess.preprocessTools.replace_at_user(text: str) → str
Replaces “@user” with “atUser”.
-
cranetoolbox.preprocess.preprocessTools.replace_contraction(text: str) → str
Replaces contractions in a string with their expanded equivalents.
Removes the hashtag in front of a word and adds hashtag segmentation.
-
cranetoolbox.preprocess.preprocessTools.replace_multi_exclamation_mark(text: str) → str
Replaces repetitions of exclamation marks.
-
cranetoolbox.preprocess.preprocessTools.replace_multi_question_mark(text: str) → str
Replaces repetitions of question marks.
-
cranetoolbox.preprocess.preprocessTools.replace_multi_stop_mark(text: str) → str
Replaces repetitions of stop marks.
-
cranetoolbox.preprocess.preprocessTools.replace_new_line(text: str) → str
Replaces new lines with spaces.
-
cranetoolbox.preprocess.preprocessTools.replace_numbers(text: str) → str
Replaces numbers with their text version.
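To give a feel for what helpers of this kind do, here is a minimal sketch of a few of them written with the standard library only. These `*_sketch` functions are hypothetical re-creations based on the one-line descriptions above, not the code in `preprocessTools`. The regular expressions and the `multiExclamation` replacement token are assumptions.

```python
import re


def replace_at_user_sketch(text: str) -> str:
    # Replace any "@name" mention with the token "atUser".
    return re.sub(r"@\w+", "atUser", text)


def replace_new_line_sketch(text: str) -> str:
    # Replace each newline character with a single space.
    return text.replace("\n", " ")


def replace_multi_exclamation_mark_sketch(text: str) -> str:
    # Replace runs of two or more "!" with an assumed placeholder token.
    return re.sub(r"!{2,}", " multiExclamation ", text)


def remove_hashtag_in_front_of_word_sketch(text: str) -> str:
    # Drop the leading "#" but keep the word that follows it.
    return re.sub(r"#(\w+)", r"\1", text)


def remove_non_ascii_sketch(text: str) -> str:
    # Encode to ASCII, silently dropping characters that cannot be represented.
    return text.encode("ascii", "ignore").decode("ascii")


print(replace_at_user_sketch("hi @alice"))
print(replace_new_line_sketch("first\nsecond"))
print(replace_multi_exclamation_mark_sketch("wow!!!"))
print(remove_hashtag_in_front_of_word_sketch("#python rocks"))
print(remove_non_ascii_sketch("caf\u00e9 au lait"))
```

Each helper takes and returns a plain string, so they can be chained in whatever order a pipeline needs.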