preprocess module

cranetoolbox.preprocess.preprocess.count_per_day(data: [<class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]) → pandas.core.frame.DataFrame

Count occurrences per day in a list of data records.

Parameters

data (list(list())) – List of data records, where the timestamp is the fourth column.

Returns

A DataFrame with the count of occurrences per day.

Return type

pandas.DataFrame
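The counting step can be illustrated with the standard library alone. This is a minimal sketch, not the cranetoolbox implementation: `count_per_day_sketch` is a hypothetical name, it returns a `Counter` rather than a DataFrame, and it assumes ISO-8601 timestamps, which the documentation above does not specify.

```python
from collections import Counter


def count_per_day_sketch(records):
    """Count records per calendar day; the timestamp is the fourth column."""
    # Assumption: ISO-8601 timestamps, so the first 10 characters are YYYY-MM-DD.
    return Counter(record[3][:10] for record in records)


# Hypothetical records in the form [id, original_text, clean_text, timestamp].
records = [
    ["1", "raw a", "clean a", "2020-03-01T08:15:00"],
    ["2", "raw b", "clean b", "2020-03-01T21:40:00"],
    ["3", "raw c", "clean c", "2020-03-02T09:05:00"],
]
daily_counts = count_per_day_sketch(records)
```

Here `daily_counts` maps each day to the number of records carrying that date.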

cranetoolbox.preprocess.preprocess.merge_counts_dataframe(counts_list: List[pandas.core.frame.DataFrame]) → Optional[pandas.core.frame.DataFrame]

Get merged counts per day from a list of DataFrames.

Parameters

counts_list (list(pandas.DataFrame)) – List of DataFrames with the count of occurrences per day.

Returns

A DataFrame containing the per-day counts summed across the input list.

Return type

pandas.DataFrame
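Merging amounts to summing counts that share the same day. The sketch below shows that idea with `Counter` objects standing in for the DataFrames; `merge_counts_sketch` is a hypothetical name, and returning `None` for an empty list is an assumption suggested only by the `Optional` return annotation.

```python
from collections import Counter


def merge_counts_sketch(counts_list):
    """Sum per-day counts from several Counter objects."""
    # Assumption: an empty input yields None, mirroring the Optional return type.
    if not counts_list:
        return None
    merged = Counter()
    for counts in counts_list:
        merged.update(counts)  # update() adds counts for days already present
    return merged


merged = merge_counts_sketch([
    Counter({"2020-03-01": 2}),
    Counter({"2020-03-01": 1, "2020-03-02": 4}),
])
```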

cranetoolbox.preprocess.preprocess.preprocess_csv_file(csv_reader: _csv.reader, file_path: str, output_path: str, replace_or_remove_url: bool, replace_or_remove_mentions: bool, remove_hashtag_or_segment: bool, replace_or_remove_punctuation: bool, replace_or_remove_numbers: bool) → Optional[pandas.core.frame.DataFrame]

Preprocess a single CSV file.

Parameters
  • csv_reader (csv.reader) – The reader for the input CSV file, without header.

  • file_path (str) – The path to the input file.

  • output_path (str) – The path to the output folder.

  • replace_or_remove_url (bool) – True to replace URLs, False to remove them.

  • replace_or_remove_mentions (bool) – True to replace mentions, False to remove them.

  • remove_hashtag_or_segment (bool) – True to remove ‘#’ in front of hashtags, False to segment hashtags.

  • replace_or_remove_punctuation (bool) – True to replace multiple punctuation, False to remove all punctuation.

  • replace_or_remove_numbers (bool) – True to replace numbers by their text version, False to remove them.

Returns

DataFrame of the processed CSV file

Return type

pandas.DataFrame

cranetoolbox.preprocess.preprocess.preprocessing_text(text: str, replace_or_remove_url: bool, replace_or_remove_mentions: bool, remove_hashtag_or_segment: bool, replace_or_remove_punctuation: bool, replace_or_remove_numbers: bool) → str

Preprocess the text content of a tweet for analysis.

Parameters
  • text (str) – Text content of a tweet.

  • replace_or_remove_url (bool) – True to replace URLs, False to remove them.

  • replace_or_remove_mentions (bool) – True to replace mentions, False to remove them.

  • remove_hashtag_or_segment (bool) – True to remove ‘#’ in front of hashtags, False to segment hashtags.

  • replace_or_remove_punctuation (bool) – True to replace multiple punctuation, False to remove all punctuation.

  • replace_or_remove_numbers (bool) – True to replace numbers by their text version, False to remove them.

Returns

The clean version of the text.

Return type

str
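Each boolean flag selects between two treatments of the same pattern. The sketch below illustrates that for URLs and mentions only; it is not the library's code. The function name, the regular expressions, and the final whitespace cleanup are assumptions, while the "url" and "atUser" placeholders are the ones documented for replace_url and replace_at_user.

```python
import re

# Assumed patterns; the library's own expressions may differ.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def preprocess_text_sketch(text, replace_or_remove_url, replace_or_remove_mentions):
    """Replace (True) or remove (False) URLs and mentions."""
    text = URL_PATTERN.sub("url" if replace_or_remove_url else "", text)
    text = MENTION_PATTERN.sub("atUser" if replace_or_remove_mentions else "", text)
    # Collapse the whitespace left behind by removals.
    return " ".join(text.split())


sample = "@alice see https://example.com now"
replaced = preprocess_text_sketch(sample, True, True)
removed = preprocess_text_sketch(sample, False, False)
```

With both flags set to True the placeholders stay in the text; with both set to False only the surrounding words remain.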

cranetoolbox.preprocess.preprocess.preprocessing_tweet(tweet: [<class 'str'>, <class 'str'>, <class 'str'>], replace_or_remove_url: bool, replace_or_remove_mentions: bool, remove_hashtag_or_segment: bool, replace_or_remove_punctuation: bool, replace_or_remove_numbers: bool) → [<class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]

Preprocess a tweet record for analysis, adding the clean version of its text.

Parameters
  • tweet (list()) – An array with the tweet info, in format [id, original_text, timestamp].

  • replace_or_remove_url (bool) – True to replace URLs, False to remove them.

  • replace_or_remove_mentions (bool) – True to replace mentions, False to remove them.

  • remove_hashtag_or_segment (bool) – True to remove ‘#’ in front of hashtags, False to segment hashtags.

  • replace_or_remove_punctuation (bool) – True to replace multiple punctuation, False to remove all punctuation.

  • replace_or_remove_numbers (bool) – True to replace numbers by their text version, False to remove them.

Returns

An array with the tweet info, including clean text, in format [id, original_text, clean_text, timestamp].

Return type

list()
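The record transformation described above, from three columns to four, can be sketched as follows. `preprocess_tweet_sketch` and the `clean` callable are hypothetical stand-ins; the real function takes the five boolean flags listed in its signature instead.

```python
def preprocess_tweet_sketch(tweet, clean):
    """Insert the clean text between the original text and the timestamp."""
    tweet_id, original_text, timestamp = tweet
    return [tweet_id, original_text, clean(original_text), timestamp]


# Hypothetical cleaner: lowercase and collapse whitespace.
tweet = ["42", "Hello   WORLD", "2020-03-01T08:15:00"]
result = preprocess_tweet_sketch(tweet, lambda t: " ".join(t.lower().split()))
```

The original text is kept unchanged in the second column, and the timestamp moves to the fourth column, where count_per_day expects it.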

cranetoolbox.preprocess.preprocessTools.remove_at_user(text: str) → str

Removes “@user”

cranetoolbox.preprocess.preprocessTools.remove_escaped_unicode(text: str) → str

Removes escaped Unicode characters from the text

cranetoolbox.preprocess.preprocessTools.remove_hashtag_in_front_of_word(text: str) → str

Removes the hashtag symbol in front of a word

cranetoolbox.preprocess.preprocessTools.remove_non_ascii(text: str) → str

Removes non-ASCII characters from the text
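One common way to drop non-ASCII characters is an encode and decode round trip that ignores anything outside the ASCII range. This is a sketch of that general technique under a hypothetical name, not necessarily how the function above is implemented.

```python
def remove_non_ascii_sketch(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")


stripped = remove_non_ascii_sketch("café ☕ time")
```

The spaces around a removed character are kept, so a later whitespace cleanup may still be needed.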

cranetoolbox.preprocess.preprocessTools.remove_numbers(text: str) → str

Removes integers

cranetoolbox.preprocess.preprocessTools.remove_punctuation(text: str) → str

Removes punctuation symbols, except hyphens
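Keeping hyphens means compound words such as "well-known" survive punctuation removal. The sketch below shows one way to get that behavior with `str.translate`; the function name and the use of `string.punctuation` as the symbol set are assumptions.

```python
import string

# Every ASCII punctuation symbol except the hyphen.
PUNCTUATION_EXCEPT_HYPHEN = string.punctuation.replace("-", "")


def remove_punctuation_sketch(text):
    """Delete punctuation symbols while leaving hyphens in place."""
    return text.translate(str.maketrans("", "", PUNCTUATION_EXCEPT_HYPHEN))


no_punctuation = remove_punctuation_sketch("well-known, right?!")
```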

cranetoolbox.preprocess.preprocessTools.remove_url(text: str) → str

Removes URLs

cranetoolbox.preprocess.preprocessTools.replace_at_user(text: str) → str

Replaces “@user” with “atUser”

cranetoolbox.preprocess.preprocessTools.replace_contraction(text: str) → str

Replaces contractions in a string with their expanded equivalents

cranetoolbox.preprocess.preprocessTools.replace_hashtags(text: str) → str

Removes the hashtag symbol in front of a word and adds hashtag segmentation

cranetoolbox.preprocess.preprocessTools.replace_multi_exclamation_mark(text: str) → str

Replaces repetitions of exclamation marks

cranetoolbox.preprocess.preprocessTools.replace_multi_question_mark(text: str) → str

Replaces repetitions of question marks

cranetoolbox.preprocess.preprocessTools.replace_multi_stop_mark(text: str) → str

Replaces repetitions of stop marks
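The three functions above share one idea: a run of two or more identical marks becomes a single token. The descriptions do not say what the replacement is, so the tokens below (`multiExclamation`, `multiQuestion`, `multiStop`) are hypothetical, as are the function name and the reading of "stop mark" as a period.

```python
import re

# Assumed mapping from a repeated mark to a hypothetical replacement token.
REPEATED_MARKS = [
    (re.compile(r"!{2,}"), " multiExclamation "),
    (re.compile(r"\?{2,}"), " multiQuestion "),
    (re.compile(r"\.{2,}"), " multiStop "),
]


def replace_repeated_marks_sketch(text):
    """Replace runs of two or more '!', '?' or '.' with a token."""
    for pattern, token in REPEATED_MARKS:
        text = pattern.sub(token, text)
    # Collapse the extra spaces introduced around the tokens.
    return " ".join(text.split())


marked = replace_repeated_marks_sketch("Wait!!! Really?? Hmm...")
```

A single mark, as in "Hi!", is left untouched because each pattern requires at least two repetitions.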

cranetoolbox.preprocess.preprocessTools.replace_new_line(text: str) → str

Replaces new lines with spaces

cranetoolbox.preprocess.preprocessTools.replace_numbers(text: str) → str

Replaces numbers with their text version

cranetoolbox.preprocess.preprocessTools.replace_url(text: str) → str

Replaces URLs with “url”

cranetoolbox.preprocess.preprocessTools.segment_hashtag(text: str) → str

Removes the hashtag symbol in front of a word and adds hashtag segmentation