importTools module
-
class cranetoolbox.importTools.transform.TransformationOptions(languagefilter: str, retweets: bool, max_in_mem: int, text_field_key: str, id_field_key: str, date_field_key: str)
A simple class to handle ETL options as specified in the driver.
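For illustration, an options object carrying the fields listed in the signature could be modeled as below. This is a hypothetical stand-in dataclass (`OptionsSketch`), not the library's class, and the example values (`"en"`, `50000`, `"text"`, `"id"`, `"created_at"`) are assumptions chosen to resemble Twitter-style JSON.

```python
from dataclasses import dataclass


@dataclass
class OptionsSketch:
    """Hypothetical stand-in mirroring the documented constructor arguments."""

    languagefilter: str
    retweets: bool
    max_in_mem: int
    text_field_key: str
    id_field_key: str
    date_field_key: str


# All values below are assumed examples, not defaults taken from the library.
opts = OptionsSketch(
    languagefilter="en",
    retweets=False,
    max_in_mem=50000,
    text_field_key="text",
    id_field_key="id",
    date_field_key="created_at",
)
```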
-
cranetoolbox.importTools.transform.filter_lighten_chunk(chunk, opts: cranetoolbox.importTools.transform.TransformationOptions) → (List[dict], int)
Filter and lighten a given set of lines, keeping only important keys.
-
cranetoolbox.importTools.transform.is_retweet(tweet: dict, text_field_key: str) → bool
Check whether a tweet is a retweet.
- Parameters
tweet (dict) – A dictionary representing a JSON tweet
text_field_key (str) – Key of the tweet's text field
- Returns
True if the tweet has been labeled as a retweet, False otherwise.
- Return type
bool
Note
Checks for automated retweets via the retweet flag and for manual retweets of the form “RT [original tweet]”.
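The two checks described in the note can be sketched as follows. This is an illustrative re-implementation, not the library's code: the function name `is_retweet_sketch` is hypothetical, and treating the presence of a `retweeted_status` key as the "retweet flag" is an assumption.

```python
def is_retweet_sketch(tweet: dict, text_field_key: str) -> bool:
    """Illustrative only; the real is_retweet may use a different flag."""
    # Automated retweet: assumed here to be marked by a "retweeted_status" key.
    if "retweeted_status" in tweet:
        return True
    # Manual retweet: text of the form "RT [original tweet]".
    text = tweet.get(text_field_key) or ""
    return text.startswith("RT ")
```

Requiring the trailing space in `"RT "` keeps ordinary words that merely begin with those letters from being counted as manual retweets.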
-
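cranetoolbox.importTools.transform.lighten_tweet(tweet: dict, text_field_key: str, id_field_key: str, date_field_key: str) → (str, str, str)
Lighten a tweet by returning only the fields required for analysis.
- Parameters
tweet (dict) – A dictionary representing a JSON tweet
text_field_key (str) – Key of the tweet's text field
id_field_key (str) – Key of the tweet's ID field
date_field_key (str) – Key of the tweet's date field
- Returns
A tuple of the three values extracted from the passed tweet
- Return type
(str, str, str)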
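A minimal sketch of this kind of "lightening" is shown below. The name `lighten_tweet_sketch` is hypothetical, and both the `(id, date, text)` ordering of the returned tuple and the conversion of each value to `str` are assumptions; the documentation above only states that three values are returned.

```python
from typing import Tuple


def lighten_tweet_sketch(
    tweet: dict, text_field_key: str, id_field_key: str, date_field_key: str
) -> Tuple[str, str, str]:
    """Keep only three fields of a tweet; the tuple ordering is assumed."""
    return (
        str(tweet[id_field_key]),
        str(tweet[date_field_key]),
        str(tweet[text_field_key]),
    )
```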
-
cranetoolbox.importTools.transform.matches_language_filter(tweet: dict, opts: cranetoolbox.importTools.transform.TransformationOptions) → bool
Check whether a tweet is in the desired language. If the relevant JSON key does not exist, the tweet is assumed to match the language filter (returns True).
- Parameters
tweet (dict) – A dictionary representing a full JSON tweet (with no data removed)
opts (TransformationOptions) – Transformation options
- Returns
True if the tweet matches the user-specified language filter, False otherwise
- Return type
bool
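The missing-key behavior described above can be sketched like this. The code is a stand-in rather than the library's implementation: `OptionsSketch` and `matches_language_filter_sketch` are hypothetical names, and the JSON key `"lang"` is an assumption, since the documentation does not name the key.

```python
from dataclasses import dataclass


@dataclass
class OptionsSketch:
    """Hypothetical stand-in holding only the field this sketch needs."""

    languagefilter: str


def matches_language_filter_sketch(
    tweet: dict, opts: OptionsSketch, lang_key: str = "lang"
) -> bool:
    # A tweet without a language key is assumed to match the filter.
    if lang_key not in tweet:
        return True
    return tweet[lang_key] == opts.languagefilter
```

Defaulting to True means tweets with no language information are kept rather than silently dropped.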
-
cranetoolbox.importTools.transform.parse_tweet(tweet: str) → dict
Parse the passed JSON-format tweet from a string into a dictionary.
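Parsing a single line of tweet JSON can be sketched with the standard library. Whether the real function uses `json.loads` or another parser is not stated here, so treat this as an assumption; `parse_tweet_sketch` is a hypothetical name.

```python
import json


def parse_tweet_sketch(tweet: str) -> dict:
    """Decode one JSON-encoded tweet; raises ValueError on malformed input."""
    return json.loads(tweet)
```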
-
cranetoolbox.importTools.transform.process_files(file_list: List[str], opts: cranetoolbox.importTools.transform.TransformationOptions, csv_output_path: str) → (int, int)
Top-level function that combines the input files into a single CSV file.
- Parameters
file_list (List[str]) – Paths of the input files
opts (TransformationOptions) – An instance of the transformation options, used to control filtering and parsing of tweets
csv_output_path (str) – Full output path, including folder, filename and extension
- Returns
A tuple of (successes, failures) counting the lines written to the file and the lines that failed
- Return type
(int, int)
-
cranetoolbox.importTools.transform.process_tar_file(file: str, opts: cranetoolbox.importTools.transform.TransformationOptions, csv_output_path: str) → (int, int)
Process any uncompressed nested files contained within a single tar file.
- Parameters
file (str) – Path to tar file
opts (TransformationOptions) – Transformation options
csv_output_path (str) – Output path for the combined CSV file
- Returns
A tuple of (successes, failures) counts
- Return type
(int, int)
Warning
This cannot handle nested compression, i.e. a tar inside a tar.
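Reading the files nested in a tar archive can be sketched with the standard `tarfile` module. The helper below (`iter_tar_lines`, a hypothetical name) only shows the traversal; it does no filtering or CSV writing, assumes UTF-8 text, and is not the library's implementation. Like the function documented above, it does not descend into archives nested inside the archive.

```python
import tarfile
from typing import Iterator


def iter_tar_lines(path: str) -> Iterator[str]:
    """Yield decoded lines from every regular file inside a tar archive."""
    with tarfile.open(path, "r") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            handle = archive.extractfile(member)
            if handle is None:
                continue
            for raw in handle:
                # Assumes UTF-8 content with one record per line.
                yield raw.decode("utf-8").rstrip("\n")
```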
-
cranetoolbox.importTools.transform.write_tweets_by_chunk(lines, csv_output_path: str, opts: cranetoolbox.importTools.transform.TransformationOptions) → (int, int)
Process an arbitrary number of lines and save them to the CSV output file.
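Chunked writing of this kind can be sketched as below. This is an assumption-laden illustration rather than the library's code: `write_lines_by_chunk` is a hypothetical name, `chunk_size` stands in for what `max_in_mem` presumably controls, and the keys `id`, `created_at` and `text` are assumed field names.

```python
import csv
import json
from itertools import islice
from typing import Iterable, Tuple


def write_lines_by_chunk(
    lines: Iterable[str], csv_output_path: str, chunk_size: int
) -> Tuple[int, int]:
    """Append parsed lines to a CSV in bounded-size chunks.

    Returns (successes, failures): rows written and lines that could not be parsed.
    """
    successes, failures = 0, 0
    iterator = iter(lines)
    with open(csv_output_path, "a", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        while True:
            # Hold at most chunk_size lines in memory at once.
            chunk = list(islice(iterator, chunk_size))
            if not chunk:
                break
            rows = []
            for line in chunk:
                try:
                    tweet = json.loads(line)
                    rows.append((tweet["id"], tweet["created_at"], tweet["text"]))
                except (ValueError, KeyError, TypeError):
                    failures += 1  # malformed JSON or missing field
            writer.writerows(rows)
            successes += len(rows)
    return successes, failures
```

Because `islice` consumes the input lazily, `lines` can be an open file or a generator and never has to fit in memory as a whole.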