importTools module

class cranetoolbox.importTools.transform.TransformationOptions(languagefilter: str, retweets: bool, max_in_mem: int, text_field_key: str, id_field_key: str, date_field_key: str)

A simple class that holds the ETL options specified in the driver.

cranetoolbox.importTools.transform.filter_lighten_chunk(chunk, opts: cranetoolbox.importTools.transform.TransformationOptions) -> (typing.List[dict], <class 'int'>)

Filter and lighten a given set of lines, keeping only important keys

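Parameters
  • chunk – Lines of tweets to filter and lighten

  • opts (TransformationOptions) – Transformation options

Returns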

List of filtered tweets and parse failure count

Return type

list(dict), int
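
The parse, filter, and lighten flow described above can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the library's code: the function name `filter_and_lighten`, the `keep_keys` list, and the `keep` predicate are all hypothetical stand-ins for what the real function reads from its options object.

```python
import json
from typing import Callable, Iterable, List, Tuple


def filter_and_lighten(
    lines: Iterable[str],
    keep_keys: List[str],
    keep: Callable[[dict], bool] = lambda tweet: True,
) -> Tuple[List[dict], int]:
    """Hypothetical sketch: parse each line, filter it, keep only some keys."""
    kept, failures = [], 0
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            # Count lines that are not valid JSON as parse failures.
            failures += 1
            continue
        if not isinstance(tweet, dict):
            # Valid JSON that is not an object is also treated as a failure here.
            failures += 1
            continue
        if not keep(tweet):
            continue
        # "Lighten" the tweet by dropping every key not explicitly requested.
        kept.append({key: tweet.get(key) for key in keep_keys})
    return kept, failures
```

Filtered-out tweets are not counted as failures in this sketch; only lines that cannot be parsed are.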

cranetoolbox.importTools.transform.is_retweet(tweet: dict, text_field_key: str) -> bool

Check whether a tweet is a retweet.

Parameters
  • tweet (dict) – A dictionary representing a single tweet

  • text_field_key (str) – User-defined name for the “text” field

Returns

True if the tweet has been labeled as a retweet, False otherwise.

Return type

bool

Note

Check for automated retweets with the retweet flag and for manual retweets of the form “RT [original tweet]”
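
The two checks in the note can be sketched as follows. This is an illustrative stand-in rather than the library's implementation: the name `looks_like_retweet` is hypothetical, and the use of a `retweeted_status` key as the retweet flag is an assumption based on the Twitter v1.1 tweet object.

```python
def looks_like_retweet(tweet: dict, text_field_key: str) -> bool:
    """Hypothetical helper mirroring the two checks described in the note."""
    # Automated retweet: assumed to be marked by a "retweeted_status" key.
    if "retweeted_status" in tweet:
        return True
    # Manual retweet: text of the form "RT [original tweet]".
    text = tweet.get(text_field_key) or ""
    return text.startswith("RT ")
```

A tweet with a missing or empty text field is treated as not being a manual retweet in this sketch.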

cranetoolbox.importTools.transform.lighten_tweet(tweet: dict, text_field_key: str, id_field_key: str, date_field_key: str) -> (<class 'str'>, <class 'str'>, <class 'str'>)

Lighten a tweet by returning only the fields required for analysis.

Parameters
  • tweet (dict) – A parsed JSON tweet

  • text_field_key (str) – User-defined name for the “text” field

  • id_field_key (str) – User-defined name for the “id” field

  • date_field_key (str) – User-defined name for the “created_at” field

Returns

A tuple of the three values scraped from the passed tweet

Return type

tuple(str, str, str)
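
As a rough illustration of lightening, the sketch below pulls three fields out of a parsed tweet by their user-defined key names. The function name `pick_fields` is hypothetical, and both the tuple order (text, id, date) and the conversion to `str` are assumptions; the documentation above does not state them.

```python
from typing import Tuple


def pick_fields(
    tweet: dict, text_field_key: str, id_field_key: str, date_field_key: str
) -> Tuple[str, str, str]:
    """Hypothetical sketch: keep only the three fields needed for analysis."""
    # Assumed order: text, id, date (mirrors the parameter order).
    return (
        str(tweet[text_field_key]),
        str(tweet[id_field_key]),
        str(tweet[date_field_key]),
    )


tweet = {"id": 1, "text": "hi", "created_at": "Mon", "user": {"name": "a"}}
fields = pick_fields(tweet, "text", "id", "created_at")
```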

cranetoolbox.importTools.transform.matches_language_filter(tweet: dict, opts: cranetoolbox.importTools.transform.TransformationOptions) -> bool

Check whether a tweet is in the desired language. If the JSON key does not exist, the tweet is assumed to match the language filter (returns True).

Parameters
  • tweet (dict) – A dictionary representing a full JSON tweet (with no data removed)

  • opts (TransformationOptions) – Transformation options

Returns

True if the tweet matches the user-specified language filter, False otherwise

Return type

bool
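
The "missing key counts as a match" behaviour can be sketched as below. The function name `language_matches` is hypothetical, the language code is passed directly instead of through a TransformationOptions instance, and the `lang` key name is an assumption based on the Twitter v1.1 tweet object.

```python
def language_matches(tweet: dict, language_filter: str) -> bool:
    """Hypothetical sketch of the language check described above."""
    # Assumption: the language code is stored under the "lang" key.
    if "lang" not in tweet:
        # No language information: treat the tweet as matching.
        return True
    return tweet["lang"] == language_filter
```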

cranetoolbox.importTools.transform.parse_tweet(tweet: str) -> dict

Parse the passed JSON format tweet from str to dictionary.

Parameters

tweet (str) – JSON tweet as a string

Returns

Dictionary representing the JSON parse results of the passed tweet

Return type

dict
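
For reference, parsing a single JSON line into a dictionary with the standard library looks like the snippet below. Whether parse_tweet uses `json.loads` internally is an assumption; the sample line is made up for illustration.

```python
import json

# A made-up, minimal tweet serialized as one JSON line.
line = '{"id": 1, "text": "hello", "lang": "en"}'
tweet = json.loads(line)

# Malformed input raises json.JSONDecodeError, which a caller can catch
# and count as a parse failure.
try:
    json.loads("not json")
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True
```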

cranetoolbox.importTools.transform.process_files(file_list: List[str], opts: cranetoolbox.importTools.transform.TransformationOptions, csv_output_path: str) -> (<class 'int'>, <class 'int'>)

Top-level function to combine input set into a single CSV file.

Parameters
  • file_list (list(str)) – paths to files to be processed

  • opts (TransformationOptions) – An instance of the transformation options, used to control filtering and parsing of tweets

  • csv_output_path (str) – Full output path, folder, filename and extension

Returns

A tuple of (successes, failures): the number of lines written to the file and the number that failed

Return type

tuple(int, int)

cranetoolbox.importTools.transform.process_tar_file(file: str, opts: cranetoolbox.importTools.transform.TransformationOptions, csv_output_path: str) -> (<class 'int'>, <class 'int'>)

Process any uncompressed nested files contained within a single tar file.

Parameters
  • file (str) – Path to tar file

  • opts (TransformationOptions) – Transformation options

  • csv_output_path (str) – Output path for the combined CSV file

Returns

Pass/fail counts

Return type

tuple(int, int)

Warning

This cannot handle nested compression, i.e., a tar inside a tar.
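
Reading the regular files inside a tar archive line by line can be sketched with the standard `tarfile` module, as below. The generator name `iter_tar_lines` is hypothetical and UTF-8 encoding is assumed; like the function documented above, this sketch does not descend into archives nested inside the archive.

```python
import io
import tarfile
from typing import Iterator


def iter_tar_lines(path: str) -> Iterator[str]:
    """Hypothetical sketch: yield text lines from every regular file in a tar."""
    with tarfile.open(path, "r") as archive:
        for member in archive:
            # Skip directories, links, and other non-regular entries.
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Members are read as bytes; UTF-8 is an assumption here.
            for raw in handle:
                yield raw.decode("utf-8").rstrip("\n")
```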

cranetoolbox.importTools.transform.write_tweets_by_chunk(lines, csv_output_path: str, opts: cranetoolbox.importTools.transform.TransformationOptions) -> (<class 'int'>, <class 'int'>)

Process an arbitrary number of lines and save them to the CSV outfile

Parameters
  • lines (list(str) or buffer) – Lines of tweets to process and write to file

  • csv_output_path (str) – Full output path of CSV file, including filename and extension

  • opts (TransformationOptions) – Transformation options

Returns

Tuple of write (successes, failures) counts

Return type

tuple(int, int)
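
Writing rows to a CSV file in bounded chunks can be sketched as follows. The function name `write_in_chunks` and the `chunk_size` parameter are hypothetical; linking the chunk size to the `max_in_mem` option is an assumption, as is opening the output file in append mode.

```python
import csv
from itertools import islice
from typing import Iterable, Sequence


def write_in_chunks(rows: Iterable[Sequence[str]], csv_path: str, chunk_size: int) -> int:
    """Hypothetical sketch: append rows to a CSV, holding at most chunk_size in memory."""
    written = 0
    iterator = iter(rows)
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        while True:
            # Pull the next chunk; an empty list means the input is exhausted.
            chunk = list(islice(iterator, chunk_size))
            if not chunk:
                break
            writer.writerows(chunk)
            written += len(chunk)
    return written
```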