dativa.tools.pandas.CSVHandler
A wrapper for pandas CSV handling to read and write DataFrames with consistent CSV parameters by sniffing the parameters automatically. Includes reading a CSV into a DataFrame, and writing it out to a string. Files can be read/written from/to local file system or AWS S3.
For S3 access suitable credentials should be available in '~/.aws/credentials' or the AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY environment variables.
CSVHandler
- base_path - the base path for any CSV file read, defaults to ""
- detect_parameters - whether the encoding of the CSV file should be automatically detected, defaults to False
- csv_encoding - the encoding of the CSV files, defaults to UTF-8
- csv_delimiter - the delimiter used in the CSV, defaults to ','
- csv_header - the index of the header row, or -1 if there is no header
- csv_skiprows - the number of rows at the beginning of file to skip
- csv_quotechar - the quoting character to use, defaults to "
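Values such as the delimiter, quote character and header row can often be guessed from a sample of the file. The following is a minimal sketch of that idea using the standard library's csv.Sniffer. It is an illustration only, not CSVHandler's actual detection logic. The function name sniff_csv_parameters is hypothetical, and the returned keys simply mirror the parameter names listed above.

```python
import csv


def sniff_csv_parameters(sample):
    """Guess delimiter, quote character and header row from a text sample.

    Hypothetical helper for illustration; not part of dativa.tools.
    """
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)
    return {
        "csv_delimiter": dialect.delimiter,
        "csv_quotechar": dialect.quotechar,
        # Follow the convention above: 0 for a header on the first row, -1 for none
        "csv_header": 0 if sniffer.has_header(sample) else -1,
    }


# A small semicolon-delimited sample with quoted fields
sample = 'name;age\n"Smith; John";42\n"Doe; Jane";37\n'
print(sniff_csv_parameters(sample))
```

Sniffing is heuristic, so it is worth passing explicit parameters whenever the file format is already known.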
load_df
Opens a CSV file using the specified configuration for the class and raises an exception if the encoding is unparseable. Detects if base_path is an S3 location and loads data from there if required.
Parameters:
- file - File path. Should begin with 's3://' to load from S3 location.
- force_dtype - Force data type for data or columns, defaults to None
Returns:
- dataframe
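To make the S3 detection concrete, here is a minimal sketch of how a base path and a file name might be combined and checked for an 's3://' prefix. The helper resolve_location and its return format are hypothetical, and the joining rule is an assumption; the library's own path handling may differ.

```python
from urllib.parse import urlparse


def resolve_location(base_path, file):
    """Join base_path and file, then report whether the result is on S3.

    Hypothetical helper for illustration; not part of dativa.tools.
    """
    if base_path:
        # Assumed joining rule: exactly one slash between base path and file
        full_path = base_path.rstrip("/") + "/" + file.lstrip("/")
    else:
        full_path = file
    parsed = urlparse(full_path)
    if parsed.scheme == "s3":
        # For s3://bucket/key, netloc is the bucket and path is the key
        return {"s3": True, "bucket": parsed.netloc, "key": parsed.path.lstrip("/")}
    return {"s3": False, "path": full_path}


print(resolve_location("s3://my-bucket-name/", "my-file-name.csv"))
print(resolve_location("", "my-file-name.csv"))
```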
save_df
Writes a dataframe to a file as a formatted string, using the specified configuration for the class. Detects if base_path is an S3 location and saves data there if required.
Parameters:
- df - Dataframe to save
- file - File path. Should begin with 's3://' to save to an S3 location.
df_to_string
Returns a formatted string from a dataframe using the specified configuration for the class.
Parameters:
- df - Dataframe to convert to string
Returns:
- string
Example code
from dativa.tools.pandas import CSVHandler
# Create the CSV handler
csv = CSVHandler(base_path='s3://my-bucket-name/')
# Load a file
df = csv.load_df('my-file-name.csv')
# Create a string
str_df = csv.df_to_string(df)
# Save a file
csv.save_df(df, 'another-path/another-file-name.csv')
Support functions for Pandas
- dativa.tools.pandas.is_numeric - a function to check whether a series or string is numeric
- dativa.tools.pandas.string_to_datetime - a function to convert a string, or series of strings to a datetime, with a strptime date format that supports nanoseconds
- dativa.tools.pandas.datetime_to_string - a function to convert a datetime, or a series of datetimes to a string, with a strptime date format that supports nanoseconds
- dativa.tools.pandas.format_string_is_valid - a function to confirm whether a strptime format string returns a date
- dativa.tools.pandas.get_column_name - a function to return the name of a column from a passed column name or index.
- dativa.tools.pandas.get_unique_column_name - a function to return a unique column name when adding new columns to a DataFrame
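Nanosecond support matters because the %f directive in Python's built-in datetime.strptime accepts at most six fractional digits, so a nine-digit fraction raises a ValueError. The following is a minimal sketch of one workaround using only the standard library. The helper parse_nanosecond_timestamp is hypothetical and is not how string_to_datetime is implemented. It assumes the format contains no '.' and that the fraction follows the first '.' in the value.

```python
from datetime import datetime


def parse_nanosecond_timestamp(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse 'timestamp.fraction' text into (datetime, nanoseconds).

    Hypothetical helper: fmt describes the text before the first '.',
    and the fraction may have up to nine digits.
    """
    whole, _, fraction = value.partition(".")
    if len(fraction) > 9 or (fraction and not fraction.isdecimal()):
        raise ValueError("fraction must be at most nine digits: %r" % value)
    # Right-pad so that '.5' means 500000000 nanoseconds
    nanoseconds = int(fraction.ljust(9, "0")) if fraction else 0
    # datetime only stores microseconds, so the full value is returned separately
    parsed = datetime.strptime(whole, fmt).replace(microsecond=nanoseconds // 1000)
    return parsed, nanoseconds


parsed, nanos = parse_nanosecond_timestamp("2018-03-04 05:06:07.123456789")
print(parsed, nanos)
```

Returning the nanoseconds alongside the datetime avoids silently losing the last three digits.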
dativa.tools.pandas.ParquetHandler
The ParquetHandler class reads a Parquet file from a specified path into a pandas dataframe for analysis and modification, and writes dataframes back out to Parquet.
- param base_path : The base location where the Parquet files are stored.
- type base_path : str
- param row_group_size : The size of the row groups while writing out the parquet file.
- type row_group_size : int
- param use_dictionary : Specify whether to use dictionary encoding or not
- type use_dictionary : bool
- param use_deprecated_int96_timestamps : Write nanosecond resolution timestamps to INT96 Parquet format.
- type use_deprecated_int96_timestamps : bool
- param coerce_timestamps : Cast timestamps to a particular resolution. Valid values: {None, 'ms', 'us'}
- type coerce_timestamps : str
- param compression : Specify the compression codec.
- type compression : str
from dativa.tools.pandas import CSVHandler, ParquetHandler
# Read a parquet file
pq_obj = ParquetHandler()
df_parquet = pq_obj.load_df('data.parquet')
# Save a CSV file to Parquet
csv = CSVHandler(csv_delimiter=",")
df = csv.load_df('emails.csv')
pq_obj = ParquetHandler()
pq_obj.save_df(df, 'emails.parquet')
Related documentation
- Querying AWS Athena and getting the results in Parquet format
- Dativa tools data validation - A library to perform basic validation on incoming data files
- Dativa tools pandas extensions - A set of extensions to pandas for consistent CSV processing and better date time handling
- Dativa tools Athena client - The AthenaClient wraps the AWS boto client in an easy to use wrapper to create and manage tables in S3