Dativa tools pandas extensions

dativa.tools.pandas.CSVHandler

A wrapper around pandas CSV handling that reads and writes DataFrames with consistent CSV parameters, sniffing the parameters automatically where required. It supports reading a CSV into a DataFrame, writing a DataFrame out to a file, and converting a DataFrame to a string. Files can be read from and written to the local file system or AWS S3.

For S3 access, suitable credentials should be available in '~/.aws/credentials' or in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
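
For example, the environment variables can be set from Python before the handler is used (a minimal sketch; the values shown are placeholders, not real credentials):

import os

# Placeholder credentials - in practice use ~/.aws/credentials or a
# properly configured environment
os.environ['AWS_ACCESS_KEY_ID'] = 'my-access-key-id'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'my-secret-access-key'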

CSVHandler

  • base_path - the base path for any CSV file read, defaults to ""
  • detect_parameters - whether the parameters of the CSV file should be automatically detected, defaults to False
  • csv_encoding - the encoding of the CSV files, defaults to UTF-8
  • csv_delimiter - the delimiter used in the CSV, defaults to ','
  • csv_header - the index of the header row, or -1 if there is no header
  • csv_skiprows - the number of rows at the beginning of file to skip
  • csv_quotechar - the quoting character to use, defaults to "
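
For example, a handler for pipe-delimited files with no header row might be configured as follows (a sketch based on the parameters above; the settings shown are illustrative):

from dativa.tools.pandas import CSVHandler

# Pipe-delimited files with no header row, read relative to a local base path
csv = CSVHandler(base_path='/data/csv/',
                 csv_delimiter='|',
                 csv_header=-1)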

load_df

Opens a CSV file using the specified configuration for the class, and raises an exception if the file cannot be parsed with the configured encoding. Detects if base_path is an S3 location and loads data from there if required.

Parameters:

  • file - File path. Should begin with 's3://' to load from S3 location.
  • force_dtype - Force data type for data or columns, defaults to None

Returns:

  • dataframe
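
For example, to pin a column's type at load time (a sketch; this assumes force_dtype accepts the same mapping as pandas' dtype argument, and the file and column names are hypothetical):

from dativa.tools.pandas import CSVHandler

csv = CSVHandler()

# Read the hypothetical 'user_id' column as a string rather than
# letting pandas infer its type
df = csv.load_df('my-file-name.csv', force_dtype={'user_id': str})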

save_df

Writes a dataframe out as a formatted CSV file using the specified configuration for the class. Detects if base_path is an S3 location and saves data there if required.

Parameters:

  • df - Dataframe to save
  • file - File path. Should begin with 's3://' to save to an S3 location.

df_to_string

Returns a formatted string from a dataframe using the specified configuration for the class.

Parameters:

  • df - Dataframe to convert to string

Returns:

  • string

Example code

from dativa.tools.pandas import CSVHandler

# Create the CSV handler
csv = CSVHandler(base_path='s3://my-bucket-name/')

# Load a file
df = csv.load_df('my-file-name.csv')

# Create a string
str_df = csv.df_to_string(df)

# Save a file
csv.save_df(df, 'another-path/another-file-name.csv')

Support functions for Pandas

  • dativa.tools.pandas.is_numeric - a function to check whether a series or string is numeric
  • dativa.tools.pandas.string_to_datetime - a function to convert a string, or series of strings to a datetime, with a strptime date format that supports nanoseconds
  • dativa.tools.pandas.datetime_to_string - a function to convert a datetime, or a series of datetimes to a string, with a strptime date format that supports nanoseconds
  • dativa.tools.pandas.format_string_is_valid - a function to confirm whether a strptime format string returns a date
  • dativa.tools.pandas.get_column_name - a function to return the name of a column from a passed column name or index
  • dativa.tools.pandas.get_unique_column_name - a function to return a unique column name when adding new columns to a DataFrame
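
A short sketch of how some of these helpers might be used (the exact signatures are assumed from the descriptions above, and the sample values and date format are illustrative):

import pandas as pd
from dativa.tools.pandas import is_numeric, string_to_datetime, datetime_to_string

# Check whether a string or a series is numeric
print(is_numeric('123'))
print(is_numeric(pd.Series(['1', '2', '3'])))

# Convert strings to datetimes and back again; this assumes %f in the
# extended strptime format covers the nanosecond fraction
dates = string_to_datetime(pd.Series(['2018-01-01 00:00:00.123456789']),
                           '%Y-%m-%d %H:%M:%S.%f')
strings = datetime_to_string(dates, '%Y-%m-%d %H:%M:%S.%f')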

dativa.tools.pandas.ParquetHandler

The ParquetHandler class reads a parquet file from a given path into a pandas dataframe for analysis and modification, and writes dataframes back out to parquet.

  • base_path (str) - the base location where the parquet files are stored
  • row_group_size (int) - the size of the row groups used when writing out the parquet file
  • use_dictionary (bool) - whether to use dictionary encoding
  • use_deprecated_int96_timestamps (bool) - whether to write nanosecond-resolution timestamps in the INT96 Parquet format
  • coerce_timestamps (str) - cast timestamps to a particular resolution; valid values are None, 'ms' and 'us'
  • compression (str) - the compression codec to use

Example code

from dativa.tools.pandas import CSVHandler, ParquetHandler

# Read a parquet file
pq_obj = ParquetHandler()
df_parquet = pq_obj.load_df('data.parquet')

# Save a CSV file to parquet
csv = CSVHandler(csv_delimiter=",")
df = csv.load_df('emails.csv')
pq_obj = ParquetHandler()
pq_obj.save_df(df, 'emails.parquet')
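
The writer options listed above can be combined when constructing the handler, for instance as below (a sketch assuming the parameters are passed as keyword arguments; 'snappy' is a common parquet codec, though the supported values are not listed here):

# Smaller row groups, dictionary encoding, ms timestamps and snappy compression
pq_obj = ParquetHandler(row_group_size=10000,
                        use_dictionary=True,
                        coerce_timestamps='ms',
                        compression='snappy')
pq_obj.save_df(df, 'emails-snappy.parquet')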

