Dativa tools S3Csv2Parquet

An easy-to-use module for converting csv files on S3 to Parquet using AWS Glue jobs. For S3 and Glue access, suitable credentials should be available in '~/.aws/credentials' or in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
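
For reference, the shared credentials file follows the standard AWS format; the profile name and key values shown here are placeholders:

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY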

S3Csv2Parquet

Parameters:

  • region - str, AWS region in which the Glue job is to be run
  • template_location - str, S3 folder in which the template scripts are located or need to be copied, in the format s3://bucketname/folder
  • glue_role - str, name of the Glue role to be assigned to the Glue job
  • max_jobs - int, default 5, maximum number of jobs that can run concurrently in the queue
  • retry_limit - int, default 3, maximum number of retries allowed per job on failure

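As a sketch, the constructor could be called with all of these parameters; the keyword names assume the parameter names listed above, and the role name is a placeholder:

from dativa.tools.aws import S3Csv2Parquet

# Keyword names assume the documented parameters; "my-glue-role" is a placeholder IAM role
csv2parquet_obj = S3Csv2Parquet(region="us-east-1",
                                template_location="s3://my-bucket/templatefolder",
                                glue_role="my-glue-role",
                                max_jobs=5,
                                retry_limit=3)
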
convert

Parameters:

  • csv_path - str or list of str, S3 location of the csv file in the format s3://bucketname/folder/file.csv; pass a list to convert multiple files
  • output_folder - str, default is the folder in which the csv files are located, S3 location to which the parquet files should be written, in the format s3://bucketname/folder
  • schema - list of tuples, if not specified the schema is inferred from the file, format [(column1, datatype), (column2, datatype)]; supported datatypes are boolean, double, float, integer, long, null, short, string
  • name - str, default 'parquet_csv_convert', name to be assigned to the Glue job
  • allocated_capacity - int, default 2, the number of AWS Glue data processing units (DPUs) to allocate to this job; from 2 to 100 DPUs can be allocated
  • delete_csv - boolean, default False, if set the source csv files are deleted after successful completion of the job
  • separator - character, default ',', delimiter character in the csv files
  • withHeader - int, default 1, specifies whether to treat the first line as a header; can take the values 0 or 1
  • compression - str, default None, if not specified no compression is applied; can take the values snappy, gzip, and lzo
  • partition_by - list of str, default None, list of columns to partition the data by
  • mode - str, default 'append', options are overwrite (remove data from output_folder before writing out the converted files), append (write to output_folder without deleting existing data), and ignore (silently ignore the operation if data already exists)

Example

from dativa.tools.aws import S3Csv2Parquet

# Initial setup
csv2parquet_obj = S3Csv2Parquet("us-east-1", "s3://my-bucket/templatefolder")

# Create/update a glue job to convert csv files and execute it
csv2parquet_obj.convert("s3://my-bucket/file_to_be_converted_1.csv")
csv2parquet_obj.convert("s3://my-bucket/file_to_be_converted_2.csv")

# Wait for completion of jobs 
csv2parquet_obj.wait_for_completion()
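
As a further sketch, a conversion using the optional parameters might look like this; the column names, output folder and values are placeholders, and the keyword names assume the parameters listed above:

# Hypothetical call with an explicit schema, compression and partitioning; names and values are placeholders
csv2parquet_obj.convert("s3://my-bucket/file_to_be_converted_3.csv",
                        output_folder="s3://my-bucket/parquet-output",
                        schema=[("event_date", "string"), ("views", "integer")],
                        compression="snappy",
                        partition_by=["event_date"],
                        mode="overwrite")

csv2parquet_obj.wait_for_completion()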

Related documentation

  • Querying AWS Athena and getting the results in Parquet format
  • Dativa tools data validation - a library to perform basic validation on incoming data files
  • Dativa tools pandas extensions - a set of extensions to pandas for consistent CSV processing and better date time handling
  • Dativa tools Athena client - the AthenaClient wraps the AWS boto client in an easy to use wrapper to create and manage tables in S3
