Querying AWS Athena and getting the results in Parquet format

Querying AWS Athena and getting the results in Parquet format

Tom Weiss, Wed 15 August 2018

At Dativa, we use Athena extensively to transform incoming data, typically writing data from the Athena results into new Athena tables in an ETL pipeline.

This is relatively straightforward to implement but it can be costly because Athena does not currently supporting executing queries to a compressed format. In fact the CSV output format created by Athena is not very friendly as an input format and requires custom SerDe configurations even to read it.

We have therefore extended the AthenaClient in our Dativa Tools package to add support for post-processing query results with a Glue job to transform the results into Athena.

It's simple to run, as the example script shows:

from dativa.tools.aws import S3Csv2Parquet, AthenaClient

scp = S3Csv2Parquet(region="us-east-1" template_location="s3://my-s3-path/template/")

ac = AthenaClient("us-east-1", "my db name", s3_parquet = scp)

ac.add_query(sql="SELECT * FROM my_table",
                 name="test parquet",
                 output_location="s3://my-s3-path/name/",
                 parquet=True)

ac.wait_for_completion()

ac.create_table({"S3Targets": [{"Path": "s3://my-s3-path/name"]}, table_name="name")

You can install dativa tools from PyPI using pip install dativatools

Enjoy!

Need help? Get in touch...

Sign up below and one of our data consultants will get right back to you

Other articles about Data Engineering


Dativa is a global consulting firm providing data consulting and engineering services to companies that want to build and implement strategies to put data to work. We work with primary data generators, businesses harvesting their own internal data, data-centric service providers, data brokers, agencies, media buyers and media sellers.

145 Marina Boulevard
San Rafael
California
94901

Registered in Delaware

Thames Tower
Station Road
Reading
RG1 1LX

Registered in England & Wales, number 10202531