A lot of our clients use AWS Redshift as the ultimate destination for their data pipeline, and when Amazon launched Redshift Spectrum, our data engineering team wondered whether we could use this technology to provide high-performance throughput without having to load the data into Redshift at all.
This article covers our analysis of the different approaches that can be taken and which of them offers the best performance.
How do we currently load data?
Extract, Transform, and Load (ETL) into Amazon Redshift
The traditional ETL approach works well for many datasets, but for large data volumes, it doesn't utilize the power of a high-end data engine such as Amazon Redshift during the transformation process. In the diagram below, all of the transformation processing is done before the data even reaches the database:
In a pure ETL process all of the transformation is done before the data loads into the database, so the database resources can be reserved for running queries and analytics.
Extract, Load, and Transform (ELT) into Amazon Redshift
When we are assigning keys within the database and denormalizing (for example, into a star schema), we need to run transformations after we've loaded into Redshift anyway, so it is more logical to perform all of our processing in the database.
Extract, Load then Transform is our most common data pipeline model. In the final stages of ELT, we load data into staging tables, and the final transformations run as SQL in the database itself. This is the model we can most easily optimize with Redshift Spectrum, because we currently spend a significant amount of time loading the data before actually running the transformations.
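A minimal sketch of one ELT batch in this model (all schema, table, column, bucket, and role names here are illustrative, not from our actual pipelines):

```sql
-- 1. Load the raw batch from S3 into a staging table.
COPY staging.events_raw
FROM 's3://my-data-lake/events/batch_2024_01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV;

-- 2. Transform inside Redshift: assign keys, denormalize, load the target table.
INSERT INTO analytics.events
SELECT d.customer_key,
       e.event_type,
       e.event_ts,
       e.amount
FROM staging.events_raw e
JOIN analytics.dim_customer d ON d.customer_id = e.customer_id;

-- 3. Clear the staging table ready for the next batch.
TRUNCATE staging.events_raw;
```

The COPY step is the load latency we'd like to eliminate; the INSERT ... SELECT is the in-database transformation that stays the same either way.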
ELT with Redshift Spectrum
With Amazon's Redshift Spectrum service, we can run Redshift SQL queries against data stored in an S3 data lake. We've already used Spectrum to create one-off proof of concept systems for clients for relatively small datasets, as it allows us to quickly start the data science without the need for lots of data engineering. The trade-off is that query times are longer, but for these proof-of-concept systems, this is not a problem.
To use Redshift Spectrum for ELT, we'd replace the staging tables in the database with a Spectrum schema. In theory, Redshift Spectrum should allow us to streamline this process by removing the need to physically load data into the staging tables.
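A sketch of what that replacement looks like, using the documented Redshift Spectrum DDL (the schema, table, Glue database, bucket, and role names are hypothetical):

```sql
-- Register an external schema backed by the AWS Glue Data Catalog.
CREATE EXTERNAL SCHEMA spectrum_staging
FROM DATA CATALOG
DATABASE 'staging_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over the files in S3; there is no COPY step.
CREATE EXTERNAL TABLE spectrum_staging.events_raw (
    customer_id BIGINT,
    event_type  VARCHAR(32),
    event_ts    TIMESTAMP,
    amount      DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/batch_2024_01/';

-- The transformation now reads straight from S3 instead of a loaded staging table.
INSERT INTO analytics.events
SELECT d.customer_key, e.event_type, e.event_ts, e.amount
FROM spectrum_staging.events_raw e
JOIN analytics.dim_customer d ON d.customer_id = e.customer_id;
```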
In practice, everything depends on the query performance of Spectrum compared to loading the data into Redshift in the traditional manner.
Testing Redshift Spectrum performance
Amazon gives some guidance on optimizing data to improve Redshift Spectrum performance, so we decided to isolate some of these factors and test them individually to see how close we could push the performance to match that of the same data loaded directly into Redshift. Amazon's recommendations include:
- Using Parquet formatted data files.
- Keeping file sizes between 100 MB and 1 GB.
- Using the fewest columns possible in queries.
- Writing SQL that uses filters and aggregations that are eligible to be pushed to the Redshift Spectrum layer.
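To illustrate the last point, here is a hedged example of a query whose work is eligible for pushdown (the table and column names are hypothetical): the WHERE filter and the COUNT/SUM aggregates can be evaluated in the Spectrum layer, so only the small aggregated result, rather than the raw rows, is returned to the Redshift cluster.

```sql
SELECT event_type,
       COUNT(*)    AS events,
       SUM(amount) AS total_amount
FROM spectrum_staging.events_raw
WHERE event_ts >= '2024-01-01'
GROUP BY event_type;
```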
All of our queries comply with the last two points, so we concentrated on investigating file format and file size against dataset size. As we are looking to use Redshift Spectrum for ELT, it is effectively acting as a staging area for smaller datasets, which means that the number of rows involved is only the batch size loaded at any one time, typically up to around 5 million rows.
We therefore tested the following scenarios:
- File format:
  - Raw CSV
  - GZip compressed
  - Parquet formatted
- File size:
  - Small: below 100 MB
  - Medium: above 100 MB but below 1 GB
- Dataset size:
  - Small: approx 35,000 rows, 15 columns (VARCHARs and INTs)
  - Medium: approx 5,000,000 rows, 15 columns (VARCHARs and INTs)
- Query complexity:
  - Simple1: two columns, WHERE clause selects ~90% of rows
  - Simple2: two columns, WHERE clause selects ~10% of rows
  - Complex1: join, three columns, WHERE clause selects ~90% of rows
  - Complex2: join, three columns, WHERE clause selects ~10% of rows
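As a rough sketch of the query shapes above (the actual test tables and predicates aren't published here, so these names are purely illustrative):

```sql
-- Simple1: two columns, a filter that lets ~90% of rows through.
SELECT col_a, col_b
FROM spectrum_test.dataset
WHERE status <> 'rare_value';

-- Complex2: a join, three columns, a filter that lets ~10% of rows through.
SELECT t.col_a, t.col_b, d.col_c
FROM spectrum_test.dataset t
JOIN spectrum_test.dimension d ON d.id = t.dim_id
WHERE t.created_at >= '2024-06-01';
```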
Query test results
We loaded the test data directly into Redshift to give a baseline performance against which to normalize the rest of the results. We applied our test SQL to all the possible optimization variants and tabulated the results. We ran each query 100 times and averaged the response times to minimize short-term temporal effects such as cluster load, code compilation, and result caching:
The raw results can be downloaded here. The normalized values show how long a query took compared to the baseline (e.g., 2.0 means a Spectrum query took twice as long as the equivalent loaded-data query, or 200% of the time).
Besides confirming that these optimization techniques do work, the results held a few surprises for us. For instance, we didn't predict that file size would be far more significant than file format in affecting query times, so this is an easy win for improving performance. Also, changing the file format only matters substantially for larger datasets, meaning that for our smaller proofs of concept there's no value in converting the data to Parquet. For the medium-sized dataset, GZipping can be nearly as effective as Parquet under some circumstances, so this could save valuable processing time for some applications (although for even larger datasets we expect Parquet to provide more significant optimization).
The most notable surprise of all, though, was that by optimizing the file size and using Parquet formatting, it's almost possible to match native Redshift performance under some circumstances. This speed bodes well for production use of Redshift Spectrum, although the processing time and cost of converting the raw CSV files to Parquet need to be taken into account as well.
We conclude that Redshift Spectrum can provide comparable ELT query times to standard Redshift. As the latency of loading data into Redshift can be substantial, we see Redshift Spectrum as a more performant option under some circumstances.
To get the best performance, you should:
- Split files into sizes in the hundreds of MB.
- For batches of millions of rows, GZip and Parquet perform much better than raw CSV.
- If your data is not already in Parquet format, it is probably not worth transforming it just for ELT.