Optimizing Amazon Redshift for Predictive Data Analytics

Tom Weiss, Fri 03 November 2017

Our data engineers have built a lot of data warehouses that run on Amazon Redshift. Redshift can be a great platform when used efficiently, but like many other platforms, it can be difficult to get it to work well.

We're now at the stage where we've collected enough information about what works and what doesn't to pull our best tips together into this post.

Optimize your schema

Many data engineers start out by dumping data into Redshift with little thought to the schema, hoping the columnar database engine will magically work out the best way to query the database. We find that this makes it very difficult to run predictive analytics, and your data scientists are likely to spend more time waiting for queries to run than creating new models.

If you don't want to start a battle between your data scientists and your data engineers, we recommend implementing an efficient schema based on a standard data warehouse star schema. Although Redshift doesn't have indexes in the way that traditional databases do, you need to pay considerable attention to distribution keys and sort keys (there is an example table definition after the list below).

  • Distribution keys - The distribution key defines how data is distributed amongst the different nodes. You always want a uniform distribution of data, so your distribution key should typically be a column with millions of distinct values, most commonly a device ID or a user ID. Tables that don't contain a device ID or a user ID should instead have a distribution style of ALL so they can be joined efficiently to the main tables.

  • Sort keys - The sort keys are the closest equivalent to a traditional database index. If you have data ordered by time, then the time field should almost always be the first sort key. If there are other ranges you filter on, then these should also be included in the sort key. Beyond that, joins can be made more efficient by putting the sort key on the primary key of the table you are joining to.

  • Sort key styles - We always recommend starting off with COMPOUND sort keys, as INTERLEAVED sort keys are usually less efficient.

  • Foreign keys - Although Amazon Redshift does not enforce foreign key constraints, they are used by the query optimizer, and as such we always include them in our table definitions.
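
As a rough illustration of how these pieces fit together, here is a minimal DDL sketch; the table and column names are hypothetical, so adapt the distribution key and sort key to your own event and dimension data:

CREATE TABLE dim_campaign (
    campaign_id   INT PRIMARY KEY,
    campaign_name VARCHAR(255)
)
DISTSTYLE ALL;                     -- no user or device ID, so replicate to every node

CREATE TABLE fact_events (
    event_id    BIGINT,
    user_id     BIGINT NOT NULL,   -- high-cardinality column used as the distribution key
    event_date  DATE NOT NULL,     -- time column goes first in the sort key
    campaign_id INT,
    event_type  VARCHAR(32),
    FOREIGN KEY (campaign_id) REFERENCES dim_campaign (campaign_id)  -- not enforced, but used by the optimizer
)
DISTKEY (user_id)
COMPOUND SORTKEY (event_date, campaign_id);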

Manage your disk space

When you are running predictive analytics on the data warehouse, your Amazon Redshift queries can be very disk intensive, so good data engineers will also ensure that there is always good disk space headroom. Although it's easy to add new nodes to the data warehouse, this can become very expensive. The more data Amazon Redshift needs to read from disk, the longer any query is going to take to run.
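
One simple way to keep an eye on that headroom is to query the stv_partitions system table (visible to superusers), which reports used and total capacity per disk. A minimal sketch:

-- Approximate percentage of disk space used on each node
SELECT
    owner AS node,
    ROUND(100.0 * SUM(used) / SUM(capacity), 1) AS pct_disk_used
FROM stv_partitions
GROUP BY owner
ORDER BY owner;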

Table bloat can be caused by two things: unsorted rows and sub-optimal compression encoding. Firstly, if you've been inserting data into the table in a different order from the table's sort order, or if you've been updating data, then you are going to need to vacuum the table.

Run the following query to check for tables with a high percentage of unsorted rows and then run a VACUUM command on that table:

SELECT
    TRIM(pgdb.datname) AS dbase_name,
    TRIM(pgn.nspname) as schemaname,
    TRIM(a.name) AS tablename,
    COALESCE(b.mbytes,0) AS megabytes,
    a.rows AS rowcount,
    CASE WHEN a.rows = 0 then 0 ELSE ROUND(100.0 * a.unsorted_rows / a.rows) END AS pct_unsorted
FROM
    (SELECT db_id ,id ,name ,SUM(rows) AS rows ,SUM(rows)-SUM(sorted_rows) AS unsorted_rows
    FROM stv_tbl_perm GROUP BY db_id,id, name ) AS a
    INNER JOIN pg_class AS pgc ON pgc.oid = a.id
    INNER JOIN pg_namespace AS pgn ON pgn.oid = pgc.relnamespace
    INNER JOIN pg_database AS pgdb ON pgdb.oid = a.db_id
    LEFT OUTER JOIN ( SELECT tbl ,COUNT(*) AS mbytes FROM stv_blocklist GROUP BY tbl ) AS b
    ON a.id=b.tbl
WHERE pgc.relowner > 1
ORDER BY 4 DESC;
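
For example, if the query above flags a table with a high pct_unsorted (here the hypothetical fact_events table), you would then run something like:

-- Re-sort the rows and reclaim space left by updates and deletes
VACUUM FULL fact_events;

-- Refresh the planner statistics once the table is sorted
ANALYZE fact_events;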

If all of your tables are sorted correctly but some of them are still large, then you can look at optimizing the compression encoding. This is very simple; you just need to run:

ANALYZE COMPRESSION <table_name>

This will show you which columns could save space with a different encoding. You can change the encoding by using the Amazon Redshift Column Encoding Utility provided by AWS. Note that this script will not update any views or table-specific permissions for users, so you will need to recreate these separately and also delete the backup tables it creates.
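
If you want to see which encodings are currently applied before running the utility, you can check the pg_table_def catalog view. A sketch, again using the hypothetical fact_events table (note that pg_table_def only shows tables in schemas on your search_path):

-- Current column encodings, sort key positions and distribution key flags for one table
SELECT "column", type, encoding, sortkey, distkey
FROM pg_table_def
WHERE tablename = 'fact_events';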

If you've run VACUUM and have good compression on the table, then the only conclusion is that it's simply a large table. You then have three choices:

  1. Add more nodes to the data warehouse. Although simple, this is also the most expensive option. It will resolve the space issues immediately.
  2. Look at changing the node type. If you are using DC2 nodes but you are blocked on disk space, then you should look at switching to DS2 nodes. They have significantly fewer CPU resources but a lot more disk space.
  3. UNLOAD old data to S3 and set up Redshift Spectrum to provide access to this older data (see the sketch after this list). This will slow down access to older data, but it also means you only pay for the storage in S3, which is a lot cheaper than Amazon Redshift. As most predictive analytics models are updated using historical data on an occasional basis, this can be a good approach to retain performance while managing the cost of the data warehouse.
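
As a sketch of option 3, using hypothetical bucket, IAM role, and table names, the old data can be unloaded to S3 and then exposed again through a Redshift Spectrum external schema:

-- Archive 2015 data to S3 as gzipped, pipe-delimited files
UNLOAD ('SELECT * FROM fact_events WHERE event_date < ''2016-01-01''')
TO 's3://example-archive-bucket/fact_events/2015/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
DELIMITER '|' GZIP ALLOWOVERWRITE;

-- Register an external schema backed by the data catalog so Spectrum can query the archive
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'archive'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define the archived table over the unloaded files
CREATE EXTERNAL TABLE spectrum.fact_events_2015 (
    event_id    BIGINT,
    user_id     BIGINT,
    event_date  DATE,
    campaign_id INT,
    event_type  VARCHAR(32)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://example-archive-bucket/fact_events/2015/';

Once the unloaded data has been verified in S3, the old rows can be deleted from the local table and the space reclaimed with another VACUUM.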

Optimizing slow queries

However well optimized your schema is, there are always going to be users who can work out new and unusual ways to run slow queries on your data warehouse. As a general guide, if a query needs to access a lot of data (e.g., reading every record from a historical table) then it's going to be slow. Anything that doesn't require a lot of data should execute quickly if it's written correctly.

This is the query we run daily to find out which users are running the slowest queries:

SELECT
    DATE_TRUNC('hour', service_class_start_time) AS query_time,
    u.usename,
    total_queue_time / 1000000 AS queue_time,
    total_exec_time / 1000000 AS execution_time,
    (total_queue_time + total_exec_time) / 1000000 AS total_time,
    TRIM(querytxt) AS sqlquery
FROM stl_query q
INNER JOIN pg_user u ON q.userid = u.usesysid
INNER JOIN stl_wlm_query w ON q.query = w.query AND q.userid = w.userid
WHERE starttime >= '2016-12-15 00:00'
AND querytxt LIKE 'SELECT%'
AND (total_queue_time + total_exec_time) > 60 * 1000000
ORDER BY 4 DESC;

Note that Amazon Redshift only stores the first 4,000 characters of each query, so you may need to contact the users directly to get the full query text.

We then run through the following process:

  • Check each join in the query
    • Are joins implemented correctly?
    • Where there are compound primary keys, is each part of the key used in the join?
    • Are sort keys always used for filtering?
    • Where sort keys are used in a join, add explicit date filters on each join
  • Check the execution plan by running EXPLAIN (SELECT SLOW QUERY...)
    • Check which part of the query carries the most cost
    • Check for any redistribution steps such as DS_DIST_INNER, DS_DIST_OUTER, DS_BCAST_INNER, DS_DIST_ALL_INNER, and DS_DIST_BOTH. These cause massive amounts of data to be shifted between nodes.
    • Try making temporary tables with an appropriate distribution style to prevent these redistributions (see the sketch after this list)
  • Check use of sort keys
    • Amazon Redshift does not use the sort key if it's wrapped in a function. Look out for WHERE DATE_PART('month', sort_key) = 12; this should be replaced with WHERE sort_key BETWEEN '2016-12-01' AND '2017-01-01'
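
As an example of the temporary-table approach mentioned above, with hypothetical fact_events and purchases tables joined on user_id, staging one side with a matching distribution key keeps the join local to each node:

-- Stage the rows we need with a distribution key that matches the join column
CREATE TEMP TABLE tmp_purchases
DISTKEY (user_id)
SORTKEY (purchase_date)
AS
SELECT user_id, purchase_date, amount
FROM purchases
WHERE purchase_date >= '2016-12-01';

-- The join now happens without redistributing either table across the nodes
SELECT e.user_id, COUNT(*) AS events, SUM(p.amount) AS revenue
FROM fact_events e
INNER JOIN tmp_purchases p ON p.user_id = e.user_id
WHERE e.event_date >= '2016-12-01'
GROUP BY e.user_id;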

For each database we manage, we create a list of more specific things to look at when optimizing, based on the exact schema, distribution keys, and sort keys.

Get in touch to find out how we can help



Dativa is a global consulting firm providing data consulting and engineering services to companies that want to build and implement strategies to put data to work. We work with primary data generators, businesses harvesting their own internal data, data-centric service providers, data brokers, agencies, media buyers and media sellers.
