This is a cross-posting from https://www.intermix.io/blog/top-14-performance-tuning-techniques-for-amazon-redshift/
Amazon Redshift is a data warehouse that makes it fast, simple and cost-effective to analyze petabytes of data across your data warehouse and data lake. Amazon Redshift can deliver 10x the performance of other data warehouses by using a combination of machine learning, massively parallel processing (MPP), and columnar storage on SSD disks.
But even with all that power, it’s possible that you’ll see uneven query performance or challenges in scaling workloads. Performance optimization for Amazon Redshift is a matter of doing some thoughtful up-front planning and ongoing monitoring as your data volume, users and cluster grow.
Running a Cluster that’s Fast, Cheap and Easy to scale
In this article, we’re giving you our 15 best practices for performance tuning Redshift. With these practices, you’ll have a cluster that is faster, cheaper, and easier to scale than any other product on the market.
Here are the 15 performance techniques in summary:
Table of Contents
- Create Custom Workload Manager (WLM) Queues
- Use Change Data Capture (CDC)
- Use Column Encoding
- Don’t ANALYZE on Every COPY
- Don’t Use Redshift as an OLTP Database
- Use DISTKEYs Only When Necessary to Join Tables
- Maintain Accurate Table Statistics
- Write Smarter Queries
- Avoid Row Skew
- Use Short Query Acceleration (SQA)
- Compress Data in S3
- Manage Very Long Tables
- Use RA3 Nodes for Independent Compute and Storage Scaling
- Use Amazon Redshift Spectrum for Infrequently Used Data
- Use Columnar Formats for S3 Data
- Get the Most Out of Redshift
How we Use Amazon Redshift
At intermix.io, we use Amazon Redshift as part of our core platform. This blog post compiles our learnings from over three years of operating several large Redshift clusters at a high scale.
intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their mission-critical data flows.
intermix.io uses Amazon Redshift for batch processing large volumes of data in near real-time. Our data pipeline processes over 20 billion rows per day. We serve data from Amazon Redshift to our application by moving it into RDS and Amazon Elasticsearch Service.
Create Custom Workload Manager (WLM) Queues
The Amazon Redshift Workload Manager (WLM) is critical to managing query performance. Amazon Redshift runs queries in a queueing model. The default WLM configuration has a single queue with five slots. Almost 99% of the time, this default configuration will not work for you and you will need to tweak it. Configuring the WLM for your workloads provides two main benefits:
- Scaling workloads by giving them enough resources (e.g. concurrency and memory)
- Isolating and protecting your predictable workloads (i.e. batch operations) from your unpredictable workloads (i.e. ad hoc queries from reporting tools)
You can have up to 8 queues with a total of up to 50 slots. A query will run in a single slot, by default. Queries can be routed into queues using certain rules. Setting up your WLM the right way will eliminate queue wait times and disk-based queries.
To set-up your WLM for your workloads, we recommend following a four-step process:
- Separate users
- Define workloads
- Group users into workloads
- Select slot count & memory % per queue
Our guide to setting up Redshift WLM to improve performance walks you through our four-step process to eliminate queue wait times and reduce disk-based queries. Both slow your cluster down, so let’s take a closer look at this Redshift performance tuning technique.
Eliminate queue wait times by matching queue slot count to peak concurrency
If you’ve used Redshift for any period of time, you may have come across a situation where a query that used to run for two seconds starts running much slower. The most common reason for this is queuing. The query was waiting in a queue because the number of slots in the cluster was too low for the number of concurrent queries that were executing.
The default configuration allows you to run five concurrent queries in one queue. That means if five queries are executing, the sixth one will queue until a slot becomes available.
The goal is to ensure that queries are not waiting in the queue. This can be done by matching the slot count of the queue with the actual concurrency of the queries running in that queue.
You can eliminate queue wait times by:
- Increasing the slot count for you queues
- Reducing concurrency by distributing queries more evenly throughout the day.
There is another benefit to this approach – you can use Short Query Acceleration for Amazon Redshift (“SQA”) the right way and avoid the downside of SQA. Activating SQA consumes memory within the cluster – which brings us to disk-based queries.
Reduce disk-based queries by assigning enough memory to your queues
Increasing slot count to eliminate queuing can have an adverse side effect: disk-based queries. “Disk-based” means that the query runs out of RAM, and begins using the hard drive. Queries go disk-based because the query memory exceeds the ‘memory per slot’ in that queue. The memory per slot is calculated as:
memory assigned to that queue / # of slots
Since each queue is assigned a fixed percentage of a cluster’s memory (a value you’ll set when you configure your WLM queue), adding more slots will decrease the memory per slot.
Disk-based queries cause two major problems:
- Queries slow down because they need more I/O
- Concurrency goes up which causes more queueing.
When the frequency of disk-based queries goes up, a chain reaction can occur. More I/O causes more CPU, which in turn make queries run slower, increasing overall concurrency.
As a rule of thumb, maintain your queues such that fewer than 10% of queries go disk-based.
With our Throughput and Memory Analysis dashboards in intermix.io, we make finding the right slot count and memory percentage easy. When you can see the relevant metrics in an intuitive, time-series dashboard, allocating the right slot count and memory percentage for each queue becomes simple.
Use Change Data Capture (CDC)
The Amazon Redshift COPY command takes advantage of the parallel architecture and is the recommended way of moving data into Redshift. The COPY command is optimized, but the COPY operation is still expensive. The best practice is to only copy rows that you need.
The goal is to minimize the number of rows ingested. The best way to do this is to ensure that your ETL tools are only COPYing in data that has changed since the last time. Otherwise, you will have two issues:
- Frequent spikes in disk utilization which requires more free capacity
- Deleting redundant data (deduplication) which uses I/O and increases the need to run VACUUM operations.
Here is an example of a CDC operation:
Use Column Encoding
Adding compression to large, uncompressed columns has a big impact on cluster performance. Compression accomplishes two things:
- Reduce storage utilization. File compression reduces the size footprint of data, which means you use less of the disk on your cluster nodes.
- Improve query performance. There is fewer data to scan or join on, and I/O usage is limited which increases query speeds.
We recommend using the Zstandard (ZSTD) encoding algorithm. This relatively new algorithm provides a high compression ratio and works across all Amazon Redshift data types. ZSTD is especially good with VARCHAR and CHAR fields that have a mixture of long and short strings. Unlike some of the other algorithms, ZSTD is unlikely to increase storage utilization.
Here is a real-world example of applying ZSTD to three Amazon Redshift logging tables. The average storage reduction is over 50%!
Don’t ANALYZE on Every COPY
The Amazon Redshift COPY command loads data into a table. The default behavior of Redshift COPY command is to run two commands:
- “COPY ANALYZE PHASE 1|2” and
- “COPY ANALYZE $temp_table_name”
Amazon Redshift runs these commands to determine the correct encoding for the data being copied. This may be useful when a table is empty. But in the following cases, the extra queries are useless and should be eliminated:
- When COPYing into a temporary table (i.e. as part of an UPSERT)
- When the table already has data in it. For an existing table, encoding cannot change. So even if the COPY command determines that a better encoding style exists, it’s impossible to modify the encoding of the table without doing a deep copy operation.
In the example below, a single COPY command generates 18 ‘analyze compression’ commands and a single ‘copy analyze’ command.
Extra queries can create performance issues for other queries running on Amazon Redshift. They increase concurrency and hence, may saturate the number of slots in a WLM queue, causing other queries to have queue wait times.
The solution is to adjust the COPY command parameters to add “COMPUPDATE OFF” and “STATUPDATE OFF”. These parameters will disable these features during “UPSERT”s.
Here is an example of a “COPY” command carried out with those settings:
Don’t Use Redshift as an OLTP Database
It is common to connect an application framework like Django to Amazon Redshift. This is useful when using Redshift data in your application, i.e. in an OLTP scenario. However, since Amazon Redshift is an OLAP database, there is a chance it might not handle these queries well.
The challenge of using Redshift as an OLTP database is that queries can lack the low-latency that exists on a traditional RDBMS. Unlike OLTP databases, OLAP databases do not use an index. This is a result of the column-oriented data storage design of Amazon Redshift, which makes the trade-off to perform better for big data analytical workloads.
Consider this example from a live production cluster. The user ‘django_redshift’ is querying the table ‘search_word_level_course_vector”, a table with 443,744 rows. The query ran 374,372 times. Each query returned a single row.
Each query scans all 443,744 rows, takes about 0.02 seconds to run and returns a single row.
The impact on the cluster is quite dramatic:
- 374,371 queries @ 0.02s per query equal 7,487 seconds, or 125 minutes of query time. The commit queue backs up with all these requests, impacting the execution time of all other queries running in the cluster
- The query volume drives up concurrency and may exceed the number
#of available WLM slots, which results in queue wait times for other queries running in that queue.
There are two approaches to solve the problem:
- Re-write the queries to select all 443,744 rows of the table, and then parse each row in application memory. Doing so would remove 374,371 queries from your Redshift database. Such a single query would take just a few seconds, instead of 125 minutes.
- Use Amazon RDS and DBLINK to use Redshift as an OLTP.
Use DISTKEYs Only When Necessary to Join Tables
Distribution style is a table property that decides how to distribute rows for a given table across the nodes in your Amazon Redshift cluster. Choosing the correct distribution style is important for query performance.
There are two major considerations to keep in mind when choosing a distribution style:
- Minimize data movement across nodes, which is expensive because of network I/O and disk I/O.
- Distribute data evenly across your cluster to maximize query performance and minimize row skew. We will cover ‘row skew’ below.
The default distribution style is ‘EVEN’. All nodes contain an equal number of rows for a given table. The benefits of the ‘EVEN’ distribution style are:
- Table scans are fast since all nodes have the same workload
- Disk utilization of nodes is the same since there is no row skew. We explain what ‘row skew’ is below.
However, ‘EVEN’ distribution is not optimal when joining two tables. Consider what happens when two tables are JOINed:
- Select data for table 1
- Select data for table 2
- Move data to a single node (co-located)
- Join data on that node and store results
and the query execution continues from here.
With EVEN distribution, it’s easy to see that step 3 requires the movement of data between nodes. This is not ideal because it requires network (broadcast) traffic and increases I/O utilization across the cluster. Both factors increase query latency.
KEY-based distribution to make JOINs faster
To solve this problem and make JOINs faster, Amazon Redshift offers a KEY-based distribution style. With KEY-based distribution, Amazon Redshift will ensure that for a given column across two tables, step 3 (move data to a single node) will not be necessary. This is accomplished by applying an algorithm when writing data to nodes. The algorithm ensures that rows with the same value in the ‘DISTKEY’ column end up on the same node.
Consider an example where the name of the JOIN column is ‘customer_id’.
- The DISTKEY for table 1 must be “customer_id”
- The DISTKEY for table 2 must be “customer_id”
- Query 1 joins on table 1 and table 2 on “customer_id”
In this case, Query 1 will execute faster than the case when table 1 or table 2 uses an EVEN-based distribution.
Downsides of KEY-based distribution
But what happens when you run another type of query against table 1? For example, a query that does not join on “customer_id” but on another column? Or does not do a JOIN at all? Queries which do not JOIN on these columns may run much slower.
There are two main downsides of using KEY based distribution.
- Uneven node disk utilization: Row skew happens when you use KEY based distribution for a table, and the values in the DISTEY column are not evenly distributed. The result is that a node ends up having more rows for that table.
- Slower queries: With different row counts, all other queries, like a straight SELECT, which touch that table will be a little slower. Since one node has more data than the next, the query execution must wait for the “slowest” node” (i.e. the one with the most rows) to send up its data to the leader node.
When to use KEY-based distribution
KEY-based distribution is great if and only if you have a major query that you want to optimize. In all other cases, use an EVEN-based distribution. Using EVEN distribution will:
- eliminate row skew
- ensure SELECTs of that table are optimized
Maintain Accurate Table Statistics
Amazon Redshift builds a custom query execution plan for every query. For a given query plan, an amount of memory is allocated. The memory allocation is determined by estimating the amount of memory needed to store intermediate query results (as in a JOIN or aggregation).
The query plan allocates a certain amount of memory to each query by estimating the amount of memory needed to store intermediate results (e.g. for a JOIN or aggregation).
It is important for a query to have sufficient memory to not spill to disk (go “disk-based”). Allocating too much memory is not desirable, either. Queries do not share memory. Allocating more memory than needed wastes memory since it is unavailable to other queries.
Here, it is important to note that the system is not adaptive. If the plan was wrong and the query needs more (or less) memory than was allocated – the execution engine will not go back and adjust the memory allocation after the query has already started executing.
What could cause the plan to be wrong? Very often, it is the number of rows in a table.
The ANALYZE command will ensure that the planner has an accurate, up-to-date view of the row counts for tables. Let’s look at an example of what happens if the statistics are wrong.
EXAMPLE 1 – Table has more rows than the planner thinks it has
Let’s say the planner allocates too little memory to the query. Once the query starts running, it will encounter that it requires more memory than it was allocated. The query will go disk-based and thus will run slower than otherwise.
This could have been avoided by running the query in a slot with enough memory.
EXAMPLE 2 – Table has fewer rows than the planner thinks it has
Now, let’s assume that the planner allocates too much memory to the query. Once the query starts running it will encounter that it requires less memory to store intermediate results than it was allocated. It will lead to either of these two results:
- The query will not go disk-based. However, but it used up too much memory. That may cause other queries to go disk-based.
- The query was allocated more memory than was available in the slot it ran in, and the query goes disk-based. This could have been avoided with up-to-date statistics.
Amazon Redshift provides statistics called “stats off” to help determine when to run the ANALYZE command on a table. The “stats off” metric is the positive percentage difference between the actual number of rows and the number of rows seen by the planner.
As a best practice, we recommend running ANALYZE on any tables with a “stats off” percentage greater than 10%.
Write Smarter Queries
Amazon Redshift is a distributed, shared-nothing database that scales horizontally across multiple nodes. Query execution time is very tightly correlated with:
- the # of rows and data a query processes.
- the amount of data moving between nodes.
Below is an example of a poorly written query, and two optimizations to make it run faster.
Optimization #1: Limit Rows Processed by using a WHERE clause
Queries can run faster by minimizing the amount of data moving between nodes. In practice, this means being careful when writing multi-stage queries where the results of one stage feeds into the next.
In the case of our example query, modifying your ‘WHERE’ clauses to only select rows needed will minimize the amount of data that needs to be moved around and speed up the query.
Optimization #2: Limit Columns Scanned
Amazon Redshift is a column-oriented database. As a result, scanning a table doesn’t read each row in its entirety. Instead, individual columns can be scanned without needing to read other columns. You should be careful to only select columns that you will use for your query. Try to avoid using a
operation in all cases.
The two optimizations can dramatically improve your query speeds.
Re-writing queries is easier said than done, though. That’s why, we’ve built “Query Insights” into our product, a dashboard that gives you recommendations on how to re-write queries to drastically improve speed (a few customers went from 30min execution time to 30 seconds).
You can also send an email to your users directly from intermix.io dashboard to let them know what can be changed about a certain query.
Avoid Row Skew
Row Skew happens when a table uses KEY based distribution, and the values in the DISTKEY column are not evenly distributed. The row skew metrics is a positive integer ranging from 1 to the number of rows in the table. Row skew is the ratio of:
- number of rows on the node containing the most number of rows for the table
- number of rows on the node containing the least number of rows for the table
High row skew results in uneven node disk utilization (cost) and slower queries (performance).
The chart below shows a real-world example. With uneven disk utilization, a single node(s) ends up having more rows for that table. This can be a major (cost) problem if you need to add more nodes in your cluster just because a single node is skewed.
With high row skew, doing a straight SELECT on that table will be slower than otherwise. This is because one node has more data than the next, and the query execution must wait for the “slowest” node to send up its data to the leader.
There are two options to eliminate row skew:
- selecting a DISTKEY that is random, or
- change the distribution style to EVEN or ALL
The exception to tolerate row skew is if – and only if – you make a conscious decision to optimize a single query. See the section “Use DISTKEYs Only When Necessary” in this article for more information.
Use Short Query Acceleration (SQA)
Short Query Acceleration (SQA) will speed up the execution of short running queries. It does so by selecting certain queries to jump the queue. This can be useful when your cluster runs a mixture of big and small queries. In this case, a small query that would otherwise queue up behind a longer query will execute first.
SQA is enabled by default on Amazon Redshift clusters. But using SQA without any other adjustments to your cluster is not a recipe for success. There are other levers to pull first. See our quick guide to using Short Query Acceleration and WLM for Amazon Redshift for faster queries.
Compress Data in S3
The Amazon Redshift COPY command is the recommended way of moving data into Amazon Redshift. The COPY command takes advantage of the parallel architecture in Amazon Redshift to move data. The COPY command can read files from various sources, including EMR, DynamoDB, and remote hosts via SSH.
Compressing files in S3 when loading large amounts of data will accomplish three goals:
- Faster file upload to S3
- Lower S3 storage utilization (cost)
- Faster load process since uncompression un-compression can happen as files are read.
Long-running COPY commands will see the most improvement with this performance tuning technique for Redshift.
Manage Very Long Tables
Amazon Redshift is very good for aggregations on very long tables (e.g. tables with > 5 billion rows). Some use cases call for storing raw data in Amazon Redshift, reducing the table, and storing the results in subsequent, smaller tables later in the data pipeline.
This is a great use case in our opinion. However, managing very large tables presents two challenges:
- Pruning (i.e. deleting historical data) can be very expensive.
- Sorting the long table can be very expensive (or not possible)
This section discusses a few approaches to managing these issues for long tables more efficiently.
Use UNION to make it easier to PRUNE very long tables
Pruning a long table requires running the DELETE operation. This needs to be done rather frequently to avoid the table filling up your disk.
After every DELETE operation, you need to run the following three maintenance steps on the table:
- Reclaim space on the table
- Update statistics on the table
On a very long table, these operations can be very expensive.
To avoid the three steps, you can partition the very long table into smaller tables. Create multiple tables with the same schema, but with different table names. The rows in the table are then partitioned based on the chosen partition key. The job that INSERTs into these tables must be aware of the partitioning scheme.
To select from this table, create a view (with the original table name) and use the UNION directive to provide a consistent view to your application. This has the following benefits:
- The application doesn’t need to care about the partitioning, since the VIEW presents the same table name
- Pruning is simply a matter of dropping the “oldest” table. You don’t need to run VACUUM. Drop operations are very inexpensive and reclaim space immediately.
There is a downside to this approach, though. SELECTs on the table will go a bit slower since the UNION operation won’t be as fast as scanning a single table. But depending on your environment, it can be a small trade-off to avoid the pain of maintaining a very long table.
COPY in sort order
We’ve learned that sorting is an expensive operation. If you use an UPSERT method to COPY new data into a table, you will need to sort that table.
UPSERT is a method of de-duplicating data when copying into Amazon Redshift. The UPSERT operation merges new records with existing records using primary keys. While some RDBMSs support a single “UPSERT” statement, Amazon Redshift does not support it. Instead, you should use a staging table for merging records.
Since UPSERT performs a DELETE, it may leave the table in an unsorted state.
One approach to eliminate the need to sort the table is to COPY in sort order.
There are a few caveats when considering using this method:
- It only works for COPYs (not regular inserts)
- Using a manifest is problematic because the ordering of files isn’t guaranteed
- The table can have only one sort key (interleaved style is not supported)
- The sort column should be NOT NULL and the table should be 100% sorted (or empty)
- New rows are higher in sort order than existing rows, including rows marked for deletion.
Use RA3 Nodes for Independent Compute and Storage Scaling
In 2019, Amazon introduced RA3 nodes for Redshift. They use large SSDs for local caching coupled with automatic data eviction, data pre-fetching, and other data management techniques. With RA3 nodes, you can separate compute and storage. You can scale your clusters according to your compute needs, which brings down your overall cost of analytics.
The basic idea behind RA3 nodes is to use S3 for storing all permanent data and use the local disk for caching. You can fetch data from S3 on-demand. Additionally, Redshift identifies data that is used frequently – hot data – and keeps it local for fast compute times. You can create RA3 node clusters via the AWS management console.
Use Amazon Redshift Spectrum for Infrequently Used Data
Amazon Redshift launched with disruptive pricing. To compare the cost, we’re looking at the price for storing 1TB of data for one year ($ / TB / Year). With a 3-year commitment for the ds2.8xlarge nodes, the price comes down to $934 / TB / Year. That price point is unheard of in the data warehousing world.
The average Amazon Redshift customers double their data every year. In fact, that is one of the reasons why it’s important to focus on performance improvements – since managing performance becomes a bigger challenge as data volume grows.
At some point, the cost of storing all this data in Amazon Redshift becomes prohibitive. Keeping a multi-year history of data “forever” can become expensive. Deleting data may not be an option due to regulatory reasons.
Amazon Redshift prices are based on the size of your cluster. In other words, compute and storage are coupled. You’ll have to keep adding nodes for storage, even though you may not need the additional computing power of the vCPUs.
However, since storing data is cheap on Amazon Redshift, a common initial behavior is to store all historical raw data in Redshift. But data volume is growing. You may also want to use the faster but more expensive dense compute nodes. Many companies don’t want to make a capital commitment beyond a 1-year term.
Enter Amazon Redshift Spectrum. With Redshift Spectrum, you can leave data as-is in your S3 data lake, and query it via Amazon Redshift. In other words, you can de-couple compute from storage. This approach makes sense when you have data that doesn’t require frequent access. Leave your “hot” data in Amazon Redshift, and your “cold” data in S3.
The impact on cost can be substantial. The price for S3 Standard Storage is $281 / TB / Year. Thus, with Redshift Spectrum, you get the best of both worlds. We call it “data tiering”. You get to keep all your historical data, along with the performance of Amazon Redshift. With Redshift Spectrum you can benefit from the cost savings of using S3.
In “Amazon Redshift Spectrum: How Does It Enable a Data Lake?”, we’re taking an even closer look at using Redshift as part of a data lake architecture, including the use of Amazon Athena and AWS Glue. Talking of Redshift Spectrum, here is a bonus tip to fine-tune the performance of your Redshift cluster.
Use Columnar Formats for S3 Data
It is more efficient to store S3 data in columnar formats, such as Apache Parquet. Columnar formats deliver better performance when compared to row-based formats. There is a factor for cost-efficiency, too. With columnar formats, Redshift Spectrum scans only the columns that have the required data records. It does not read all the columns. Since you pay for the volume of data scanned, Apache Parquet helps cut down your data analysis costs.
You can even further optimize Parquet by using frequently filtered columns to sort data, thus enhancing the cost-efficiency of Redshift and Redshift Spectrum.
Get the Most Out of Redshift
Amazon Redshift is a cloud-based data warehouse that offers high performance at low costs. But uneven query performance or challenges in scaling workloads are common issues with Amazon Redshift. Use the performance tuning techniques for Redshift mentioned here to lower the cost of your cluster, improve query performance, and make your data team more productive.