Boost your Workload Scalability with Smarter Amazon Redshift WLM Set Up

This is a cross-posting from https://www.intermix.io/blog/4-simple-steps-to-set-up-your-wlm-in-amazon-redshift-the-right-way/

One of the major propositions of Amazon Redshift is simplicity. It only takes minutes to spin up a cluster. The time-to-first-report, i.e. the time it takes to go from creating a cluster to seeing the results of your first query, can be less than 15 minutes. That’s true even for petabyte-scale workloads.

Because it’s so easy to set-up a cluster, however, it can also be easy to overlook a few housekeeping items when it comes to setting up Redshift. That can cause problems with scaling workloads down the road. A couple of general complaints we often hear are “slow queries in Redshift” or “slow Redshift dashboards”.

Enter Amazon Redshift workload management (WLM). Without using WLM, each query gets equal priority. As a result, some workloads may end up using excessive cluster resources and block your business-critical processes.

Here are three frequent issues we hear:

  1. Loading data in Redshift takes too long. Even with efficient copy operations from Amazon S3, it takes too long to import data at scale.
  2. Redshift queries overflow to disk and consume the entire SSD. Trying to avoid inefficient queries can seem impossible.
  3. Huge strain and contention on a Redshift cluster when data loading and querying take place at the same time.

You can help address these challenges by using our top 15 performance tuning techniques for Amazon Redshift. However, odds are that you’ll also be able to get some quick performance gains by adjusting your WLM.

In this post, we’ll recommend a few simple best practices that will help you configure your WLM the right way and avoid these problems.

WLM is the single best way to achieve concurrency scaling for Amazon Redshift. Your users will be happy (thanks to fast queries). You can scale as your data volume grows. And you’ll spend less time putting out fires and more time on core business processes.

Table of Contents

  • Understanding Amazon Redshift Workload Management
  • 4 Steps to Set Up Redshift Workload Management
  • Additional Redshift WLM Settings
  • Redshift WLM Queues: Finding the Right Slot Count and Memory Percentage

Understanding Amazon Redshift Workload Management

Amazon Redshift operates in a queueing model. The first step in setting up WLM for Redshift is to define queues for your different workloads. Next, you need to assign a specific concurrency/memory configuration for each queue.

You can define up to 8 queues, with a total of up to 50 slots. In the Amazon Redshift documentation, you’ll read to not go above 15 slots. By using the techniques in this post, however, you’ll be able to use all 50 available slots. You will also have clear visibility to see when and how you need to fine-tune your settings.

The WLM tab in the Amazon Redshift console
Image 1: The WLM tab in the Amazon Redshift console

The default configuration for Redshift is a single queue with a concurrency of 5. If you run more than 5 concurrent queries, then later queries will need to wait in the queue. That’s when the “Redshift queries taking too long” thing goes into effect.

In Redshift, the available amount of memory is distributed evenly across each concurrency slot. For example, if you have a total of 1 GB of memory, then with the default configuration, each of the 5 concurrency slots gets 200 MB.

If you run a Redshift query that needs more than 200 MB, then it falls back to disk, which means that it takes longer to execute. Disk-based queries also consume a lot of I/O operations. That slows down the entire cluster, not just queries in a specific queue.

Users then try to scale their way out of contention by adding more nodes, which can quickly become an expensive proposition. In addition, you may not see the results you want, since the performance increase is non-linear as you add more nodes.

Instead, you can achieve a much better return on your Amazon Redshift investment by fine-tuning your Redshift WLM. Configuring Redshift specifically for your workloads will help you fix slow and disk-based queries. It’s very likely that  the default WLM configuration of 5 slots will not work for you, even if Short Query Acceleration is enabled (which is the Redshift default).

You can read how our customer, Udemy, managed to go all the way to 50 slots and squeeze every bit of memory and concurrency out of their 32-node cluster in this blog post.

Here is what they wrote:

How Udemy uses all of the 50 slots available in their WLM setup to increase Redshift performance.
Read the full Udemy’s Redshift here

4 Steps to Set Up Redshift Workload Management

When users run a query in Redshift, WLM assigns the query to the first matching queue and then executes rules based on the WLM configuration.

The key concept for using the WLM is to isolate your workload patterns from each other. You can create independent queues, with each queue supporting a different business process, e.g. data loads or dashboard queries. With separate queues, you can assign the right slot count and memory percentage.

The image below describes the four distinct steps to configure your WLM.

4 steps to configure your workload management in Amazon Redshift
Image 2: 4 steps to configure your workload management in Amazon Redshift

Let’s look at each of these four steps in detail.

Step 1: Set Up Individual Users

The first step is to create individual logins for each Redshift user. A user can be a person, an app, or a process—anything that can run a query.

Separating users may seem obvious, but when logins get shared, you won’t be able to tell who is driving which workloads. Although this may not be too difficult with only a few users, the guesswork will increase quickly as your organization grows.

Set up individual users in Amazon Redshift
Image 3: Set up individual users in Amazon Redshift

Most importantly: Never use the default Redshift user for queries. First, it has administrative privileges, which can be a serious security risk. Second, you should consider the default Redshift user as your lifeline when you run into serious contention issues— you’ll still be able to use it to run queries.

If your cluster is already up and running with a few users, we recommend doing a reset: delete the old users and assign everybody new logins.

Step 2: Define Your Workloads

The next step is to categorize all users by their workload type. There are three generic types of workloads:

  1. Loads: Jobs that load data into the cluster. These workloads typically use COPY and UNLOAD statements.
  2. Transforms: Batch jobs and scheduled transformations. These workloads typically consist of INSERT, UPDATE and DELETE transactions
  3. Ad-hoc: These are queries by analysts and/or dashboards, typically consisting of SELECT statements.
how to define each login/user by their workload type in Amazon Redshift
Image 4: Define workload types

Defining users by workload type will allow you to both group them together and separate them from each other. You’ll very likely find that workloads of the same type share similar usage patterns.

Step 3: Group Users by Workload Type

We can use these similarities in workload patterns to our advantage. By grouping them, we’ll have groups of queries that tend to require similar cluster resources. For example, loads are often low-memory and high-frequency. Ad-hoc queries, on the other hand, run less frequently, but can be memory-intensive.

User groups by workload type in Amazon Redshift
Image 5: User groups by workload type in Amazon Redshift

Use the CREATE GROUP command to create the three groups ‘load’, ‘transform’ and ‘adhoc’, matching the workload types we defined for our users. Use ALTER GROUP to add the users we defined in step #2 to their corresponding group.

You can of course create more granular sub-groups, e.g. for departments such as sales, marketing, or finance. That way, you can give the users in each group the appropriate access to the data they require. However, you should still stay within the logic of workload patterns, without mixing different workload groups.

Step 4: Define Slot Count & Memory Percentage

The final step determines what slot count to give each queue, and the memory allocated to each slot.

You should keep the default queue reserved for the default user, and set it to a concurrency of 1 with a memory percentage of 1%. The default queue is your insurance in case something goes wrong—just consider the 1% of memory as a cost of doing business.

Define slot count & memory percentage
Image 6: Defining WLM slot count and memory percentage in Amazon Redshift

For the other queues, slot count and memory will determine if each query has:

  • a slot at run time
  • enough memory to execute in-memory

If both of these things are true, that’s when you get blazing fast Redshift queries and throughput. To apply the new settings, you need to create a new parameter group with the Redshift console.

Additional Redshift WLM Settings

When you apply the new settings, we also recommend activating Short Query Acceleration and Concurrency Scaling.

Even with proper queue configuration, some queries within a queue take longer to execute, and may block other short-running queries during peak volume. By using Short Query Acceleration, Redshift will route the short queries to a special “SQA queue” for faster execution.

Concurrency Scaling for Amazon Redshift gives Redshift clusters additional capacity to handle bursts in query load. It works by off-loading queries to new, “parallel” clusters in the background. Queries are routed based on your WLM configuration and rules.

Redshift WLM Queues: Finding the Right Slot Count and Memory Percentage

With your new WLM configuration, and SQA and Concurrency Scaling enabled, all that’s left now is to find the right slot count and memory percentage for your queues.

Unfortunately, that process can feel a little bit like trying to look into a black box.

AWS provides a repository of utilities and scripts for querying the system tables (STL tables and STV tables). The scripts help you to find out e.g. what the concurrency high-water mark is in a queue, or which queries fall back to disk.

There are three potential challenges, though, with using these AWS scripts:

  1. Scripts can be incomplete: Some of the information is ephemeral, as Redshift deletes logs on a rolling basis. If you don’t run the script at the right time, the information is gone.
  2. Scripts increase cluster load: Because you’re querying the system tables, you’re putting more load on the system—exactly the opposite of what you want to do when you’re experiencing contention.
  3. Scripts require maintenance: Scripts need to run and store the results. It’s almost like building another application. That requires money and the use of your most valuable resource—engineering hours.