AWS Redshift Architecture: Clusters & Nodes & Data Apps
This is a cross-posting from https://www.intermix.io/blog/amazon-redshift-architecture/
In this post, we’ll lay out the four major components of Amazon Redshift’s architecture:
- Data applications
- Leader nodes
- Compute nodes
- Redshift Spectrum
Understanding the components and how they work is fundamental for building a data platform with Redshift. In the post, we’ll provide tips and references to best practices for each component.
Table of Contents
- Amazon Redshift Architecture and The Life of a Query
- Data apps: More than SQL client applications
- Two Types of Nodes that make a Cluster
- Redshift Spectrum Architecture
- How to get the most out of your Amazon Redshift cluster
Since launch, Amazon Redshift has found rapid adoption among SMBs and enterprises alike. In the early days, business intelligence was the major use case for Redshift.
That has changed.
Today, of course, we still see companies using BI dashboards like Tableau, Looker, and Periscope Data with Redshift. But with rapid adoption, the use cases for Redshift have evolved beyond reporting, and that has come with a major shift in end-user expectations:
- Mission-critical: Redshift is now at the core of data lake architectures, feeding data into business-critical applications and data services the business depends on. End users expect service level agreements (SLAs) for their data sets.
- Growing data volume: On average, data volume grows 10x every 5 years. The average intermix.io customer doubles their data volume each year. End-users expect data platforms to handle that growth.
- More choice: We see a constant flux of new data sources and new tools to work with data. End-users expect to operate in a self-service model, to spin up new data sources and explore data with the tools of their choice.
The shift in expectations has implications for the work of the database administrator (“DBA”) or data engineer in charge of running an Amazon Redshift cluster. The static world is gone. Today, data sets have become so large and diverse that data teams have to innovate around how to collect, store, process, analyze and share data.
In the case of Amazon Redshift, much of that depends on understanding the underlying architecture and deployment model. It’s what drives the cost, throughput volume and the efficiency of using Amazon Redshift.
And so in this blog post, we’re taking a closer look at the Amazon Redshift architecture, its components, and how queries flow through those components. We’ll include a few pointers on best practices.
Amazon Redshift Architecture and The Life of a Query
This architecture diagram shows how Amazon Redshift processes queries. We’re excluding Redshift Spectrum in this image, as that layer is independent of your Amazon Redshift cluster. (We’ll explain that part in a bit.)
However, we do recommend using Spectrum from the start as an extension into your S3 data lake. We’ll go deeper into the Spectrum architecture further down in this post. Let’s first take a closer look at the role of each component.
Data apps: More than SQL client applications
In other reference architectures for Redshift, you will often hear the term “SQL client application”. And SQL is certainly the lingua franca of data warehousing.
But with the shift away from reporting to new types of use cases, we prefer the term “data apps”. Unlike plain SQL written in an editor, data apps imply the use of data engineering techniques, i.e. the use of code/software to work with data.
Data apps run workloads or “jobs” on an Amazon Redshift cluster. There are three generic categories of data apps:
- Data integration (“Loads”): This category includes applications that move data from external data sources and systems into Redshift. Examples are Informatica, Stitch Data, Fivetran, Alooma, or ETLeap.
- Workflow orchestration (“Transforms”): These are systems that run batch jobs on a predetermined schedule. For example, once data is in a cluster, you will still need to filter, clean, join, or aggregate data across various sources. Open-source examples of these tools are Apache Airflow, Pinball, and Luigi.
- Analysis (“Ad-hoc”): These are apps for data science, reporting, and visualization. Examples are Tableau, Jupyter notebooks, Mode Analytics, Looker, Chartio, and Periscope Data. Analysts might also run ad-hoc queries to extract data for downstream consumption, e.g. for a machine learning application or a data API.
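As a concrete sketch of a “transform” workload, a scheduled batch job might aggregate raw events into a daily summary table. The table and column names below are hypothetical:

```sql
-- Hypothetical nightly batch job: aggregate raw events into a daily summary.
-- A workflow tool such as Apache Airflow would run this on a schedule.
begin;
drop table if exists daily_page_views;
create table daily_page_views as
select
    trunc(event_time) as view_date,
    page_id,
    count(*)          as view_count
from raw_events
where event_type = 'page_view'
group by trunc(event_time), page_id;
commit;
```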
What does this mean for the DBA?
The Amazon Redshift architecture is designed to be “greedy”: a query will consume all the resources it can get. To protect workloads from each other, a best practice for Amazon Redshift is to set up workload management (“WLM”). Treat your WLM setup as a top-level architectural component.
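For example, once you have configured separate WLM queues for your loads, transforms, and ad-hoc traffic, a data app can route its session to the right queue via a query group. The queue name below is hypothetical and assumes a matching queue exists in your WLM configuration:

```sql
-- Route this session's queries to a hypothetical "transforms" WLM queue.
set query_group to 'transforms';

-- ... run the batch transform statements here ...

-- Return subsequent queries to the default queue.
reset query_group;
```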
Two Types of Nodes that make a Cluster
A “cluster” is the core infrastructure component for Redshift, which executes workloads coming from external data apps. There are two key components in a cluster:
- Compute Nodes: A cluster contains at least one “compute node”, to store and process data.
- Leader Node: Clusters with two or more compute nodes also have a “leader node”. The leader coordinates the distribution of workloads across the compute nodes. A cluster only has one leader node.
In our experience, most companies run multi-cluster environments, also called a “fleet” of clusters.
What does this mean for the DBA?
It’s easy to spin up a cluster, pump in data and begin performing advanced analytics in under an hour. That makes it easy to skip some best practices when setting up a new Amazon Redshift cluster. Read more in “3 Things to Avoid When Setting Up an Amazon Redshift Cluster”.
The leader node has four major roles:
- Communication with data apps: When running workloads on a cluster, data apps interact only with the leader node. The compute nodes are transparent to external data apps.
- Distribution of workloads: The leader node parses queries, develops an execution plan, compiles SQL into C++ code and then distributes the compiled code to the compute nodes.
- Caching of query results: When a query is executed in Amazon Redshift, both the query and its results are cached in the memory of the leader node, across different user sessions to the same database. When neither the query nor the underlying data has changed, the leader node skips distribution to the compute nodes and returns the cached result for faster response times.
- Maintenance of catalog tables: The system catalogs store schema metadata, such as information about tables and columns. System catalog tables have a PG prefix. A query that references only catalog tables, or that does not reference any tables at all, runs exclusively on the leader node.
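For example, a query that only reads the PG catalog never touches the compute nodes. A minimal sketch, assuming a table named `sales` exists in your search path:

```sql
-- Runs exclusively on the leader node: it references only catalog tables.
select "column", type, distkey, sortkey
from pg_table_def
where tablename = 'sales';
```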
What does this mean for the DBA?
In some cases, the leader node can become a bottleneck for the cluster. The telltale pattern is an increase in your COMMIT queue stats: for example, a larger cluster means more metadata, which requires more processing by the leader node. You can query STL_COMMIT_STATS to determine what portion of a transaction was spent on commit and how much queuing is occurring.
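A sketch of such a check against STL_COMMIT_STATS, filtered to a single transaction (the transaction ID below is a placeholder you would look up from your own workload):

```sql
-- Per-node commit timing for one transaction; node -1 is the leader node.
select node,
       datediff(ms, startqueue, startwork) as queue_ms,
       datediff(ms, startwork, endtime)    as commit_ms,
       queuelen
from stl_commit_stats
where xid = 123456   -- placeholder transaction id
order by node;
```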
The compute nodes handle all query processing in parallel (“massively parallel processing”, or “MPP”).
Amazon Redshift provides two categories of nodes:
- Dense compute nodes come with solid-state drives (“SSD”) and are best for performance-intensive workloads.
- Dense storage nodes come with hard disk drives (“HDD”) and are best for large data workloads.
As your workloads grow, you can increase the compute and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
Compute nodes are also the basis for Amazon Redshift pricing. You can start with hourly on-demand consumption. Prices for on-demand range from $0.25 (dense compute) to $6.80 per hour (dense storage), with discounts of up to 69% for 3-year commitments.
What does this mean for the DBA?
The execution speed of a query depends a lot on how fast Redshift can access and scan data that’s distributed across nodes. A best practice is to choose the right distribution style for your data by defining distribution keys.
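A minimal sketch of defining a distribution key and sort key at table creation; the table and column names are illustrative:

```sql
-- Distribute rows by customer_id so joins on that column stay node-local,
-- and sort by sale_date so range-restricted scans read fewer blocks.
create table sales (
    sale_id     bigint,
    customer_id bigint,
    sale_date   date,
    amount      decimal(12,2)
)
distkey (customer_id)
sortkey (sale_date);
```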
Adding nodes is an easy way to add more processing power, and resizing your cluster with more nodes is an easy way to address performance issues. As a result, many Redshift customers run over-provisioned clusters. Because nodes are the basis for pricing, that can add up over time. Removing nodes is a much harder process, but it’s also the only way to reduce your Redshift cost.
Redshift Spectrum Architecture
In some cases, it may make sense to shift data into S3, where storage costs roughly a tenth of what it does on Redshift compute nodes. With Amazon Redshift Spectrum you can query data in Amazon S3 without first loading it into Amazon Redshift.
Image 2 shows what an extended architecture with Spectrum and query caching looks like. The leader node decides:
- whether a cached result can be returned
- which parts of the query run locally in the cluster
- which parts get sent to Amazon Redshift Spectrum
The leader node includes the corresponding steps for Spectrum into the query plan. The compute nodes in the cluster issue multiple requests to the Amazon Redshift Spectrum layer.
Spectrum scans S3 data, runs projections, filters and aggregates the results. Spectrum sends the final results back to the compute nodes. The compute nodes run any joins with data sitting in the cluster. That way, you can join data sets from S3 with data sets in Amazon Redshift.
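As a sketch, querying S3 through Spectrum starts with an external schema and an external table; the database name, IAM role ARN, bucket path, and column names below are all placeholders:

```sql
-- External schema backed by the AWS Glue Data Catalog (role ARN is a placeholder).
create external schema spectrum
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;

-- External table over Parquet files in S3 (bucket and path are placeholders).
create external table spectrum.web_events (
    event_time timestamp,
    user_id    bigint,
    url        varchar(2048)
)
stored as parquet
location 's3://my-data-lake/web_events/';

-- Join S3 data with a local Redshift table: the scan and aggregation of
-- spectrum.web_events run in the Spectrum layer, the join runs on the
-- compute nodes.
select u.plan, count(*)
from spectrum.web_events e
join users u on u.id = e.user_id
group by u.plan;
```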
Redshift Spectrum pricing is based on the volume of data scanned, at a rate of $5 per terabyte.