Improve Amazon Redshift COPY performance: Don’t ANALYZE on every COPY
This is a cross-posting from https://www.intermix.io/blog/improve-redshift-copy-performance/
Table of Contents
- Introduction
- Improving Redshift COPY Performance: The COPY Command
- Improving Redshift COPY Performance: What is “UPSERT”?
- Improving Redshift COPY Performance: Eliminating Unnecessary Queries
Introduction
Amazon Redshift is an excellent choice for cloud data warehousing—but how do you move your data into Redshift in the first place, so that it can be used for queries and analysis? Redshift users have two main options:
- Copy the data into Redshift local storage by using the COPY command.
- Use Amazon Redshift Spectrum to directly query data in Amazon S3, without needing to copy it into Redshift.
In this post, we’ll discuss an optimization you can make when choosing the first option: improving performance when copying data into Amazon Redshift.
Improving Redshift COPY Performance: The COPY Command
The Amazon Redshift COPY command loads data into a table. The files can be located in an Amazon S3 bucket, an Amazon EMR cluster, a remote host that is accessed using SSH, or an Amazon DynamoDB table.
There are a few things to note about using the Redshift COPY command:
- The maximum size of a single input row from any source is 4 MB.
- Amazon Redshift Spectrum external tables are read-only; you can’t COPY to an external table.
- The COPY command appends the new data to the end of the table, without modifying any existing rows.
- In Amazon Redshift, primary keys are not enforced. This means that deduplication must be handled by your application.
Per this last note, the recommended way of deduplicating records in Amazon Redshift is to use an “upsert” operation. In the next section, we’ll take a closer look at upserts.
Improving Redshift COPY Performance: What is “UPSERT”?
UPSERT is a method of deduplicating data when copying into Amazon Redshift or other databases. An “upsert” operation merges new records with existing records using primary keys.
While some relational database management systems support a single UPSERT command, Amazon Redshift does not. Instead, Redshift recommends the use of a staging table for merging records by joining the staging table with the target table.
Below is an example of an upsert operation for Amazon Redshift:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | -- Start transaction BEGIN; -- Create a temp table to load new customer data CREATE TEMPORARY TABLE temp_customer (LIKE customer); -- Load new customer data into the staging table COPY temp_customer (custid, name, email) FROM 's3://.......' CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx'; -- Update customer email and name for existing rows UPDATE customer c SET name = t.name, email = t.email FROM temp_customer t WHERE c.custid = t.custid; -- Insert records INSERT INTO customer SELECT t.* FROM temp_customer t LEFT JOIN customer c ON t.custid = c.custid WHERE c.custid IS NULL; -- End transaction. Note that the temp table will automatically be dropped a the end of the session END; |
Improving Redshift COPY Performance: Eliminating Unnecessary Queries
By default, the Redshift COPY command automatically runs two commands as part of the COPY transaction:
- “COPY ANALYZE PHASE 1|2”
- “COPY ANALYZE $temp_table_name”
Redshift runs these commands to determine the correct encoding for the data being copied, which may be useful when a table is empty. In the following cases, however, the extra queries are useless and should be eliminated:
- Performing a COPY into a temporary table (i.e. as part of an upsert operation).
- Performing a COPY when the table already has data in it. In Redshift, the data encoding of an existing table cannot be changed. Even if the COPY command determines that a better encoding style exists, it’s impossible to modify the table’s encoding without a deep copy operation.
Extra queries can create performance issues for other queries running on Amazon Redshift. For example, they may saturate the number of slots in a WLM queue, thus causing all other queries to have wait times.
The solution is to adjust the COPY command parameters to add “COMPUPDATE OFF” and “STATUPDATE OFF”, which will disable these features during upsert operations. Below is an example of a COPY command with these options set:
1 2 3 4 5 6 7 8 9 | -- Load data into the staging table COPY users_staging (id, name, city) FROM 's3://.......' CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx' COMPUPDATE OFF STATUPDATE OFF; |
Improving Redshift COPY performance is just one way to perform Redshift performance tuning.