Recently we needed to move some data around. In simple terms, we needed to redistribute a large table into S3 objects based on a shard key.

The problem was that the distribution key (let's call this company_id) was not available on the large table (let's call this order_events) we needed to export.

We explored several solutions before deciding to use Redshift to do this kind of one-time ETL. The process was pretty simple:

  1. Use DMS (AWS Database Migration Service) to load several tables into a new Redshift cluster.
  2. Run some SQL for the ETL.
  3. Sort the table by the new distribution key.
  4. Redshift UNLOAD into S3. We needed it in S3 in this case, but the data could have been loaded anywhere quite easily.

The Redshift cluster consisted of a single node with 4 vCPUs, 31 GiB RAM, a 2 TB magnetic disk and 0.40 GB/s of I/O throughput.



Unfortunately I only realised this would make a good blog post after the DMS load was complete, so I don't have specific stats on how long the initial load took. But if I remember correctly it was on the order of hours, not days.

So let's begin from the point of the data already loaded into Redshift...

SELECT COUNT(*) FROM order_events;

2,046,706,417 rows in 5.3 seconds.

Great. So Redshift is pretty good at counting rows, likely because it's a column-orientated database. So the next operation should also be pretty quick:

ALTER TABLE order_events ADD COLUMN company_id INT;

Done in 9.2 seconds. Wow!

Now we need to do the UPDATE JOIN that will fill in the new company_id on the 2 billion rows:

UPDATE order_events
SET company_id = terminal.company_id
FROM terminal
WHERE terminal.numeric_id = order_events.terminal_id;

Done in just under 25 minutes. That works out to roughly 1.4 million rows per second. Nice.
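One quick sanity check worth running at this point (not part of the original run, just an illustrative query) is to count any rows whose terminal_id had no matching terminal row and were therefore left with a NULL company_id:

-- Rows that did not pick up a company_id from the join
SELECT COUNT(*) FROM order_events WHERE company_id IS NULL;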



Next we need to split the table into one S3 object for each company_id (the new attribute we just added). This would be very slow while the table has no sort key, so we need to sort it by company_id.

So far Redshift has not even broken a sweat. But sorting is an intensive process and we are using magnetic (not SSD) storage.

It is not possible to change the sort key on an existing table, so we have to create a similar table with the new sort key and insert all the records into it.

BEGIN;

CREATE TABLE order_events_new (
-- fields removed for brevity
) SORTKEY (company_id);

INSERT INTO order_events_new SELECT * FROM order_events;

DROP TABLE order_events;

ALTER TABLE order_events_new RENAME TO order_events;

COMMIT;

The transaction (virtually all of which is the INSERT) took 7.5 hours to complete. Not bad. What is interesting is what we see when we look at the disk space used:



The INSERT itself finishes in around 30 minutes; the remaining time is spent sorting the data on disk.
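If you want to watch this happening yourself, a query along these lines against Redshift's STV_PARTITIONS system view gives a rough picture of disk usage per node (a sketch, not the exact query we used; the values are counts of 1 MB blocks):

-- Approximate disk usage per node, in 1 MB blocks
SELECT owner AS node, SUM(used) AS used_mb, SUM(capacity) AS capacity_mb
FROM stv_partitions
GROUP BY owner
ORDER BY owner;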



One other thing was to make sure we did not have duplicates. This was not strictly necessary, since the id came from a primary key in MySQL, but seeing what Redshift can do is fun, so here we go...

SELECT company_id, id
FROM order_events
GROUP BY company_id, id
HAVING COUNT(*) > 1;

As expected, no duplicate IDs. This query took 27 minutes.



Now we are able to export the new table to S3, one object per company_id, with a bit of bash. And it worked a treat!
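For illustration, here is a minimal sketch of the kind of UNLOAD statement the bash script issues for each company_id; the bucket name, IAM role and the company_id value 42 are placeholders rather than the real ones:

-- Export one company's rows to its own S3 prefix
UNLOAD ('SELECT * FROM order_events WHERE company_id = 42')
TO 's3://example-bucket/order_events/42/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-unload'
DELIMITER ',' GZIP ALLOWOVERWRITE;

Because the table is now sorted on company_id, each UNLOAD only has to scan the blocks for that one company rather than the whole 2 billion rows.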