This article references the number table in a previous article: http://elliot.land/post/redshift-does-not-support-generate-series
What Is It?
It is common practice in data warehousing and reporting to use date and time dimension tables.
A date dimension table assigns an index to each day from an arbitrary starting point. Where that starting point is depends on how far back you will need to go. Usually picking a date at or slightly before the earliest records in your database is a good choice.
A time dimension table is similar: it assigns an index to each second of a single day, with no date component. Used in combination with the date dimension table, these two integer indexes identify an exact second on any date.
Each has its own table that contains the date or time ID along with any information about that particular day or second.
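To make the two indexes concrete, here is a minimal Python sketch of the arithmetic behind them. The starting date, the zero-based numbering, and the function names are assumptions made for illustration; they are not taken from the tables in this article.

```python
from datetime import date, datetime, time, timedelta

# Assumed starting point for the date dimension. Pick a date at or
# slightly before the earliest records in your own database.
EPOCH = date(2010, 1, 1)


def to_date_id(d: date) -> int:
    """Days elapsed since EPOCH (EPOCH itself is 0 in this sketch)."""
    return (d - EPOCH).days


def to_time_id(t: time) -> int:
    """Seconds elapsed since midnight (00:00:00 is 0)."""
    return t.hour * 3600 + t.minute * 60 + t.second


def from_ids(date_id: int, time_id: int) -> datetime:
    """Combine the two integer indexes back into an exact second."""
    midnight = datetime.combine(EPOCH, time.min)
    return midnight + timedelta(days=date_id, seconds=time_id)


moment = datetime(2017, 3, 15, 14, 30, 5)
date_id = to_date_id(moment.date())
time_id = to_time_id(moment.time())
restored = from_ids(date_id, time_id)
```

The pair (date_id, time_id) round-trips to the original timestamp, which is all a fact table needs to store.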
Why Is It Needed?
Let's say we have a table that contains sales information; lots of it. How would you go about handling reports that need to filter or summarise (group) on any of the following?
- Only business days (Mon-Fri).
- All days except the last day of each month.
- Specific week numbers of the year.
- A fiscal quarter.
Trying to do these date calculations in your queries raises a few problems:
- Date calculations are complicated even in the easiest cases, and making sure you handle all the edge cases is error prone.
- Once you start performing date calculations in a query, it is very unlikely the database will be able to use an index or make other optimisations. If a full table scan is required and the date function has to be evaluated for every record, queries can become extremely expensive.
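The second point can be observed in a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in; Redshift relies on sort keys rather than conventional indexes, so this only illustrates the general principle. The sales table layout, the sale_date column, and the index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        saleid     INTEGER PRIMARY KEY,
        sale_date  TEXT    NOT NULL,  -- raw date, e.g. '2017-03-15'
        saledateid INTEGER NOT NULL,  -- ID from the date dimension
        amount     REAL    NOT NULL
    );
    CREATE INDEX sales_sale_date  ON sales (sale_date);
    CREATE INDEX sales_saledateid ON sales (saledateid);
""")


def plan(sql: str) -> str:
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Filtering on a value computed from the date column: the index on
# sale_date cannot be used to narrow the rows.
computed_plan = plan(
    "SELECT SUM(amount) FROM sales "
    "WHERE strftime('%w', sale_date) BETWEEN '1' AND '5'"
)

# Filtering on the plain integer ID: the index can be searched directly.
lookup_plan = plan(
    "SELECT SUM(amount) FROM sales WHERE saledateid BETWEEN 2557 AND 2616"
)

print(computed_plan)
print(lookup_plan)
```

The first plan reports a scan of the whole table, while the second reports a search using the index on saledateid. The exact wording of the plan text varies between SQLite versions.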
I See. How Do I Set This Up?
Although Redshift is advertised as having almost the same features as PostgreSQL, a few big missing features make it very difficult to generate a date dimension table:
- generate_series() is not supported. We need this to generate all of the initial data.
- Only the leader node can perform any useful date calculations, so we get a lot of errors when we do any calculation involving a current timestamp.
Despite the obstacles we can still do it. Here is the table definition:
And we populate it:
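As a rough illustration of what the population step has to compute for each day, here is a minimal Python sketch rather than Redshift SQL. The column names day_is_weekday, day_is_last_of_month, year_week_number, fiscal_year_number and fiscal_qtr_number come from this article. Everything else is an assumption: the date_id and full_date columns, the start date, ISO 8601 week numbering, and a fiscal year that begins on 1 July and is named after the calendar year in which it ends.

```python
import calendar
from datetime import date, timedelta

START = date(2016, 1, 1)   # assumed first day of the dimension
FISCAL_START_MONTH = 7     # assumed: the fiscal year begins on 1 July


def date_row(n: int) -> dict:
    """Build the dimension row for the day n days after START."""
    d = START + timedelta(days=n)
    last_day = calendar.monthrange(d.year, d.month)[1]
    # Months elapsed since the start of the fiscal year (0 to 11).
    fiscal_month = (d.month - FISCAL_START_MONTH) % 12
    # Assumed naming: the fiscal year takes the year in which it ends.
    rolls_over = FISCAL_START_MONTH > 1 and d.month >= FISCAL_START_MONTH
    return {
        "date_id": n,
        "full_date": d.isoformat(),
        "day_is_weekday": 1 if d.weekday() < 5 else 0,   # Mon-Fri
        "day_is_last_of_month": 1 if d.day == last_day else 0,
        "year_week_number": d.isocalendar()[1],          # ISO 8601 week
        "fiscal_year_number": d.year + 1 if rolls_over else d.year,
        "fiscal_qtr_number": fiscal_month // 3 + 1,
    }


# range() plays the role of the number table: one integer per day.
rows = [date_row(n) for n in range(731)]
```

Because every attribute is derived from a single integer offset, the same idea carries over to SQL by joining against a number table instead of calling range().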
Now we can easily answer the previous examples:
- Only business days (Mon-Fri): day_is_weekday = 1
- All days except the last day of each month: day_is_last_of_month <> 1
- Specific week numbers of the year: year_week_number BETWEEN 32 AND 36
- A fiscal quarter: fiscal_year_number = 2017 AND fiscal_qtr_number = 3
Here is a complete example:
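A minimal end-to-end sketch of such a query follows, using Python's built-in sqlite3 module as a stand-in for Redshift. The saledateid and day_is_weekday names come from this article. The table names, the remaining columns, and the sample data are hypothetical.

```python
import sqlite3
from datetime import date, timedelta

START = date(2017, 1, 2)  # a Monday; assumed first day of the dimension

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dimension (
        date_id        INTEGER PRIMARY KEY,
        full_date      TEXT    NOT NULL,
        day_is_weekday INTEGER NOT NULL
    );
    CREATE TABLE sales (
        saleid     INTEGER PRIMARY KEY,
        saledateid INTEGER NOT NULL REFERENCES date_dimension (date_id),
        amount     REAL    NOT NULL
    );
    -- Outside Redshift, index the date ID used for joins and filters.
    CREATE INDEX sales_saledateid ON sales (saledateid);
""")


def dimension_row(n: int) -> tuple:
    """Row for the day n days after START: (date_id, full_date, flag)."""
    d = START + timedelta(days=n)
    return (n, d.isoformat(), 1 if d.weekday() < 5 else 0)


# Two weeks of dates, with one sale of 10.0 on every day.
conn.executemany(
    "INSERT INTO date_dimension VALUES (?, ?, ?)",
    [dimension_row(n) for n in range(14)],
)
conn.executemany(
    "INSERT INTO sales (saledateid, amount) VALUES (?, ?)",
    [(n, 10.0) for n in range(14)],
)

# Total sales on business days only: no date arithmetic in the query.
business_day_total = conn.execute("""
    SELECT SUM(s.amount)
    FROM sales AS s
    JOIN date_dimension AS d ON d.date_id = s.saledateid
    WHERE d.day_is_weekday = 1
""").fetchone()[0]
```

The report filter reduces to a join and a comparison against a precomputed flag, so swapping in any other condition from the list above only changes the WHERE clause.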
You should use saledateid as a part of your SORTKEY. If you are not using Redshift there should be an index on saledateid.