An important subject left out in many Business Intelligence training courses is the Slowly Changing Dimension, or SCD. When I teach BI classes, I make a point of covering the topic, and it probably gets students more excited about BI than any other topic. And believe me, that’s easier said than done. “Excited” and “Business Intelligence” don’t always fit in the same sentence.
We know a data warehouse is made up of a relational design involving facts, dimensions, attributes, and hierarchies arranged in a “Star” or “Snowflake” schema. Facts are typically numeric values we use to assess the business such as sales revenue, costs, profit margins etc. If facts are what we want to measure, then dimensions like sales revenue by year, quarter, month, customer, or region are how we want to analyze the facts. Attributes are added to the dimension tables to flesh out the dimension in order to give it more meaning. For example, attributes could be customer demographics such as age or salary range. Hierarchies are multiple dimensions related to each other such as year/quarter/month or customer/region.
So what’s a Slowly Changing Dimension? As the name suggests, it’s a dimension that changes slowly and predictably. The big decision we have to make is do we care? When a customer moves from one region to another, what should happen to their previous orders? If we are not careful, they appear under the new region and go missing from the old one. Not bad if you are a sales rep for the new region but disastrous if you are the sales rep who actually made those sales. Another example is analyzing schools. When a student moves to another school should their previous exam results apply to the new school? Of course not (depends on whether they are a good or bad student I hear you say? Shame on you…). We want to be accurate in all such cases, but with standard dimensions we may fall into the trap of losing sight of that history.
At the dimension attribute level there are at least 3 types of SCD. Type 1 SCD means the attribute is a “changing attribute”, but we only care about the most current value. Type 2 means it’s a “historical attribute” and we very much care about maintaining historical accuracy. Type 3 is for the rare attribute where we only care about the original and current value but not those changes in between, sometimes called “First and Last”. There are some other types but then we are getting too academic for this time of the day.
If we do nothing about our dimension design, we will end up with all Type 1 attributes. This might be OK for “Customer Last Name” which may change, but as long as we don’t need to analyze how many “Smiths” bought a particular product then Type 1 should be just fine. “Sales By Region” or “Scores by School” are classic examples where Type 2 SCD design is definitely needed. And the trick to enable this design is both simple and ingenious.
For the customer who moves from one region to another or the student who moves to a new school, we need to create a row in the dimension table for each move with a corresponding “Start Date” and “End Date” to indicate the period they were there. However, relational constraints limit the primary key to a single unique value per row, so enter the concept of the surrogate key. Within the data warehouse design, the original business key from the operational system becomes just another attribute. Then we add a surrogate key as the primary key and have that system generate it to be unique. Now we can have multiple rows for a customer or student with accurate start and end dates for the history of the moves. The single row without an end date value indicates the current location. And here is the ingenious part: we only need to join to the fact table using the surrogate key because, as long as we load the data correctly with our incremental load processes, the end user will be able to analyze customer by region or scores by school with confidence of accuracy even when analyzing over time-periods. The “magic” is in the data load. Don’t worry; I was skeptical too, until I tested I out. It’s an eureka moment when you see it working for the first time.
The Business Analysts then perform their analysis as normal, using high performing cubes or multi-dimensional databases, oblivious to the underlying complexity using their drag-and-drop GUI in Excel, and it’s “Business Intelligence as Usual” except with deadly accuracy.