Updating a Slowly Changing Dimension with a Recursive Relationship on its Key
A couple of weeks ago an article was reposted on SQLServerCentral.com, which discusses hierarchies in SQL (by GSquared). The basic premise of the article is how to handle parent / child relationships in the same entity table, such as a manager / employee relationship within the employees table. As GSquared mentions and links to sources on the same, once the ID / ParentID columns are set up, generating the parent / child hierarchy can be handled quite easily using a recursive Common Table Expression (CTE). But then someone in the discussion thread on the article made things a lot more complicated, asking the same question I had to solve not a week before.
What if you’ve got a recursive relationship on the surrogate key of a slowly changing dimension (SCD)? So what, just add a new record and update the parent surrogate ID. Done right? Not so fast.
How Dimensions Interact with Fact Tables and Reports
Before we can dive into how to solve this problem we have, but don’t know we have… huh? … First let’s talk about how dimensions are used to build fact tables and in reporting.
When a fact table is built, each row is filled with the surrogate keys of each related dimension and the facts relevant at the time of record. A record in the fact table and associated account dimension entity might look something like this:
And then let’s say the next time a record was recorded in this fact table for this product flow, the related account had a change in contact info, which generated a new record in the slowly changing account dimension. Our records now look like so:
Notice the SSIS pack managing your SCD dutifully updated the RecordUseStart and RecordUseEnd fields, so the fact builder would know to use the new account record when building the new row. It uses a ranking script you created to update the UseStart and UseEnd fields based on RecordInsert rank. (I’ve uploaded a ranking script to use for this operation at the bottom of this post) The newer record’s start date is the previous records end date. This bit of logic allows reports to show the account level attributes at the time the information was recorded. A very important piece of any proper data warehouse.
In a July report on production flow under each contact, the information would look something like this:
But what if this account had a parent?
Dealing With Parent / Child Keys in an SCD
New scenario: A parent account uses child accounts to do its ordering and, therefore, child accounts are associated with the product flows in your fact table. The parent / child relationship is held in the non-SCD Accounts table in the transactional system. The Account table is how your SSIS pack builds the surrogate key recursive relationship. Let’s add parents into the schema and report from above:
So the next day, another contact is changed, but this time in the parent. Your SSIS pack puts the new parent record into the dimension and updates the UseStart and UseEnd dates perfectly. Your fact table builds perfectly, just like it did the previous day. But you get a call from management. The report does not reflect the change in contact. It still shows Kelly Achtung as the parent contact. … Hmm this isn’t right. It’s supposed to show 2 records for Kelly and 2 for the new contact, Jack Sterns. You check your DimAccounts table:
It’s not a bug; it’s a feature!
The ParentAccountKey didn’t update! The child accounts should now have 763 as their parent account. … But then you’d not be able to generate data with historical accuracy, because the child accounts will no longer be related to the old parent account. … So you need to keep the original relationship, but make new records for the new relationship.
In order for your SSIS pack to populate ParentAcctKey fields correctly, it needs to know which parents are associated to the children at which times of production. The transactional table is no help, as it doesn’t hold history, so the SSIS pack uses the following query leveraging RecordUseStart and RecordUseEnd dates to figure it out.
--JA: Set ParentAccountKey UPDATE TChild SET TChild.ParentAcctKey = CASE WHEN TChild.NtrlAcctKey = TChild.NtrlParentAcctKey THEN TChild.AcctKey ELSE TParent.AcctKey END FROM @DimAcct TChild INNER JOIN @DimAcct TParent ON TChild.NtrlParentAcctKey = TParent.NtrlAcctKey AND TChild.RecordUseStart >= TParent.RecordUseStart AND TChild.RecordUseEnd <= TParent.RecordUseEnd
Logic tells us that in order for a child record to have the same parent throughout its lifespan, the parent must have RecordUseStart and RecordUseEnd dates that encapsulate the dates of the child record. This way any date referenced that pulls up the child record will also pull up just the one parent.
Here are the different scenarios the SSIS pack must figure out:
The first scenario is cut and dry. Parent and child both have new records generated at the same time, so they both have the same RecordUseStart and RecordUseEnd dates. Nothing interesting here. The SSIS pack will easily assign the correct parent.
Second scenario is also easy. Child gets new record. Child has newer RecordUseStart date than its parent and the same RecordUseEnd date as its parent, so again the SSIS pack assigns the correct parent. No extra code needed.
Now the third scenario is odd. The SSIS pack created a new parent, and now the child does not have the same parent at all points of its lifetime. At the beginning of its life, it should point to the original parent record, but now its parent record has received new data and been split to a new record, so it should point to the new parent from that point on. The SSIS pack returns two records in the join from above, and no matter which you have it choose the choice will still be wrong at some point in time.
Like Father, Like Son
At this point the obvious (obvious if you’ve had your coffee this morning) solution is to force the child to split when the parent does. If you split the child at the same time, the choice for parent becomes simple, just like the image from above the RecordUseStart and RecordUseEnd dates will match perfectly. So how do we go about picking out the children who’s parents just received new records?
Alter the statement from above to collect a list of children who have duplicate parent records after the SSIS package changes the dimension. The statement looks like this:
INSERT @DimAcct SELECT 764 --JA: simulating IDENTITY(1,1) ,TParentNew.AcctKey ,TChild.NtrlAcctKey ,TChild.NtrlParentAcctKey ,TChild.AcctName ,TChild.AcctContactFirst ,TChild.AcctContactLast ,TParentNew.RecordUseStart ,TParentNew.RecordUseEnd ,TParentNew.RecordInserted FROM @DimAcct TChild INNER JOIN @DimAcct TParentOld ON TChild.ParentAcctKey = TParentOld.AcctKey AND TChild.RecordUseEnd > TParentOld.RecordUseEnd INNER JOIN @DimAcct TParentNew ON TChild.NtrlParentAcctKey = TParentNew.NtrlAcctKey AND TChild.RecordUseEnd > TParentNew.RecordUseStart AND TChild.RecordUseEnd <= TParentNew.RecordUseEnd
This script links old and new parent records to children with Start and End dates that span the old and new parent records. It inserts a new record into the SCD for the child pre-assigned to the new parent record. It also uses the Start and End of the new parent record. All that’s left is to update the old child record’s RecordUseEnd field, so it doesn’t overlap with the new child record. Re-run the script you use to set the RecordUseStart and RecordUseEnd dates in chronological order (check file linked below).
Mystery Solved!
So there we have it. Now when we run a query at any time, we’ll get the correct child account along with its correct parent. Now I’m sure there is a more impressive way to get the job done on this one, but this is good enough for me (seems I constantly run out of time when I’m developing!). I hope this post helps shed light on this SCD dilemma.
Please comment if you find a better way to do this, or have any questions.