Dimension attributes that changes slowly over a period of time rather than changing regularly is called as slowly changing dimension (SCD). For Example address of the customer can change but not often these attribute can change over a period of time.
Type of SCD
Ø Type-0 : The Passive method(Ignore updates)
Ø Type-1 : Overwriting the old value
Ø Type-2 : Creating a new additional record to maintain the history
Ø Type-3 : Adding a new column
Ø Type-4 : Using Historical or archival table
Ø Type-6 : combine approach of type 1 ,2 and 3
SCD Type 0: The
Passive method
Using this approach we are not going to update any records
even we are getting any changes values. If we are getting new record in that
case we are inserting the new rows in the dimension table. When we are loading
the data into the dimension table using ETL in that case if we are getting any
updated record in that case either we are failing our ETL or we are redirecting
that row to some other destination for the manual intervention.
For example:
In the customer dimension below are the records
ID |
Name |
PAN_Number |
Aadhar_Number |
1 |
Bagesh Kumar Singh |
12345 |
44444 |
2 |
Mahesh |
22222 |
66666 |
Now are getting below records in the sating table to update
the PAN and Aadhar number in the
customer dimension
ID |
Name |
PAN_Number |
Aadhar_Number |
1 |
Bagesh Kumar Singh |
55555 |
33333 |
3 |
Suresh |
77777 |
88888 |
Now our Customer dimension will be like below. PAN and
Aadhar number are not updated for the customer ID 1.
ID |
Name |
PAN_Number |
Aadhar_Number |
1 |
Bagesh Kumar Singh |
12345 |
44444 |
2 |
Mahesh |
22222 |
66666 |
3 |
Suresh |
77777 |
88888 |
For the Customer ID 1 that row is redirected either Audit
table or Audit file for manual intervention.
ID |
Name |
PAN_Number |
Aadhar_Number |
1 |
Bagesh Kumar Singh |
55555 |
33333 |
Business can update this value manually.
SCD Type 1: Overwriting
the old value
Using this approach we are updating the existing records
with updated value without maintaining any history. Insert a new record into
the dimension table if that record does not exist in the dimension table. It is
very easy and saving huge amount of space because we are not maintaining any
history.
For example
In the customer dimension below are the records
ID |
Name |
Year |
Address |
1 |
Bagesh Kumar Singh |
2018 |
Pune |
2 |
Mahesh |
2018 |
Chennai |
Now are getting below records in the sating table and we are
updating the customer dimension
ID |
Name |
Year |
Address |
1 |
Bagesh Kumar Singh |
2019 |
Greater Noida |
3 |
Suresh |
2020 |
Mumbai |
Now our Customer dimension will be
ID |
Name |
Year |
Address |
1 |
Bagesh Kumar Singh |
2019 |
Greater Noida |
2 |
Mahesh |
2018 |
Chennai |
3 |
Suresh |
2020 |
Mumbai |
In the above load we get one existing ID with updated value
and one new record. Here is not mainlining any history. If anyone want to know,
what was the address of Bagesh (ID: 1) in 2018. In this approach only we are
getting latest value in the dimension table.
We are using such approach where we know that no historical
information is required in the Data ware house and reports.
Advantage of Type 1
·
This is the easiest way to handle the SCD
problem.
·
It is saving Hug amount of spaces in the
database because we are not saving the Historical information of the dimension.
Disadvantage of Type
1
·
We are not having the historical information. In
case someone updating the dimension records we may not able to recover the
updated information.
SCD Type 2: Creating
a new additional record to maintain the history
Using this approach we are maintaining the history of the
changing attributes. We are maintaining the history of the records by different
ways.
Ø
Using start date and end date
Ø
Using IsActive flag (True & False)
Ø
Using Yes or No values
Let’s see
the example
Maintaining the history using start date
and end date
In the
customer dimension below are the records
ID |
CUS_ID |
Name |
Year |
Address |
Start_Date |
End_date |
1 |
1 |
Bagesh Kumar Singh |
2018 |
Pune |
4/20/2018 |
Null |
2 |
2 |
Mahesh |
2018 |
Chennai |
4/20/2018 |
Null |
Now are getting below records in the sating table and we are
updating the customer dimension
CUS_ID |
Name |
Year |
Address |
1 |
Bagesh Kumar Singh |
2019 |
Greater Noida |
3 |
Suresh |
2020 |
Mumbai |
Now our Customer dimension will be
ID |
CUS_ID |
Name |
Year |
Address |
Start_date |
End_date |
1 |
1 |
Bagesh Kumar Singh |
2018 |
Pune |
4/20/2018 |
06/27/2019 |
2 |
2 |
Mahesh |
2018 |
Chennai |
4/20/2018 |
Null |
3 |
3 |
Suresh |
2020 |
Mumbai |
06/27/2019 |
Null |
4 |
1 |
Bagesh Kumar Singh |
2019 |
Greater Noida |
06/27/2019 |
Null |
When a new record is created in the dimension table in that
case we are storing the Start_date as Today date and Keeping End_date as Null
or any highest default date (12/31/9999). Here we are using the End date as
null.
If the existing records comes with the updated value in that
case first we are updated the Today date into the end_date column and then
inserting the new record in the dimension table with start_date as Today date
and End_date as null.
Those records are having End_date as null it means that
records are the Active records.
Maintaining the
history using IsActive flag (True & False)
In the customer dimension below are the records
ID |
CUS_ID |
Name |
Year |
Address |
Is_Active |
1 |
1 |
Bagesh Kumar Singh |
2018 |
Pune |
1 |
2 |
2 |
Mahesh |
2018 |
Chennai |
1 |
Now are getting below records in the sating table and we are
updating the customer dimension
CUS_ID |
Name |
Year |
Address |
1 |
Bagesh Kumar Singh |
2019 |
Greater Noida |
3 |
Suresh |
2020 |
Mumbai |
Now our Customer dimension will be
ID |
CUS_ID |
Name |
Year |
Address |
Is_Active |
1 |
1 |
Bagesh Kumar Singh |
2018 |
Greater Noida |
0 |
2 |
2 |
Mahesh |
2018 |
Chennai |
1 |
3 |
3 |
Suresh |
2020 |
Mumbai |
1 |
4 |
1 |
Bagesh Kumar Singh |
2019 |
Greater Noida |
1 |
When a new record is created in the dimension table in that
case we are storing the Is_Active flag with 1.
If the existing records comes
with the updated value in that case first we are updated the Is_Active flag 0
and then inserting the new record in the
dimension table with Is_Active flag to 1. Those records are having Is_Active
flag 1 it means that records are the Active records.
Maintaining the history using Yes or No values
It is similar to the above
approach (Is_Active) . here we are using CURR_REC_FL
column which is having Yes, No value. Yes means Current record and No means
historical record.
SCD Type-3: Adding a new column
In this approach usually only the current and previous value of
dimension is kept in the database. The new value is loaded into 'current/new'
column and the old one into 'old/previous' column. Generally speaking the
history is limited to the number of column created for storing historical data.
For example we can have the city name or state name.
See the below.
ID |
Name |
Current_City |
Previous_City |
1 |
Bagesh Kumar Singh |
Chennai |
Madras |
2 |
Mahesh |
Mumbai |
Bombay |
SCD Type-4: Using Historical or
archival table
In this
approach we are having two table main tables and the historical or Archive
table to maintain the history. Basically we are using the logic of SCD type 2 (Maintaining the history using start date
and end date) only the difference
between them in type two we are storing the historical record in to the same
dimension table but in the SCD Type 4 Approach we are storing the current
active record into the main table and the historical record into the another
Historical table or Archive table.
We have
DimCustomer (main table), DimCustomer_Archive (Archive table) and the
DimCustomer_STG (Stage table).
Table :
DimCustomer
ID |
CUS_ID |
Name |
Year |
Address |
Start_Date |
End_date |
1 |
1 |
Bagesh Kumar Singh |
2018 |
Pune |
4/20/2018 |
Null |
2 |
2 |
Mahesh |
2018 |
Chennai |
4/20/2018 |
Null |
Table : DimCustomer_Archive
ID |
CUS_ID |
Name |
Year |
Address |
Start_Date |
End_date |
|
|
|
|
|
|
|
Now are getting below records in the sating table and we are
updating the customer dimension
Table : DimCustomer_STG
CUS_ID |
Name |
Year |
Address |
1 |
Bagesh Kumar Singh |
2019 |
Greater Noida |
3 |
Suresh |
2020 |
Mumbai |
Now our Customer dimension will be
Table :
DimCustomer
ID |
CUS_ID |
Name |
Year |
Address |
Start_date |
End_date |
2 |
2 |
Mahesh |
2018 |
Chennai |
4/20/2018 |
Null |
3 |
3 |
Suresh |
2020 |
Mumbai |
06/27/2019 |
Null |
4 |
1 |
Bagesh Kumar Singh |
2019 |
Greater Noida |
06/27/2019 |
Null |
Table : DimCustomer_Archive
ID |
CUS_ID |
Name |
Year |
Address |
Start_date |
End_date |
1 |
1 |
Bagesh Kumar Singh |
2018 |
Greater Noida |
4/20/2018 |
06/27/2019 |
When a new record is created in the dimension table in that
case we are storing the Start_date as Today date and Keeping End_date as Null
or any highest default date (12/31/9999). Here we are using the End date as
null.
If the existing records comes with the updated value in that
case first we are we are inserting the record into the archive table with
filling End_date as today date and then deleting the record from the Main
table. In the main table (DimCustomer) we always get latest records.
SCD Type-6: combine approach of type 1 ,2 and 3
Combine approaches of types 1,2,3 (1+2+3=6). In this type we have in dimension table such additional columns as:
· current_type - for keeping current value of the attribute. All history records for given item of attribute have the same current value.
· historical_type - for keeping historical value of the attribute. All history records for given item of attribute could have different values.
· start_date - for keeping start date of 'effective date' of attribute's history.
· end_date - for keeping end date of 'effective date' of attribute's history.
· current_flag - for keeping information about the most recent record.
No comments:
Post a Comment
If you have any doubt, please let me know.