Saturday 21 May 2022

Slowly changing Dimension in DWH

Dimension attributes that changes slowly over a period of time rather than changing regularly is called as slowly changing dimension (SCD). For Example address of the customer can change but not often these attribute can change over a period of time.

Type of SCD

Ø  Type-0 : The Passive method(Ignore updates)

Ø  Type-1 : Overwriting the old value

Ø  Type-2 : Creating a new additional record to maintain the history

Ø  Type-3 : Adding a new column

Ø  Type-4 : Using Historical or archival table

Ø  Type-6 : combine approach of type 1 ,2 and 3

SCD Type 0: The Passive method

Using this approach we are not going to update any records even we are getting any changes values. If we are getting new record in that case we are inserting the new rows in the dimension table. When we are loading the data into the dimension table using ETL in that case if we are getting any updated record in that case either we are failing our ETL or we are redirecting that row to some other destination for the manual intervention.

For example:

In the customer dimension below are the records

ID

Name

PAN_Number

Aadhar_Number

1

Bagesh Kumar Singh

12345

44444

2

Mahesh

22222

66666

Now are getting below records in the sating table to update the PAN  and Aadhar number in the customer dimension

ID

Name

PAN_Number

Aadhar_Number

1

Bagesh Kumar Singh

55555

33333

3

Suresh

77777

88888

Now our Customer dimension will be like below. PAN and Aadhar number are not updated for the customer ID 1.

ID

Name

PAN_Number

Aadhar_Number

1

Bagesh Kumar Singh

12345

44444

2

Mahesh

22222

66666

3

Suresh

77777

88888

 

For the Customer ID 1 that row is redirected either Audit table or Audit file for manual intervention.

ID

Name

PAN_Number

Aadhar_Number

1

Bagesh Kumar Singh

55555

33333

Business can update this value manually.

SCD Type 1: Overwriting the old value

Using this approach we are updating the existing records with updated value without maintaining any history. Insert a new record into the dimension table if that record does not exist in the dimension table. It is very easy and saving huge amount of space because we are not maintaining any history.

For example

In the customer dimension below are the records

ID

Name

Year

Address

1

Bagesh Kumar Singh

2018

Pune

2

Mahesh

2018

Chennai

Now are getting below records in the sating table and we are updating the customer dimension

ID

Name

Year

Address

1

Bagesh Kumar Singh

2019

Greater Noida

3

Suresh

2020

Mumbai

Now our Customer dimension will be

ID

Name

Year

Address

1

Bagesh Kumar Singh

2019

Greater Noida

2

Mahesh

2018

Chennai

3

Suresh

2020

Mumbai

In the above load we get one existing ID with updated value and one new record. Here is not mainlining any history. If anyone want to know, what was the address of Bagesh (ID: 1) in 2018. In this approach only we are getting latest value in the dimension table.

We are using such approach where we know that no historical information is required in the Data ware house and reports.

Advantage of Type 1

·         This is the easiest way to handle the SCD problem.

·         It is saving Hug amount of spaces in the database because we are not saving the Historical information of the dimension.

Disadvantage of Type 1

·         We are not having the historical information. In case someone updating the dimension records we may not able to recover the updated information.

SCD Type 2: Creating a new additional record to maintain the history

Using this approach we are maintaining the history of the changing attributes. We are maintaining the history of the records by different ways.

Ø  Using start date and end date

Ø  Using IsActive flag (True & False)

Ø  Using Yes or No values

Let’s see the example

Maintaining the history using start date and end date

In the customer dimension below are the records

ID

CUS_ID

Name

Year

Address

Start_Date

End_date

1

1

Bagesh Kumar Singh

2018

Pune

4/20/2018

Null

2

2

Mahesh

2018

Chennai

4/20/2018

Null

Now are getting below records in the sating table and we are updating the customer dimension

CUS_ID

Name

Year

Address

1

Bagesh Kumar Singh

2019

Greater Noida

3

Suresh

2020

Mumbai

Now our Customer dimension will be

ID

CUS_ID

Name

Year

Address

Start_date

End_date

1

1

Bagesh Kumar Singh

2018

Pune

4/20/2018

06/27/2019

2

2

Mahesh

2018

Chennai

4/20/2018

Null

3

3

Suresh

2020

Mumbai

06/27/2019

Null

4

1

Bagesh Kumar Singh

2019

Greater Noida

06/27/2019

Null

When a new record is created in the dimension table in that case we are storing the Start_date as Today date and Keeping End_date as Null or any highest default date (12/31/9999). Here we are using the End date as null.

If the existing records comes with the updated value in that case first we are updated the Today date into the end_date column and then inserting the new record in the dimension table with start_date as Today date and End_date as null.

Those records are having End_date as null it means that records are the Active records.

Maintaining the history using IsActive flag (True & False)

In the customer dimension below are the records

ID

CUS_ID

Name

Year

Address

Is_Active

1

1

Bagesh Kumar Singh

2018

Pune

1

2

2

Mahesh

2018

Chennai

1

Now are getting below records in the sating table and we are updating the customer dimension

CUS_ID

Name

Year

Address

1

Bagesh Kumar Singh

2019

Greater Noida

3

Suresh

2020

Mumbai

Now our Customer dimension will be

ID

CUS_ID

Name

Year

Address

Is_Active

1

1

Bagesh Kumar Singh

2018

Greater Noida

0

2

2

Mahesh

2018

Chennai

1

3

3

Suresh

2020

Mumbai

1

4

1

Bagesh Kumar Singh

2019

Greater Noida

1

When a new record is created in the dimension table in that case we are storing the Is_Active flag with 1.

If the existing records comes with the updated value in that case first we are updated the Is_Active flag 0 and  then inserting the new record in the dimension table with Is_Active flag to 1. Those records are having Is_Active flag 1 it means that records are the Active records.

Maintaining the history using Yes or No values

It is similar to the above approach (Is_Active) . here we are using CURR_REC_FL column which is having Yes, No value. Yes means Current record and No means historical record.

SCD Type-3: Adding a new column

In this approach usually only the current and previous value of dimension is kept in the database. The new value is loaded into 'current/new' column and the old one into 'old/previous' column. Generally speaking the history is limited to the number of column created for storing historical data.

For example we can have the city name or state name.

See the below.

ID

Name

Current_City

Previous_City

1

Bagesh Kumar Singh

Chennai

Madras

2

Mahesh

Mumbai

Bombay

 

SCD Type-4: Using Historical or archival table

In this approach we are having two table main tables and the historical or Archive table to maintain the history. Basically we are using the logic of SCD type 2 (Maintaining the history using start date and end date) only the difference between them in type two we are storing the historical record in to the same dimension table but in the SCD Type 4 Approach we are storing the current active record into the main table and the historical record into the another Historical table or Archive table.

 See the example.

We have DimCustomer (main table), DimCustomer_Archive (Archive table) and the DimCustomer_STG (Stage table).

 In the customer dimension below are the records

Table : DimCustomer

ID

CUS_ID

Name

Year

Address

Start_Date

End_date

1

1

Bagesh Kumar Singh

2018

Pune

4/20/2018

Null

2

2

Mahesh

2018

Chennai

4/20/2018

Null

Table : DimCustomer_Archive

ID

CUS_ID

Name

Year

Address

Start_Date

End_date

 

 

 

 

 

 

 

Now are getting below records in the sating table and we are updating the customer dimension

Table : DimCustomer_STG

CUS_ID

Name

Year

Address

1

Bagesh Kumar Singh

2019

Greater Noida

3

Suresh

2020

Mumbai

Now our Customer dimension will be

Table : DimCustomer

ID

CUS_ID

Name

Year

Address

Start_date

End_date

2

2

Mahesh

2018

Chennai

4/20/2018

Null

3

3

Suresh

2020

Mumbai

06/27/2019

Null

4

1

Bagesh Kumar Singh

2019

Greater Noida

06/27/2019

Null

Table : DimCustomer_Archive

ID

CUS_ID

Name

Year

Address

Start_date

End_date

1

1

Bagesh Kumar Singh

2018

Greater Noida

4/20/2018

06/27/2019

When a new record is created in the dimension table in that case we are storing the Start_date as Today date and Keeping End_date as Null or any highest default date (12/31/9999). Here we are using the End date as null.

If the existing records comes with the updated value in that case first we are we are inserting the record into the archive table with filling End_date as today date and then deleting the record from the Main table. In the main table (DimCustomer) we always get latest records.

SCD Type-6: combine approach of type 1 ,2 and 3

Combine approaches of types 1,2,3 (1+2+3=6). In this type we have in dimension table such additional columns as:

·         current_type - for keeping current value of the attribute. All history records for given item of attribute have the same current value.

·         historical_type - for keeping historical value of the attribute. All history records for given item of attribute could have different values.

·         start_date - for keeping start date of 'effective date' of attribute's history.

·         end_date - for keeping end date of 'effective date' of attribute's history.

·         current_flag - for keeping information about the most recent record.

No comments:

Post a Comment

If you have any doubt, please let me know.

Popular Posts