Sunday, 29 May 2022

Difference between Dimension tables and Fact table

 Below is the difference between them

Parameters

Fact Table

Dimension Table

Definition

Measurements, metrics or facts about a business process.

Companion table to the fact table contains descriptive attributes to be used as query constraining.

Characteristic

Located at the center of a star or snowflake schema and surrounded by dimensions.

Connected to the fact table and located at the edges of the star or snowflake schema

Design

Defined by their grain or its most atomic level.

Should be wordy, descriptive, complete, and quality assured.

Task

Fact table is a measurable event for which dimension table data is collected and is used for analysis and reporting.

Collection of reference information about a business.

Type of Data

Facts tables could contain information like sales against a set of dimensions like Product and Date.

Evert dimension table contains attributes which describe the details of the dimension. E.g., Product dimensions can contain Product ID, Product Category, etc.

Key

Primary Key in fact is mapped as foreign keys to Dimensions.

Foreign key to the facts table

Storage

Helps to store report labels and filter domain values in dimension tables.

Load detailed atomic data into dimensional structures.

Hierarchy

Does not contain Hierarchy

Contains Hierarchies. For example Location could contain, country, pin code, state, city, etc.

 

Surrogate Key in data ware house

In the data ware house we generally we are hearing about surrogate key when we are implementing the SDC Type -2. A Surrogate Key in SQL Server is a unique identifier for each row in the table. It is just a key. Using this key we can identify a unique row. There is no business meaning for Surrogate Keys. A Surrogate Key is just a unique identifier for each row and it may use as a Primary Key. There is only requirement for a surrogate Primary Key, which is that each row must have a unique value for that column. A Surrogate Key is also known as an artificial key or identity key. It is unique value and it is generally generated by system.

For example:  see the table of SCD type 2 dimension

https://bageshkumarbagi-msbi.blogspot.com/2022/05/implementation-of-slowly-changing_17.html

CREATE TABLE dimcustomer_scd_2

  (

     dimcustomer_scd_2_id INT IDENTITY(1, 1) PRIMARY KEY,

     cus_id               INT,

     cus_nm               VARCHAR(50),

     cus_year             VARCHAR(10),

     cus_add              VARCHAR(500),

     start_date           DATE,

     end_date             DATE

  )

 

dimcustomer_scd_2_id  is the Surrogate key for this dimension table.

Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

 Below is the difference between them

Top-Down Design Approach

Bottom-Up Design Approach

Breaks the vast problem into smaller sub problems.

Solves the essential low-level problem and integrates them into a higher one.

Inherently architected- not a union of several data marts.

Inherently incremental; can schedule essential data marts first.

Single, central storage of information about the content.

Departmental information stored.

Centralized rules and control.

Departmental rules and control.

It includes redundant information.

Redundancy can be removed.

It may see quick results if implemented with repetitions.

Less risk of failure, favorable return on investment, and proof of techniques.

Read here: Top- down design approach

https://bageshkumarbagi-msbi.blogspot.com/2022/05/inmons-top-down-approach-of-data-ware.html

Read here: Bottom - Up design approach

https://bageshkumarbagi-msbi.blogspot.com/2022/05/kimballs-bottom-up-approach.html

Kimball’s Bottom – up approach

In this Approach first of all we are loading the data marts after that we are loading the data ware house. This approach is also called dimensional modeling approach.

   


In this approach, we are extracting the data from difference data source and loading the data into the staging table using ETL and after that we are loading these data to the Data marts. Once data loaded into the data Marts we are loading the data into the data ware house using ETL tool.

Inmon’s Top – down approach of data ware house design

In this approach first we are creating data ware house and after that we are creating various data marts. These data marts are dependent on the data ware house and extract the essential data from it. 

In this approach, we are extracting the data from difference data source and loading the data into the staging table using ETL and after that we are loading these data to the Data ware house. Once Data ware load is completed we are loading (refreshing) the Data marts using ETL tool.

Sunday, 22 May 2022

Rapidly changing Dimension in DWH

Rapidly changing dimension are dimension where the attribute values of the dimension changes frequently causing the dimension grow rapidly if we have designed the dimension to capture the changes as Type 2 dimension. The rapid growth of dimension will impact maintenance and preformation of this dimension.  Due to this reason we cannot use the SCD type.  To overcome with this issue we are using rapidly changing Dimension approach.

In this approach we will split a dimension into 3 different tables. One table is having the non-changed attributes and other dimension table having the changed attribute with SCD type 2 implementation and 3rd table is the bridge of table one and two.

 For example

Suppose we have a Patient dimension where we have 1k rows in it. This dimension has below columns.

Ø  Patient_id

Ø  Name

Ø  Gender

Ø  Marital_status

Ø  Weight

Ø  BMI

Patient name gender and marital status are not changing very frequently. It is changing very rarely. But weight and BMI both are frequently change. Every visit it will update. Think if a patient visits 20 times in a year for that we need to create 20 records for the single patient (19 inactive and 1 active). If we take average 15 then every year we are creating 1000*15 =15000 records.

It is impacting our performance.

To overcome with this issue we will split this dimension into 3 tables as we discuss above.

Ø  DimPatient (Updating the records using SCD type -1 approach)

Ø  Patient_mini_dim( Updating/inserting using SCD type -2 approach)

Ø  Patient_Junk(Incremental load)

Let’s see how to implement it in SQL Server

Below are the tables.

CREATE TABLE dimpatient_stg
  
(
     
dimpatient_stg_id INT IDENTITY(1, 1) PRIMARY KEY,
     
patient_id        INT,
     
p_nm              VARCHAR(50),
     
gender            VARCHAR(10),
     
marital_status    VARCHAR(20),
     
p_weight          FLOAT,
     
p_bmi             FLOAT
  
)

CREATE TABLE dimpatient
  
(
     
dimpatient_id  INT IDENTITY(1, 1) PRIMARY KEY,
     
patient_id     INT,
     
p_nm           VARCHAR(50),
     
gender         VARCHAR(10),
     
marital_status VARCHAR(20)
  
)

CREATE TABLE patient_mini_dim
  
(
     
patient_mini_dim INT IDENTITY(1, 1) PRIMARY KEY,
     
patient_id       INT,
     
start_dt         DATE,
     
end_date         DATE
  
)

CREATE TABLE patient_junk
  
(
     
patient_mini_dim INT,
     
p_weight         FLOAT,
     
p_bmi            FLOAT
  
) 

 Inserting some records in these tables

INSERT INTO dimpatient_stg
            
(patient_id,
             
p_nm,
             
gender,
             
marital_status,
             
p_weight,
             
p_bmi)
VALUES      (1,
             
'Bagesh Kumar Singh',
             
'Male',
             
'Marirred',
             85
,
             33
);

INSERT INTO dimpatient_stg
            
(patient_id,
             
p_nm,
             
gender,
             
marital_status,
             
p_weight,
             
p_bmi)
VALUES      (2,
             
'Rajesh',
             
'Male',
             
'Marirred',
             75
,
             31
);

INSERT INTO dimpatient_stg
            
(patient_id,
             
p_nm,
             
gender,
             
marital_status,
             
p_weight,
             
p_bmi)
VALUES      (3,
             
'Mohan',
             
'Male',
             
'Unmarirred',
             55
,
             25
);

INSERT INTO dimpatient
            
(patient_id,
             
p_nm,
             
gender,
             
marital_status)
VALUES      (1,
             
'Bagesh Kumar Singh',
             
'Male',
             
'Marirred');

INSERT INTO dimpatient
            
(patient_id,
             
p_nm,
             
gender,
             
marital_status)
VALUES      (2,
             
'Rajesh',
             
'Male',
             
'Marirred');

INSERT INTO patient_mini_dim
            
(patient_id,
             
start_dt)
VALUES      (1,
             
'01-12-2022');

INSERT INTO patient_mini_dim
            
(patient_id,
             
start_dt)
VALUES      (2,
             
'01-12-2022');

INSERT INTO patient_junk
            
(patient_mini_dim,
             
p_weight,
             
p_bmi)
VALUES      (1,
             80
,
             32
);

INSERT INTO patient_junk
            
(patient_mini_dim,
             
p_weight,
             
p_bmi)
VALUES      (2,
             70
,
             30
); 

See the records in the tables. 

There some records in the stage table.

   

Using below script we will the Stage table into the dimension table.

--Inserting new records in dimpatient and updating existing record (SCD type -1)
MERGE dimpatient T
using dimpatient_stg S
ON T.patient_id = s.patient_id
WHEN matched AND T.p_nm <> S.p_nm OR T.gender <> S.gender OR T.marital_status<>
s.gender THEN
  
UPDATE SET T.p_nm = S.p_nm,
             
T.gender = S.gender,
             
T.marital_status = s.marital_status
WHEN NOT matched BY target THEN
  
INSERT
  
VALUES ( S.patient_id,
           
S.p_nm,
           
S.gender,
           
s.marital_status);

-- updating the end dat of the patient( in patient_mini_dim) which are in stage comes for the load 
UPDATE m
SET    m.end_date = Getdate()
FROM   patient_mini_dim m
       
JOIN dimpatient_stg stg
         
ON stg.patient_id = m.patient_id;

--Inserting the new records of the patient( in patient_mini_dim) which are in stage comes for the load with start date and end date as null
INSERT INTO patient_mini_dim
            
(patient_id,
             
start_dt)

SELECT DISTINCT stg.patient_id,
                
Getdate() AS start_dt
FROM   dimpatient_stg stg
       
LEFT JOIN patient_mini_dim mdim
              
ON stg.patient_id = mdim.patient_id
WHERE  end_date IS NULL; 

-----------Populating the Patient_Junk table
INSERT INTO patient_junk
            
(patient_mini_dim,
             
p_weight,
             
p_bmi)
SELECT mini.patient_mini_dim,
       
stg.p_weight,
       
stg.p_bmi
FROM   dimpatient_stg stg
       
JOIN patient_mini_dim mini
         
ON mini.patient_id = stg.patient_id
WHERE  mini.end_date IS NULL; 

Running this script.

Now see the records in the tables.  

Now if we are getting below records in the stage to load.

INSERT INTO dimpatient_stg
            
(patient_id,
             
p_nm,
             
gender,
             
marital_status,
             
p_weight,
             
p_bmi)
VALUES      (1,
             
'Bagesh Kumar Bagi',
             
'Male',
             
'Marirred',
             90
,
             35
);

INSERT INTO dimpatient_stg
            
(patient_id,
             
p_nm,
             
gender,
             
marital_status,
             
p_weight,
             
p_bmi)
VALUES      (3,
             
'Mohan',
             
'Male',
             
'Marirred',
             55
,
             25
);

INSERT INTO dimpatient_stg
            
(patient_id,
             
p_nm,
             
gender,
             
marital_status,
             
p_weight,
             
p_bmi)
VALUES      (4,
             
'Rohan',
             
'Male',
             
'Marirred',
             75
,
             31
); 

Now running the script.

 


Data is loaded successfully.

Popular Posts