Microsoft Business Intelligence: May 2022

Sunday 29 May 2022

Difference between Dimension tables and Fact table

Below is the difference between them

Parameters	Fact Table	Dimension Table
Definition	Measurements, metrics or facts about a business process.	Companion table to the fact table contains descriptive attributes to be used as query constraining.
Characteristic	Located at the center of a star or snowflake schema and surrounded by dimensions.	Connected to the fact table and located at the edges of the star or snowflake schema
Design	Defined by their grain or its most atomic level.	Should be wordy, descriptive, complete, and quality assured.
Task	Fact table is a measurable event for which dimension table data is collected and is used for analysis and reporting.	Collection of reference information about a business.
Type of Data	Facts tables could contain information like sales against a set of dimensions like Product and Date.	Evert dimension table contains attributes which describe the details of the dimension. E.g., Product dimensions can contain Product ID, Product Category, etc.
Key	Primary Key in fact is mapped as foreign keys to Dimensions.	Foreign key to the facts table
Storage	Helps to store report labels and filter domain values in dimension tables.	Load detailed atomic data into dimensional structures.
Hierarchy	Does not contain Hierarchy	Contains Hierarchies. For example Location could contain, country, pin code, state, city, etc.

Surrogate Key in data ware house

In the data ware house we generally we are hearing about surrogate key when we are implementing the SDC Type -2. A Surrogate Key in SQL Server is a unique identifier for each row in the table. It is just a key. Using this key we can identify a unique row. There is no business meaning for Surrogate Keys. A Surrogate Key is just a unique identifier for each row and it may use as a Primary Key. There is only requirement for a surrogate Primary Key, which is that each row must have a unique value for that column. A Surrogate Key is also known as an artificial key or identity key. It is unique value and it is generally generated by system.

For example: see the table of SCD type 2 dimension

https://bageshkumarbagi-msbi.blogspot.com/2022/05/implementation-of-slowly-changing_17.html

CREATE TABLE dimcustomer_scd_2

(

dimcustomer_scd_2_id INT IDENTITY(1, 1) PRIMARY KEY,

cus_id INT,

cus_nm VARCHAR(50),

cus_year VARCHAR(10),

cus_add VARCHAR(500),

start_date DATE,

end_date DATE

)

dimcustomer_scd_2_id is the Surrogate key for this dimension table.

Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

Below is the difference between them

Top-Down Design Approach	Bottom-Up Design Approach
Breaks the vast problem into smaller sub problems.	Solves the essential low-level problem and integrates them into a higher one.
Inherently architected- not a union of several data marts.	Inherently incremental; can schedule essential data marts first.
Single, central storage of information about the content.	Departmental information stored.
Centralized rules and control.	Departmental rules and control.
It includes redundant information.	Redundancy can be removed.
It may see quick results if implemented with repetitions.	Less risk of failure, favorable return on investment, and proof of techniques.

Read here: Top- down design approach

https://bageshkumarbagi-msbi.blogspot.com/2022/05/inmons-top-down-approach-of-data-ware.html

Read here: Bottom - Up design approach

https://bageshkumarbagi-msbi.blogspot.com/2022/05/kimballs-bottom-up-approach.html

Kimball’s Bottom – up approach

In this Approach first of all we are loading the data marts after that we are loading the data ware house. This approach is also called dimensional modeling approach.

In this approach, we are extracting the data from difference data source and loading the data into the staging table using ETL and after that we are loading these data to the Data marts. Once data loaded into the data Marts we are loading the data into the data ware house using ETL tool.

Inmon’s Top – down approach of data ware house design

In this approach first we are creating data ware house and after that we are creating various data marts. These data marts are dependent on the data ware house and extract the essential data from it.

In this approach, we are extracting the data from difference data source and loading the data into the staging table using ETL and after that we are loading these data to the Data ware house. Once Data ware load is completed we are loading (refreshing) the Data marts using ETL tool.

Sunday 22 May 2022

Rapidly changing Dimension in DWH

Rapidly changing dimension are dimension where the attribute values of the dimension changes frequently causing the dimension grow rapidly if we have designed the dimension to capture the changes as Type 2 dimension. The rapid growth of dimension will impact maintenance and preformation of this dimension. Due to this reason we cannot use the SCD type. To overcome with this issue we are using rapidly changing Dimension approach.

In this approach we will split a dimension into 3 different tables. One table is having the non-changed attributes and other dimension table having the changed attribute with SCD type 2 implementation and 3^rd table is the bridge of table one and two.

For example

Suppose we have a Patient dimension where we have 1k rows in it. This dimension has below columns.

Ø Patient_id

Ø Name

Ø Gender

Ø Marital_status

Ø Weight

Ø BMI

Patient name gender and marital status are not changing very frequently. It is changing very rarely. But weight and BMI both are frequently change. Every visit it will update. Think if a patient visits 20 times in a year for that we need to create 20 records for the single patient (19 inactive and 1 active). If we take average 15 then every year we are creating 1000*15 =15000 records.

It is impacting our performance.

To overcome with this issue we will split this dimension into 3 tables as we discuss above.

Ø DimPatient (Updating the records using SCD type -1 approach)

Ø Patient_mini_dim( Updating/inserting using SCD type -2 approach)

Ø Patient_Junk(Incremental load)

Let’s see how to implement it in SQL Server

Below are the tables.

CREATE TABLE dimpatient_stg
  (
     dimpatient_stg_id INT IDENTITY(1, 1) PRIMARY KEY,
     patient_id        INT,
     p_nm              VARCHAR(50),
     gender            VARCHAR(10),
     marital_status    VARCHAR(20),
     p_weight          FLOAT,
     p_bmi             FLOAT
  )

CREATE TABLE dimpatient
  (
     dimpatient_id  INT IDENTITY(1, 1) PRIMARY KEY,
     patient_id     INT,
     p_nm           VARCHAR(50),
     gender         VARCHAR(10),
     marital_status VARCHAR(20)
  )

CREATE TABLE patient_mini_dim
  (
     patient_mini_dim INT IDENTITY(1, 1) PRIMARY KEY,
     patient_id       INT,
     start_dt         DATE,
     end_date         DATE
  )

CREATE TABLE patient_junk
  (
     patient_mini_dim INT,
     p_weight         FLOAT,
     p_bmi            FLOAT
  )

Inserting some records in these tables

INSERT INTO dimpatient_stg
            (patient_id,
             p_nm,
             gender,
             marital_status,
             p_weight,
             p_bmi)
VALUES      (1,
             'Bagesh Kumar Singh',
             'Male',
             'Marirred',
             85,
             33);

INSERT INTO dimpatient_stg
            (patient_id,
             p_nm,
             gender,
             marital_status,
             p_weight,
             p_bmi)
VALUES      (2,
             'Rajesh',
             'Male',
             'Marirred',
             75,
             31);

INSERT INTO dimpatient_stg
            (patient_id,
             p_nm,
             gender,
             marital_status,
             p_weight,
             p_bmi)
VALUES      (3,
             'Mohan',
             'Male',
             'Unmarirred',
             55,
             25);

INSERT INTO dimpatient
            (patient_id,
             p_nm,
             gender,
             marital_status)
VALUES      (1,
             'Bagesh Kumar Singh',
             'Male',
             'Marirred');

INSERT INTO dimpatient
            (patient_id,
             p_nm,
             gender,
             marital_status)
VALUES      (2,
             'Rajesh',
             'Male',
             'Marirred');

INSERT INTO patient_mini_dim
            (patient_id,
             start_dt)
VALUES      (1,
             '01-12-2022');

INSERT INTO patient_mini_dim
            (patient_id,
             start_dt)
VALUES      (2,
             '01-12-2022');

INSERT INTO patient_junk
            (patient_mini_dim,
             p_weight,
             p_bmi)
VALUES      (1,
             80,
             32);

INSERT INTO patient_junk
            (patient_mini_dim,
             p_weight,
             p_bmi)
VALUES      (2,
             70,
             30);

See the records in the tables.

There some records in the stage table.

Using below script we will the Stage table into the dimension table.

--Inserting new records in dimpatient and updating existing record (SCD type -1)
MERGE dimpatient T
using dimpatient_stg S
ON T.patient_id = s.patient_id
WHEN matched AND T.p_nm <> S.p_nm OR T.gender <> S.gender OR T.marital_status<>
s.gender THEN
  UPDATE SET T.p_nm = S.p_nm,
             T.gender = S.gender,
             T.marital_status = s.marital_status
WHEN NOT matched BY target THEN
  INSERT
  VALUES ( S.patient_id,
           S.p_nm,
           S.gender,
           s.marital_status);

-- updating the end dat of the patient( in patient_mini_dim) which are in stage comes for the load
UPDATE m
SET    m.end_date = Getdate()
FROM   patient_mini_dim m
       JOIN dimpatient_stg stg
         ON stg.patient_id = m.patient_id;

--Inserting the new records of the patient( in patient_mini_dim) which are in stage comes for the load with start date and end date as null
INSERT INTO patient_mini_dim
            (patient_id,
             start_dt)

SELECT DISTINCT stg.patient_id,
                Getdate() AS start_dt
FROM   dimpatient_stg stg
       LEFT JOIN patient_mini_dim mdim
              ON stg.patient_id = mdim.patient_id
WHERE  end_date IS NULL;

-----------Populating the Patient_Junk table
INSERT INTO patient_junk
            (patient_mini_dim,
             p_weight,
             p_bmi)
SELECT mini.patient_mini_dim,
       stg.p_weight,
       stg.p_bmi
FROM   dimpatient_stg stg
       JOIN patient_mini_dim mini
         ON mini.patient_id = stg.patient_id
WHERE  mini.end_date IS NULL;

Running this script.

Now see the records in the tables.

Now if we are getting below records in the stage to load.

INSERT INTO dimpatient_stg
            (patient_id,
             p_nm,
             gender,
             marital_status,
             p_weight,
             p_bmi)
VALUES      (1,
             'Bagesh Kumar Bagi',
             'Male',
             'Marirred',
             90,
             35);

INSERT INTO dimpatient_stg
            (patient_id,
             p_nm,
             gender,
             marital_status,
             p_weight,
             p_bmi)
VALUES      (3,
             'Mohan',
             'Male',
             'Marirred',
             55,
             25);

INSERT INTO dimpatient_stg
            (patient_id,
             p_nm,
             gender,
             marital_status,
             p_weight,
             p_bmi)
VALUES      (4,
             'Rohan',
             'Male',
             'Marirred',
             75,
             31);

Now running the script.

Data is loaded successfully.

Microsoft Business Intelligence

Pages