Sunday, 3 July 2016

Data profiling in SSIS

The Data Profiling task is used to analyze data patterns within a SQL Server table. This analysis is useful for examining data before loading it into a final destination, such as a data warehouse. By analyzing the patterns in the data we can judge how clean it is before loading it into the data warehouse. By running a profiling task on incoming data we can verify that the new data meets the quality we expect before loading it into its final location; if it does not meet that quality, data profiling gives us a basis for rejecting the incoming data.
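As a rough illustration of the kind of check the Data Profiling task automates, here is a minimal hand-written T-SQL quality gate. The staging table stg.Customer, the EmailAddress column, and the 5% threshold are hypothetical and used only for illustration.

-- Hypothetical pre-load quality gate: fail the load if more than 5% of
-- incoming email addresses are NULL. Table, column, and threshold are assumptions.
DECLARE @NullPercent DECIMAL(5, 2);

SELECT @NullPercent = 100.0 * SUM(CASE WHEN EmailAddress IS NULL THEN 1 ELSE 0 END)
                            / NULLIF(COUNT(*), 0)
FROM stg.Customer;

IF @NullPercent > 5.0
    RAISERROR('Incoming data failed the NULL-ratio check; rejecting the batch.', 16, 1);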

Now let's see how to do this kind of analysis with the Data Profiling task in an SSIS package.

Open SSDT.
Drag the Data Profiling task onto the Control Flow.

Double-click the task to open the Data Profiling Task Editor.
 
Create an XML file to store the result of the data profiling, and configure the task's destination to point to that XML file.
 
After configuring the output file, we need to configure the database connection.
Now click on Quick Profile.
 
The Single Table Quick Profile Form will open.
 
Here we need to create an ADO.NET connection and select the table or view that we want to profile.
Here I am using the AdventureWorksDW2008R2 database.

Click OK.
Now we need to select the table or view.
 
Also select the profiles you want to compute for the column values.
Now we can configure the profile requests.

Here we can select a specific column to profile; if we specify (*), all columns are profiled.

Profile Types

There are two types of profiles: single-column profiles and multi-column/multi-table profiles.
Single-Column Profiles:
Ø  Column Length Distribution: As the name implies, this profiles the length of the values in a column. It reports all the distinct lengths of string values in the selected column and the percentage of rows in the table that each length represents. This profile helps us identify problems in our data, such as values that are not valid. For example, if we profile a column of United States state codes that should be two characters and discover values longer than two characters, those rows are suspect.

Ø  Column Pattern: This reports a set of regular expressions that cover the specified percentage of values in a string column. This profile helps us identify problems in our data, such as strings that are not valid. It can also suggest regular expressions that can be used in the future to validate new values. For example, a pattern profile of a United States Zip Code column might produce the regular expressions \d{5}-\d{4}, \d{5}, and \d{9}. If we see other regular expressions, our data likely contains values that are not valid or are in an incorrect format.
Ø  Column NULL Ratio: It shows the percentage of NULL values in a column. This profile helps us to identify problems in our data, such as an unexpectedly high ratio of null values in a column. For example, we profile a Zip Code/Postal Code column and discover an unacceptably high percentage of missing codes.
Ø  Column Statistics: It reports the minimum, maximum, average, and standard deviation for numeric columns, and the minimum and maximum for date columns. This profile helps us identify problems in our data, such as dates that are not valid. For example, we profile a column of historical dates and discover a maximum date that is in the future.
Ø  Column Value Distribution: It identifies how many distinct values are present in the selected column. This profile helps us identify problems in our data, such as an incorrect number of distinct values in a column. For example, we profile a column that is supposed to contain states in the United States and discover more than 50 distinct values. (A short T-SQL sketch of manual checks equivalent to these single-column profiles follows below.)
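To make these single-column profiles concrete, here is a minimal T-SQL sketch of roughly equivalent manual checks. It assumes the dbo.DimCustomer table of the AdventureWorksDW2008R2 database used above; the specific columns (Phone, EmailAddress, YearlyIncome, MaritalStatus) are assumptions chosen only for illustration.

-- Column Length Distribution: distinct lengths of Phone and the share of rows per length.
SELECT LEN(Phone) AS PhoneLength,
       COUNT(*) AS RowsWithThisLength,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS PercentOfRows
FROM dbo.DimCustomer
GROUP BY LEN(Phone)
ORDER BY PhoneLength;

-- Column Pattern (very rough LIKE-based stand-in for the regular expressions the task suggests):
-- count email addresses that do not look like name@domain.tld.
SELECT COUNT(*) AS SuspectEmails
FROM dbo.DimCustomer
WHERE EmailAddress NOT LIKE '%_@_%._%';

-- Column NULL Ratio: percentage of NULL email addresses.
SELECT 100.0 * SUM(CASE WHEN EmailAddress IS NULL THEN 1 ELSE 0 END)
             / NULLIF(COUNT(*), 0) AS NullPercent
FROM dbo.DimCustomer;

-- Column Statistics: minimum, maximum, average, and standard deviation of a numeric column.
SELECT MIN(YearlyIncome) AS MinIncome,
       MAX(YearlyIncome) AS MaxIncome,
       AVG(YearlyIncome) AS AvgIncome,
       STDEV(YearlyIncome) AS StdevIncome
FROM dbo.DimCustomer;

-- Column Value Distribution: distinct values of MaritalStatus and how often each occurs.
SELECT MaritalStatus,
       COUNT(*) AS RowsWithThisValue
FROM dbo.DimCustomer
GROUP BY MaritalStatus
ORDER BY RowsWithThisValue DESC;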

Multi-Column/Multi-Table Profiles:
Ø  Candidate Key: It reports whether a column or set of columns is a key, or an approximate key, for the selected table. This profile also helps you identify problems in your data, such as duplicate values in a potential key column.
Ø  Functional Dependency: It reports the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column).
Ø  Value Inclusion: It computes the overlap in the values between two columns or sets of columns. This profile can determine whether a column or set of columns is appropriate to serve as a foreign key between the selected tables. (A short T-SQL sketch of manual checks equivalent to these multi-column profiles follows below.)
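Likewise, here is a minimal T-SQL sketch of manual checks that approximate these multi-column profiles, again against AdventureWorksDW2008R2; the tables and columns used (dbo.DimCustomer, dbo.FactInternetSales, CustomerAlternateKey, EnglishEducation, SpanishEducation, CustomerKey) are assumptions for illustration.

-- Candidate Key: if the distinct count equals the row count, the column is a key candidate.
SELECT COUNT(*)                             AS TotalRows,
       COUNT(DISTINCT CustomerAlternateKey) AS DistinctValues
FROM dbo.DimCustomer;

-- Functional Dependency: determinant values that map to more than one dependent value
-- (does EnglishEducation determine SpanishEducation?). No rows returned means full dependency.
SELECT EnglishEducation,
       COUNT(DISTINCT SpanishEducation) AS DependentValues
FROM dbo.DimCustomer
GROUP BY EnglishEducation
HAVING COUNT(DISTINCT SpanishEducation) > 1;

-- Value Inclusion: CustomerKey values in the fact table that are missing from DimCustomer,
-- i.e. values that would break a foreign key between the two tables.
SELECT DISTINCT f.CustomerKey
FROM dbo.FactInternetSales AS f
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DimCustomer AS c
                  WHERE c.CustomerKey = f.CustomerKey);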
Now the package is ready to run, so I execute it.

Package executed successfully.
To see the profiling results, open the Data Profiling task and click Open Profile Viewer.
 
The Data Profile Viewer will open.
  

Here we can see the profile results.
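Because the output is just an XML file, it can also be pulled into SQL Server for ad hoc inspection instead of (or in addition to) the Data Profile Viewer. A minimal sketch, assuming the task was configured to write its output to C:\Profiles\DataProfile.xml (the path is an assumption):

-- Load the Data Profiling task's XML output for inspection in SSMS.
DECLARE @ProfileOutput XML;

SELECT @ProfileOutput = CAST(BulkColumn AS XML)
FROM OPENROWSET(BULK N'C:\Profiles\DataProfile.xml', SINGLE_BLOB) AS src;  -- path is an assumption

SELECT @ProfileOutput AS ProfileOutput;  -- click the XML in the results grid to browse it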
