The Data
Profiling task is used to perform analysis of data patterns within a SQL Server
table. This analysis is useful for examining data prior to loading
it into a final destination, like a data warehouse. By analyzing data and
determining the patterns in the data we can determine how clean the data might
be, prior to loading it into the data warehouse. By performing a
profiling task on incoming data we are able to verify our new data meets the
quality we expect prior to loading the data into its final location. If
the data doesn’t meet the quality we normally expect data profiling allows and
we to reject the incoming data.
See the example
Open the SSDT
Take Data Profiling task on Control
flow
Double click on the task. Data
profiling task editor will be open.
Creating the XML file to store the
result of the data profiling and configuring the xml file
After configuring the file we need to
configure the database.
Now click on the Quick profile
Single table Quick profile form will
be open.
Here we need to create the ADO.NET
Connection and select the table or view for which we want to Profiling.
Here I am taking AdventureWorksDW2008R2
database.
Click ok.
Now we need to select the table or
view.
And also select the Compute column
value.
Now we can set the Profile request.
Here we can select the column name if
we want to profiling if we give the (*) it means we are profiling the all
column value.
Profile
Types
There are
two types of profiles: single-column and multi-column| multi-table.
Single-Column Profiles:
Ø
Column
Length Distribution: As the name implies, this profiles the
length of the values in each column. This profiler is Reports
all the distinct lengths of string values in the selected column and the
percentage of rows in the table that each length represents. This profile helps
us identify problems in our data, such as values that are not valid. For
example, you profile a column of United States state codes that should be two
characters and discover values longer than two characters.
Ø
Column Pattern: This Reports a set of regular expressions that cover the
specified percentage of values in a string column. This profile helps us to
identify problems in our data, such as strings that are not valid. This profile
can also suggest regular expressions that can be used in the future to validate
new values. For example, a pattern profile of a United States Zip Code column
might produce the regular expressions: \d{5}-\d{4}, \d{5}, and \d {9}. If we
see other regular expressions, our data likely contains values that are not
valid or in an incorrect format.
Ø
Column
NULL Ratio: It shows the percentage of NULL values
in a column. This profile helps us to identify problems in our data, such as an
unexpectedly high ratio of null values in a column. For example, we profile a
Zip Code/Postal Code column and discover an unacceptably high percentage of
missing codes.
Ø
Column
Statistics: It displays the minimum, maximum, average, and
standard deviation of date and number values. This profile helps us to
identify problems in our data, such as dates that are not valid. For example,
we profile a column of historical dates and discover a maximum date that is in
the future.
Ø Column
Value Distribution
It identifies how many different values are present in
a column. This profile helps us to identify problems in our data, such as
an incorrect number of distinct values in a column. For example, we profile a
column that is supposed to contain states in the United States and discover
more than 50 distinct values.
Multi-Column| Table Profiles:
Ø Candidate Key: It reports whether a column or set of columns is a key, or
an approximate key, for the selected table. This profile also helps you
identify problems in your data, such as duplicate values in a potential key
column.
Ø
Functional
Dependency: It reports the extent to
which the values in one column (the dependent column) depend on the values in
another column or set of columns (the determinant column).
Ø Value Inclusion: It computes the overlap in the values between two columns
or sets of columns. This profile can determine whether a column or set of
columns is appropriate to serve as a foreign key between the selected tables.
Now it is ready to run. I am executing
the package.
Package executed successfully.
Now to see the result of profiling
click on the Data Profiling task and click on Open profiler
Data profiler viewer will be open
Here we see the profiler result.