Friday, 28 December 2018

Fuzzy Lookup in SSIS

SSIS Lookup Transform tasks retrieve only exact matches (Lookup Match) or nonmatches (Lookup No Match). To match items that are similar, but not exact, SSIS provides a different transform task called Fuzzy Lookup
For example, if I have a list of items that I want to look up in another list but they don't exactly match, I can use a Fuzzy Lookup for my Source table. The Fuzzy Lookup will connect to a 2nd table (called a Reference table) to retrieve and attempt to match values based on a percentage of "similarity" that you will provide in the task. I will have produced a list of potential matches for my Source Data that I can review.

Let’s see the example

Table script which is going to use in the demo example
--Reference table
CREATE TABLE [dbo].[Occupation](
    [OccupationID] [smallint] IDENTITY(1,1) NOT NULL,
    [OccupationLabel] [varchar](50) NOT NULL,
 CONSTRAINT [PK_Occupation_OccupationID] PRIMARY KEY CLUSTERED
(
    [OccupationID] ASC
) ON [PRIMARY]
) ON [PRIMARY]
            
GO
INSERT INTO [dbo].[Occupation] Select 'CUSTOMER SERVICE REPRESENTATIVE'
INSERT INTO [dbo].[Occupation] Select 'SHIFT LEADER'
INSERT INTO [dbo].[Occupation] Select 'ASSISTANT MANAGER'
INSERT INTO [dbo].[Occupation] Select 'STORE MANAGER'
INSERT INTO [dbo].[Occupation] Select 'DISTRICT MANAGER'
INSERT INTO [dbo].[Occupation] Select 'REGIONAL MANAGER'

--Destination table
CREATE TABLE [EmployeeRoster] (
    [EmployeeID] [smallint] IDENTITY(1,1) NOT NULL,
    [LastName] varchar(50),
    [FirstName] varchar(50),
    [OccupationID] smallint,
    [OccupationLabel] varchar(50)
)

We have a reference table where we stored the job position (designation).
   

And we are getting the files which we need to load on the destination table. Source files may contain some bad data also. If the source file data will match more than 60% then we will correct the source and take the value of the reference table value and load it into the destination.
See the flat file which is going to use.
   
 See the Title CUST SERVICE REP this is partially matched with the reference table. The value should be Customer service representative.
Taking Data flow task
   
   
Taking source as the flat file

       
   
Creating the file connection manager and configure the same.
Now here I am taking Look up and fuzzy lookup transformation
  

If the source exact match the record goes to the destination table. And the source is not an exact match then, in this case, we are checking the percentage of matching. If the Source matches 60% or more then it from the reference table, we will use the reference value and load it into the destination table.
  Configuring first Lookup transformation
       
    
If the source record does not match then unmatched record goes to Fuzzy look up.

                       

Configure the reference table. Now we move to the Columns table.
Map the column 

Now we are going to the Advance table.
   
 
Here we will do the setting for the fuzzy look up.
The maximum number of matches to output per lookup: Here we will specify the Maximum number of matches the Fuzzy Lookup Transformation may return. The default is 1.
Similarity threshold: Here we will specify similarity threshold ranges between 0 and 1 where 1 is an exact match. The Fuzzy Lookup Transformation Editor provides a slider to adjust the similarity between 0 and 1. If similarity threshold is closer to 1 then source column should match more accurately to reference data. In this example, I am taking the similarity threshold is .60 (60%)
Here we can define the token if we want
    

Click ok.
We will get the below two new extra column
    

_Similarity: Similarity score is between 0 and 1. It will show you, How much accurate the source data is matched with the fuzzy lookup data. For instance, 0.60 means source data is 60% matched with the destination.
_Confidence: How much confident Fuzzy Lookup Transformation is about the best match in the fuzzy lookup table. For instance, 0.60 means Fuzzy Lookup Transformation is 60% confident that, SHIFT LEADER is the best match for SHIFT LEAD.
Now I am taking Union all to merge both data (match and un-matched)
       

 Now taking the destination as oledb Destination.
     

Configuring the destination and mapping the column.
Now the package is ready to run.
Before run records in the table
      
 
Now I am running the package. Before running the package just I am placing the data viewer to see the data before loading data to the destination table.
   

Now ruining this package
    

Here similarity level is more than 60%
   
 

All 22 records are loaded into the destination table.
Now I am going to increase the similarity level to .80%. In this case, multiple records are not loaded into the destination.
  

Now running this package  
   
     
When we are running this package for performing the fuzzy loop up in the database it creates the temp table and once package completed successfully table has been deleted from the temp table.
See before running the package in the database
     

Now I am running this package.to keep running this package I am using Data Viewer.
  

See the table in temp DB
  

Temp table has been created to perform fuzzy loop up. Once package completed successfully temp table dropped automatically.
    
 
See the table in temp DB.

                           

 Where would you use the Fuzzy Lookup


The main uses for the Fuzzy Lookup that I have encountered have been matching supposedly common data from different data sources such as client lists and reduplicating reference data where items have been entered many times with slight variations. Essentially any time you need to tie up data that should be – but isn’t – the same, this is a very useful tool.

Point to remember:

  • If the Exhaustive property (in ADVANCED Editor) is set to False, the Fuzzy Lookup transformation returns only matches that have at least one indexed token or substring in common.
  • Set Exhaustive property to True for getting a more accurate result and for comparison of all the rows. If you set the Exhaustive property to True, the transformation compares every row in the input to every row in the reference table. This comparison algorithm may produce more accurate results, but it is likely to make the transformation perform more slowly unless the number of rows in the reference table is small.
  • Only input columns with the DT_WSTR and DT_STR data types can be used in fuzzy matching.
  • Fuzzy Lookup Transformation only use OLE DB Connection Manager to establish the connection with the Reference table
  • This transformation uses Reference table to compare the wrong word in source data with the correct word in the reference table.
  • Fuzzy Lookup allows us to use only DT_WSTR and DT_STR Data type column for Fuzzy Matching.
  • This transformation creates temporary tables and indexes in the SQL Server database at runtime.

No comments:

Post a Comment

If you have any doubt, please let me know.

Popular Posts