We are getting data from four source CSV files. Sometimes the same records appear in more than one file. The business wants only unique records loaded into the main table, so we need to remove the duplicates and load the unique records.
Files in the folder:
Below is the data in the files:
Below is the destination table:
CREATE TABLE Employees(
    EmployeeID int NOT NULL PRIMARY KEY,
    LastName varchar(50) NOT NULL,
    FirstName varchar(50) NOT NULL,
    Title varchar(50) NULL
)
1. Read all the files, merge the data using a Union All transformation, then use a Sort transformation to remove the duplicates and load the unique records into the destination table.
2. Load all the files into a staging table using a Foreach Loop container, then use an Execute SQL Task with a SQL script that filters out the duplicate records and loads the unique records into the destination table (a sketch of such a script is shown after this list).
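For the second approach, the Execute SQL Task could run a script along the lines of the sketch below. This is only an illustration, not the script from the post: the staging table name dbo.Employees_Stage is an assumption, and it is assumed to have the same columns as dbo.Employees.

-- Sketch of the dedupe script for the second approach (Execute SQL Task).
-- Assumption: the Foreach Loop container has already loaded every CSV file
-- into a staging table named dbo.Employees_Stage (illustrative name only)
-- with the same columns as dbo.Employees.
;WITH Deduped AS
(
    SELECT EmployeeID, LastName, FirstName, Title,
           ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY EmployeeID) AS rn
    FROM dbo.Employees_Stage
)
INSERT INTO dbo.Employees (EmployeeID, LastName, FirstName, Title)
SELECT EmployeeID, LastName, FirstName, Title
FROM Deduped
WHERE rn = 1                                      -- keep one row per EmployeeID
  AND NOT EXISTS (SELECT 1 FROM dbo.Employees e   -- skip keys already in the table
                  WHERE e.EmployeeID = Deduped.EmployeeID);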
In this demo we will see the first way to load the data.
Let’s see.
We add a Data Flow Task.
In the Data Flow Task we add four Flat File Sources, configure each connection, and do the column mapping.
Now we add the Union All transformation. This merges the data into a single dataset.
Read more about the Union All transformation:
https://bageshkumarbagi-msbi.blogspot.com/2017/09/union-all-transformation-in-ssis.html
In the Sort transformation we need to select the key column(s) to sort on, and check the "Remove rows with duplicate sort values" option to delete the duplicate records.
Read more about the Sort transformation:
https://bageshkumarbagi-msbi.blogspot.com/2015/02/sort-transformation.html
Now we add an OLE DB Destination and do the column mapping.
Below is the complete data flow.
Now we will run the package.
Data in the table before the run:
Running the package now:
Package executed successfully.
See the records in the table.
Unique data loaded successfully.
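As a quick sanity check (not shown in the original post), a query like the following can confirm that no duplicate EmployeeID values made it into the destination table:

-- Returns no rows if every EmployeeID in the destination table is unique.
SELECT EmployeeID, COUNT(*) AS DuplicateCount
FROM dbo.Employees
GROUP BY EmployeeID
HAVING COUNT(*) > 1;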
The drawback of this approach is that if we want to add a new file, we have to modify the package; otherwise the new file will not be loaded.
If we run the package now, it loads only the original four files. To load a 5th file we have to change the package; only then can the data of the 5th file be loaded.
Now we remove one file from the source folder and run the package again.
The package fails with the error below.
To load the remaining files we have to modify the package; only then can the data be loaded into the table.
In my opinion, the second approach is the best way to handle this type of scenario.