Wednesday 12 March 2014

Exercise 3: How to remove duplicates? If source is a flat file

Exercise 3: How to remove duplicates? If source is a flat file
OR
How to send distinct records to one target and duplicates to others

Theory/Analysis: When records are repeating then we are saying table have duplicates. We can achieve this by comparing current record and previous record. If it's matched then record would be flagged as duplicate record otherwise it's a distinct record. Yes, we need to sort the data first before comparing current and previous record.

Practically:

Step 1:
Drag and Drop your source/target to mapping designer.

Step 2: 
Sorter
Sort the data based upon port for which you are finding duplicate.

EMPNO     Key (Check)

Step 3:
Expression
Create three ports.
EMPNO_COMPARE (Variable Port) = IIF(EMPNO=OLD_REC, 1, 0)
OLD_EMPNO  (Variable Port ) = EMPNO
DUP_RECORD (Variable Port ) = EMPNO_COMPARE

Step 4:
Router
Create two Groups
New_Record: DUP_RECORD = 0
Dup_Record: DUP_RECORD = 1

Step 5:
Connect New_Record group to distinct target and Dup_Record group ports to duplicate target.

BINGO!!!

NOTE: Above solution will eliminate only column level duplicates means if only EMPNO is repeating then only it will filter it out to duplicate target table.
This exercise is best exercise is to understand that how variable ports works and how the records processed through expression and ports. (Use Debugger and analyse it.)



Feel free to post your doubt in comment section.

Due to limitation, I can't attach exported .xml file which you can Import in your repository and see or analyse. (Feel free to mail me if you need .xml file with Scenario number and Question.)

If you like the approach and solution feel free to share with others.

Thanks.
hpandey.harsh@gmail.com

1 comment:

  1. Another logic for the above scenario:
    v_count=IIF(empno=v_prev,v_count+1,1)
    o_count=v_count
    v_prev=empno

    ReplyDelete