Change Data Capture (CDC) / MD5

What is Change Data Capture?
Change Data Capture is a design which is used to Capture only those data which are newly inserted/updated in Source from previous read.

Mostly we achieved this by maintaining Date or Flag in Source Table to indicate inserted/updated status.

Challenge

The Challenge is implementing CDC on such type of source where there is no Date/Flag to indicate inserted/updated records. 

Note: Here we are not talking about date columns in table such as HIRE_DATE, Date_of_Birth. We are talking about an additional column which we are maintaining only for CDC. A date on which record is inserted or updated. For understanding we can say EFFECTIVE DATE.

MD5:

Calculates the checksum of the input value. The function uses Message-Digest algorithm 5 (MD5). MD5 is a one-way cryptographic hash function with a 128-bit hash value. You can conclude that input values are different when the checksums of the input values are different. (Copy Paste of Bible ;-) )

In my words.

Informatica is a great friend of us for Data Integration job. For defeating CDC Challenge he has used Message-Digest-Algorithm 5 to develop MD5 () function.
  1. Checksum Value: Checksum Value is a value which is returned by MD5 function.
  2. MD5() function named on algorithm it's used (Message-Digest_Algorithm 5)
  3. MD5() function works for only Char/Binary data.
  4. MD5() return a value of 32 char (which is nothing but a Hexadecimial value means in return you will see combination of 0-9 and a-f.
  5. MD5() return value also represented as Hash Key. In Real Environment mostly we are using this word ;-)
  6. MD5() returns NULL if passing value is also NULL

Ex: Return Value of MD5(America), MD5(AmErica), MD5(america) would be different. Means if any of the character in string has changed then MD5 return value will be changed.

How does MD5() works?
MD5 function used cryptograhic hash function producing a 128-bit hash value which is 32 char long.
If checksum value of previous read  and current read of record is changed means data has been updated.


Pheeeew....zzzzzzz, hope I am successful by now in explaining you little bit about MD5.. Let me have a cup of coffee before next topic. :-)

Definitely we will check this practically... No Verbal Solution..... Do it Practically :-) 
Check Informatica Scenarios Label.
http://informaticachamp.blogspot.in/2014/03/scenario-10-implementing-scd1-using-md5.html

Thanks
hpandey.harsh@gmail.com

Feel Free to post you doubt as well as your marks regarding my efforts in comment section. :-)

4 comments:

  1. super explanation
    ...keep doing.

    ReplyDelete
  2. Thank you for the nice article but if you have used it in any project, can you please confirm if MD5 is perfect to implement CDC or there are any exceptions? Somewhere I read that if the MD5 value is different, then the Original and Current records are surely different, but if the MD5 values are same, it doesn't mean the Original and Current records are the same.

    ReplyDelete