Saturday, 22 March 2014

Scenario 16: How to handle comma in a comma delimited file?





Scenario 16: How to handle comma in a comma delimited file?

Theory Analysis: There are multiple cases in this. We will examine each case one by one. First of all, we need to do data profiling for our source file.

What is Data Profiling?
Data Profiling is a validation process through which we are analyzing the data whether it's fit for source or not. If not then what changes needs to be done to make it fit for source or even sometimes we need to push it back to client with reason that data is not in a format which would be treated like a source as per requirement.

CASE 1: More than one Comma [,] is appearing just after delimiter [,]
CASE 2: Comma is [,] appearing in a data itself. Ex: Address column has data like this 'CA,USA'
CASE 3: Any other Char like [+] Char is appearing anywhere in data. [,]

Practically

CASE 1: More than one Comma [,] is appearing just after delimiter [,]

Data
CID,  CNAME, CITY, CNTRY,  PHONE NUMBER
1234,,,SMITH,CA,USA,,,9845679000
1235,JOHN,NY,USA,9996805944
1236,NORMA,,,CA,USA,,,7804567893
1237,CARL,,,NY,USA,889950493

Solution: It's very easy to handle. Informatica has already provided an option to handle this case.

While Importing File metadata of source file in Flat File Import wizard Step 2 of 3 Check Treat consecutive delimiter as one 





Now, it's good to go ahead with source file.

CASE 2: Comma is [,] appearing in a data itself. Ex: Address column has data like this 'CA,USA'

Data
CID, CNAME, ADD, PHONE_NUMBER
1234,SMITH,CA,USA,9845679000
1235,JOHN,NY,USA,9996805944
1236,NORMA,CA,USA,7804567893
1237,CARL,NY,USA,8899504934

Solution: There are two approach to resolve this issue.

Approach 1: Through Script modify your source and put Single Quotes or Double Quotes for each and every value.

Data After Modification
CID, CNAME, ADD, PHONE_NUMBER
'1234','SMITH','CA,USA','9845679000'
'1235','JOHN','NY,USA','9996805944'
'1236','NORMA','CA,USA','7804567893'
'1237','CARL','NY,USA','8899504934'

While Importing File metadata of in Flat File Import wizard Step 2 of 3 select Radio Button of Text Qualifier section Single Quotes. (If you have modified your source with double quotes then please select double quotes)





Approach 2: Read complete file as fixed width and in the expression with the help of SUBSTR() and INSTR() function split into columns. Only one thing we need to take care while splitting third column data we have to start from second occurrence of comma till fouth occurrence of comma.
(In many other scenarios we have already completed this.)

A real time Secret :-) We are creating Reusable transformation for splitting records into column based upon delimited char and we are using that whenever we need.
A second case where we are dependent upon reusable transformation is for removing unwanted space from data while reading from files. Yes,  LTRM(RTRIM())




CASE 3: Any other Char like [+] Char is appearing anywhere in data. [,]
CID, CNAME, ADD, PHONE_NUMBER
1234,SMITH,CA,USA,+19845679000
1235,JOHN,NY,USA,+19996805944
1236,NORMA,CA,USA,+17804567893
1237,CARL,NY,USA,+18899504934

Solution: While Importing File metadata of in Flat File Import wizard Step 2 of 3 Check Remove escape character from data and give you Char + in Text Box.


BINGO!!!

Feel free to post your doubt in comment section.

Due to limitation, I can't attach exported .xml file which you can Import in your repository and see or analyse. (Feel free to mail me if you need .xml file with Scenario number and Question.)

If you like the approach and solution feel free to share with others.

Thanks.
hpandey.harsh@gmail.com

12 comments:

  1. Hi Harsha,
    Here in the Case3 we are eliminating (+) symbol but Before phone number we have +19866...........So we should eliminate 1 also here right? How can we achieve this?

    Thanks in Advance

    ReplyDelete
    Replies
    1. 1 is country code for US. Just like for India we have 91. It's totally depends upon your requirement if you don't want to load country code 1 then you have to replace it with NULL through REPLACECHAR function in Expression T/R in mapping but before doing this take care about data type as phone number is number so, first convert it to CHAR then use REPLACECHAR then once again convert it to NUMBER...

      Delete
  2. Hi Harsha ,
    Your solution for the Delimeter in the data looks fine , but i is a fix for root problem ..what if the situation that we dunno where the delimiter will come as a part of Data. How can we handle this as Generic Solution. Any Idea or suggestions ? Thanks ! Chetana Naidu

    ReplyDelete
    Replies
    1. Read/Implement CASE2, Approach 1. It will handle the comma if it's appears anywhere in the data.

      Thanks.

      Delete
    2. Hi,

      Reagrding the Case 2 Approach1,it was mentioned we have to add either single or double quotes for each and every through script.Can you help me out with that script in UNIX.

      Delete
  3. Hi Harsh ,
    Thanks for the reply . But again I have doubt like ..how will our script can distinguish ..whether its a data or delimeter ? Thanks! Chetana Naidu

    ReplyDelete
  4. Hope you have read the post carefully. In the beginning of the post I have explained little bit about Data Profiling. So, you have to do the data profiling before creating your script.
    Thanks.

    ReplyDelete
  5. if there is multipal delimetes what will do ? just like |,.,'

    ReplyDelete
  6. Hello Harsha,

    My source table is from data warehouse. My target is a .csv file. The source column data contains Commas and it is resulting in additional fields in my .csv file due to delimiter as comma.

    I have created my target flatfile in Informatica and not imported from local. So approach-2 will not be helpful for me.

    can you help me out how could i eliminate comma from column data.

    I have tried using REPLACECHAR() function in Source qualifier query, but its throwing me a invalid identifier error. Should i use this function in expression transformation instead of SQ query??

    Appreciate your quick reply.

    Thanks,
    Nagarjuna

    ReplyDelete
  7. Source: Multi record format
    Target: 3 tables Base, J1, J2

    1 row contains multiple fixed width records and dynamic in nature, need to load data into target tables.

    Eg:- (Base Record 100 length [c1 string 50, c2 string 50] +J1 Record 200 length [c1 string 50, c2 string 50, c3 string50, c4 string 50]+J2 Record length 150 [c1 string 50, c2 string 100])







    Base Record + J1 + J2+ J1+ J2+J1 -> 1 Record goes to base table, 3 records goes to J1 table , 2 records goes to J2 table
    Base Record + J2 + J2+ J1+ J2+J1 -> 1 Record goes to base table, 2 records goes to J1 table , 3 records goes to J2 table


    ReplyDelete
  8. Can you please help me on this.I'm new to informatica.

    ReplyDelete
  9. I feel Informatica is vivid and deep to introspect and hence I genuinely feel one should look up more about it.

    Informatica Read JSON

    ReplyDelete